Project description

The topic of this project is the relationship between music and speech, in particular improvised music and everyday conversation. From a creative musician's point of view, it explores how features of speech can be used as a source for improvised music. The main method for this exploration has been the development of a new digital musical instrument for “orchestrating” speech into music in real time. The artistic aim has been to develop an alternative improvisational foundation for making music, one closely related to the genuine human musicality inherent in spoken language.

Context: language and speech

In a very general sense, the project is based on the idea that spoken language and music are closely related and probably share evolutionary origins, and that some aspects of creating and experiencing music can be related to the communicative role of musical features in speech. Such ideas have been explored from several perspectives in a growing literature on the evolutionary origins and functions of art and ritual in recent decades, as in the interdisciplinary field mapped out by Ellen Dissanayake and others through compilations like Communicative Musicality (Malloch & Trevarthen, 2009).

The topic, then, is not what we say, but how we say it – how intonation, register, tempo, rhythm, dynamics, and voice quality form a communicative layer of its own in speech. This is what linguists call prosody (from Greek: “towards song”), and this music of everyday speech constitutes a huge semantic potential that, knowingly or not, expresses our state of mind, our intentions, expectations, attitudes, relations, feelings, notions and views, and which in hermeneutical ways affects how our utterances are interpreted.

As the linguistic fields of prosody and conversation analysis show, these features have obvious pragmatic functions in structuring conversation (Szczepek Reed, 2011; Wennerstrom, 2001). But while linguists look at how such prosodic features are used for negotiating turn-taking or highlighting new information, from a musical point of view it is interesting to see whether these structures – nuanced and intuitively meaningful vocal gestures – can also make sense as recognizable patterns in music. In this regard, improvised music can be viewed as a particularly close parallel to conversation, as both involve a continuous dialogical negotiation of content and development, interpreting intentions with many of the same means and mechanisms.

Another interesting aspect of speech from a musical point of view is how prosodic features express social relationships. This is what the philosopher Mikhail Bakhtin referred to when discussing speech genres: stylistic templates that we tend to use as formal frameworks when constructing utterances on the fly (Bakhtin, 1986). Just like literary genres, they include choice of style and wording, but speech genres also include specific prosodic traits such as the use of certain registers, dynamic ranges, vocal effort, tempi, etc. Taken together, such traits form a musical character that communicates something about the social situation and thus provides a context for interpretation. For instance, the metric regularity of speech conveys something about social distance, with very dynamic ‘tempo rubato’ being typical of close, subjective, private conversations, while strictly regular timing is used when referring to something objective, impersonal and formal (van Leeuwen, 1999). Other significant genre characteristics typically include speech rate (tempo), register (mean pitch), voice quality, loudness, phrase and pause length, dynamic variation, etc.

We intuitively use different genres in different social situations, such as talking to children, to a judge, to a lover, to an audience or a reporter on TV. The genres are a natural part of the everyday social characters we take on, and only stick out when used differently from what is expected, like the patronizing effect of talking to adults as if they were children. According to Bakhtin, there are as many potential speech genres as there are social relations. Small talk, pillow talk, baby talk, interrogation, public address, report, confession, etc. are only some examples of speech genres where the form conveys an important part of the social meaning of an utterance.

Bakhtin’s emphasis on genres derives from his view of language that words do not have any meaning by themselves – it is how they are used in an utterance, in a specific context, that actually provides the meaning. Thus speech genres convey a social meaning that we seem very attentive to, and that is expressed essentially through musical parameters.

A musical exploration of such speech genres is one of the main themes in this project.

Context: music

Making the connection between music and language is nothing new. In Europe during the 17th century in particular, music was increasingly seen in connection with Antiquity’s highly developed art of rhetoric (Bartel, 1997; Bonds, 1991). Music theory books from the period show how much these ideas of rhetoric influenced German baroque music (Mattheson, 1739), and that this music “speaks” is something that early music pioneers have later pointed to as a key to interpreting and performing it (Harnoncourt, 1982). Speech has never been far away in the recitatives of opera either, and it is present in the music of composers like Janáček, with his speech melodies, and in the Sprechgesang later developed by composers such as Schönberg, Berg and Webern.

Nevertheless, it is primarily during the last 60-70 years that the availability of sound recording technology has made possible a much more extensive musical exploration of speech and the voice. British composer and academic Cathy Lane has given an overview of the many approaches and contributors in this field (Lane, 2006), many of which also feature in the compilation “Playing with words” (Lane, 2008).

Lane identifies several distinct approaches to using speech and voice in music: purely documentary pieces, montages (e.g. the radiophonic pieces of Glenn Gould), performative explorations of language and the voice (Aperghis, Berio, Ligeti), sound poetry (Schwitters, Jaap Blonk), different ways of electronically transforming recorded speech and song (Herbert Eimert and Stockhausen, among others) and the use of speech fragments as melodic motifs (Steve Reich). Trevor Wishart in particular has explored many aspects of the voice in his compositions, such as sonic transformations (Red Bird), the voice as an icon of personality and identity (Two Women, American Triptych) and phonetic elements as musical material (Tongues of Fire, Globalalia), and has also written extensively on composing with the expressivity of the human voice (Wishart, 1994, 1996, 2012). Other approaches include the connection between sound and text explored in the Swedish tradition of text-sound composition after the likes of Lars-Gunnar Bodin (Brunson, 2009). Others have used speech directly as a melodic source, such as Paul Lansky, Paul DeMarinis, Robert Ashley, Scott Johnson, Florent Ghys, Jacob ter Veldhuis, Michael Vincent (Vincent, 2010) and jazz pianist Jason Moran. Still others have transcribed speech in various ways into acoustic instrumental music, like the spectral analyses of speech used by Jonathan Harvey in his 2008 orchestral piece “Speaking”.

Recently, interesting new technological approaches have also been developed, such as the analysis, modelling and transformation of speech expressivity in relation to gesture by Grégory Beller and others in the speech research community at IRCAM (Beller, Schwarz, Hueber, & Rodet, 2005; Beller, 2009).

Relevant to my project’s emphasis on improvisation are the music and research of pianist Sten Sandell, who from the perspective of a performer has used speaking as an integral part of improvised piano performances (Sandell, 2011, 2013).

Another relevant reference is the music of Peter Ablinger, especially his cycle of “Voices and Piano” pieces and his use of a mechanical piano to render speech.


These are some of the multitude of ways that speech has been used in music. Since speech and music are universal human phenomena, and thus can be related to many different aspects of human experience, a large number of interesting perspectives are possible. So even if the subject of speech is similar, the particular focus can be quite different.

The focus of Ablinger, for instance, is on the representation of reality. He has described his use of the mechanical piano as imposing a grid on sonic reality, a phonorealistic music analogous to photorealistic painting (Ablinger, n.d.). His voice pieces have the additional character of musical portraits of famous historical persons, placing the emphasis on personal idiosyncrasies, individual stories and shared cultural history. Sten Sandell, on the other hand, focuses on the act of speaking primarily as a performer, equating speaking with playing as two possible outcomes of the same improvisational impulse. While Wishart has treated a particularly wide range of aspects of speech in his compositions, his focus is often on sound and the voice as a much wider phenomenon than speech alone. In the piece “Encounters in the Republic of Heaven”, which with its focus on everyday speech comes close to the approach of this project, there is also the overall concept of a voice portrait of the local community in Yorkshire.

To explain my own musical approach to speech, it is perhaps necessary to outline my musical background. Educated as a performer on the piano and Hammond organ, I have worked mostly as an improviser and composer in jazz and contemporary music genres. One background for this project has been my experience as a performer that many of the things going on in improvised interplay are quite similar to the dynamics of spoken conversation. Not only analogous or metaphorically similar, but at times actually the same, like the linguistic concept of backchannels – short responses such as “uh huh” or “yeah” that affirm and acknowledge that one follows the contributions of fellow speakers, similar to the comping figures often used for the same purpose in jazz improvisation. In a previous project, developing a personal contemporary idiom for the Hammond organ, I used improvised musical dialogues with unusual instrument combinations as a method for provoking ideas and coming to new musical conclusions.

(Hammond B3 organ with string trio)
Excerpt from Hammond Dialogues vol 2: Twined

That experience led to the idea of using actual speech as material in improvised music, to explore this connection further and see how it could affect the perception of both speech and music. Rather than using stylized forms like recited lyrics or speeches, the focus of this project has therefore been on the improvised interaction of real-life conversations. An additional focus has been on speech genre as social context and musical character, and one aim has been to draw out the connection between conversation and musical improvisation as similar modes of communicative interplay. Another important focus that emerged during the project was how these topics could be integrated into an appropriate performance concept, bridging the sound realms of acoustic instrumental performance and the virtual electroacoustic soundscapes of recorded speech.

This project can contribute a new perspective through its focus on improvisation in relation to everyday conversation, and by bringing this material back into improvised musical dialogues, thus highlighting improvisation as a discourse and a language-like process in both music and conversation.

Context: technology

There is also a technological context for a project that has chosen to develop a new digital musical instrument as its main method for exploring the speech/music relationship. Central to this exploration have been different ways of extracting and abstracting the musical traits of speech, generalizing them as sonic gestures.

A wide range of available approaches to processing and managing sound in general, and speech in particular, is relevant in this regard – primarily from the field of electronic music, but also from speech signal processing as used in telephony, automatic speech recognition, speech modelling and linguistics. As much more detailed descriptions of the common signal processing techniques can be found in other well-written sources like “The Computer Music Tutorial” by Curtis Roads (Roads, 1996), only brief descriptions will be given here.

Signal Processing

Particularly useful when dealing with a defined source material like speech is the analysis/resynthesis approach, which includes a range of different ways to analyse, process and then resynthesize sound from a given source (Roads, 1996).

Linear Predictive Coding

One such approach builds on the source-filter model of speech production, which has proved useful in speech signal processing (Fant, 1960). According to this model, speech is treated as a combination of a source signal (the pulse train of the vibrating vocal cords/glottis) and an acoustic filter (the vocal tract, including throat, mouth, lips and nasal cavity). The filter part has typically been approximated with the linear predictive coding (LPC) analysis technique, which estimates the filter spectrum by exploiting the difference between the (relatively) slow movements of the filter and the much faster pulses of the glottis (Atal & Hanauer, 1971). This simplified model of the speech organs allows voicing to be analysed and treated separately from the formants – the peaks in the spectrum that characterise different vowels. Because of its compact representation, LPC has long been used to synthesise and encode speech at low bitrates, for instance in voice over IP (VoIP) telephony. The technique was also picked up early by composers working with speech, such as Charles Dodge and Paul Lansky.
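To illustrate the principle (this is a minimal numpy sketch with arbitrary toy parameters, not the code of the instrument itself), the predictor coefficients can be estimated with the classic autocorrelation method, after which the all-pole spectrum of the resulting error filter reveals the resonances:

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Estimate linear-prediction coefficients for one frame via the
    autocorrelation method: solve the Yule-Walker normal equations."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]  # lags 0..order
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])  # predictor: x[t] ~ sum(a[k] * x[t-1-k])

# Toy frame: one resonance at 700 Hz plus a little noise
# (the noise keeps the autocorrelation matrix well-conditioned)
sr, n = 8000, 512
t = np.arange(n) / sr
rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 700 * t) * np.hanning(n) + 0.01 * rng.standard_normal(n)

a = lpc_coefficients(frame, order=8)
# The all-pole spectrum 1/|A(f)| of the error filter peaks near the resonance
A = np.fft.rfft(np.concatenate(([1.0], -a)), 1024)
peak_hz = np.fft.rfftfreq(1024, 1 / sr)[np.argmax(1.0 / np.abs(A))]
```

Running this recovers a spectral peak close to 700 Hz; a real speech frame would instead yield several broader formant peaks.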

Fourier Transform

An analysis/resynthesis technique more commonly used in musical applications is the short-time spectrum, typically obtained through the fast Fourier transform (FFT), which transforms short slices of sound waves (air pressure variations in the time domain) into frames of frequencies and amplitudes (the frequency domain). This information can in turn be used for additive synthesis of individual partials, allowing a wide range of processing techniques that affect both partial frequencies and spectral shape (Roads, 1996).
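The chain can be sketched in a few lines of numpy: analyse one windowed frame, pick spectral peaks as partials, and additively resynthesize them, here with an arbitrary transposition. The toy signal, peak threshold and transposition factor are illustrative assumptions:

```python
import numpy as np

sr, n = 16000, 2048
t = np.arange(n) / sr
# Toy "voiced" frame: three harmonics of 200 Hz with falling amplitudes
frame = (1.0 * np.sin(2 * np.pi * 200 * t)
         + 0.5 * np.sin(2 * np.pi * 400 * t)
         + 0.25 * np.sin(2 * np.pi * 600 * t))

# Analysis: window the slice and transform it to the frequency domain
mags = np.abs(np.fft.rfft(frame * np.hanning(n)))
freqs = np.fft.rfftfreq(n, 1 / sr)

# Crude peak picking: local maxima above a relative magnitude threshold
peaks = [i for i in range(1, len(mags) - 1)
         if mags[i - 1] < mags[i] > mags[i + 1] and mags[i] > 0.05 * mags.max()]
partials = sorted(sorted(peaks, key=lambda i: mags[i], reverse=True)[:3])
partial_freqs = freqs[partials]

# Additive resynthesis of the detected partials, transposed up a fifth
resynth = sum(mags[i] / mags.max() * np.sin(2 * np.pi * freqs[i] * 1.5 * t)
              for i in partials)
```

The detected partial frequencies land within one FFT bin of the true 200, 400 and 600 Hz components; real analysis frameworks refine this with phase information and parabolic peak interpolation.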


Another way of approaching the source-filter division is the technique of cepstral smoothing (Smith III, 2014). In this technique, another Fourier transform is performed on the (log-scale) short-time spectrum itself, resulting in a spectrum of the spectrum. This imaginary domain has been dubbed the cepstrum, which is just an anagram of the word ‘spectrum’ (Bogert, Healy, & Tukey, 1963). One can view the cepstrum as a description of the shape of the original spectrum, as if the spectrum were a signal frame in the time domain. Filtering out the higher bins (called quefrencies) of this cepstrum and inverse-Fourier-transforming it back to the spectral domain results in a smoothed spectrum (less jagged and with fewer peaks), which like the LPC spectrum can be used as a filter or for detecting formants.
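As a sketch of the idea (the cutoff quefrency and test signal are arbitrary assumptions, not values from the project), cepstral smoothing amounts to low-pass “liftering” the cepstrum and transforming back:

```python
import numpy as np

def cepstral_smooth(frame, n_quefrencies=30):
    """Spectral envelope via cepstral smoothing: log spectrum ->
    cepstrum -> keep only the low quefrencies -> back to a smooth spectrum."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)              # the "spectrum of the spectrum"
    lifter = np.zeros_like(cepstrum)
    lifter[:n_quefrencies] = 1.0                  # keep the slow spectral shape...
    lifter[-(n_quefrencies - 1):] = 1.0           # ...and its mirror half
    smoothed_log = np.fft.rfft(cepstrum * lifter).real
    return np.exp(smoothed_log)                   # smoothed magnitude spectrum

# A buzzy tone with 30 harmonics of 100 Hz has many spectral peaks;
# the smoothed version keeps only the overall envelope
sr, n = 8000, 2048
t = np.arange(n) / sr
frame = sum(np.sin(2 * np.pi * 100 * k * t) / k for k in range(1, 31))
env = cepstral_smooth(frame)
```

The harmonic ripple of a 100 Hz voice sits at quefrency sr/100 = 80 samples, well above the 30-sample cutoff, so it is removed while the envelope survives.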

Another cepstrum-based technique that must be mentioned in relation to speech processing is that of mel-frequency cepstral coefficients, known under the acronym MFCC (Mermelstein, 1976). The MFCCs form the cepstrum of the mel spectrum, a spectrum with an alternative frequency scale better suited to representing the formant regions most important for speech. They are a very robust and compact way of describing only those parts of the spectrum that are important for discerning phonemes, and are therefore very common in automatic speech recognition applications. The MFCC technique is very powerful as a spectral descriptor, but in the analysis/synthesis approach adopted in this project it has been used only tentatively, mostly in relation to syllable segmentation.
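For completeness, a minimal mel-filterbank cepstrum can be sketched as follows; the filter counts and the mel formula are common textbook choices, not necessarily those used in this project:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    """Minimal MFCC: power spectrum -> triangular mel filterbank ->
    log filter energies -> DCT-II (the cepstrum of the mel spectrum)."""
    n = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hanning(n))) ** 2
    freqs = np.fft.rfftfreq(n, 1 / sr)
    # Filter edges equally spaced on the mel scale from 0 Hz to Nyquist
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_filters + 2))
    energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)       # triangle rises lo -> mid
        falling = (hi - freqs) / (hi - mid)      # and falls mid -> hi
        weights = np.clip(np.minimum(rising, falling), 0.0, 1.0)
        energies[i] = weights @ power
    log_e = np.log(energies + 1e-12)
    # DCT-II decorrelates the log energies; keep only the first coefficients
    k = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), k + 0.5) / n_filters)
    return basis @ log_e
```

Scaling the input only shifts all log energies by a constant, which lands entirely in the first DCT coefficient – one reason MFCCs are so robust as descriptors of spectral shape regardless of overall level.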


Regarding the synthesis stage of this approach, the processed results of such analyses can likewise be resynthesized into sound in several ways. In particular, the overlap-add (OLA) technique – resynthesizing signal slices obtained through the inverse FFT of spectral frames – has proved an efficient way of synthesizing large numbers of partials and noise components at the same time (Rodet & Schwarz, 2007). This technique, together with its pitch-synchronous variant (PSOLA) (Moulines & Charpentier, 1990), has allowed for a wide range of transformations and abstractions within the same input/output chain in this project.
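The round trip can be illustrated in a few lines of numpy (toy parameters, plain OLA without the pitch-synchronous alignment of PSOLA): with Hann windows at 50% overlap the analysis windows sum to almost exactly one, so inverse-FFT frames overlap-added at the same hop reconstruct the signal.

```python
import numpy as np

def overlap_add(spectral_frames, hop):
    """OLA resynthesis: inverse-FFT each spectral frame and sum the
    resulting slices at successive hop positions."""
    n = (len(spectral_frames[0]) - 1) * 2            # frame length behind rfft
    out = np.zeros(hop * (len(spectral_frames) - 1) + n)
    for i, spec in enumerate(spectral_frames):
        out[i * hop:i * hop + n] += np.fft.irfft(spec)
    return out

sr, n, hop = 8000, 512, 256
x = np.sin(2 * np.pi * 440 * np.arange(4096) / sr)
win = np.hanning(n)   # 50%-overlapped Hann windows sum to ~1
frames = [np.fft.rfft(x[p:p + n] * win) for p in range(0, len(x) - n + 1, hop)]
y = overlap_add(frames, hop)
# Away from the fade-in/out at the edges, y matches x closely;
# processing the frames before resynthesis yields the transformations
```

Any spectral processing (filtering, envelope substitution, partial transposition) is applied to the frames between analysis and this resynthesis step.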

Corpus approach and machine learning

In addition to such signal processing techniques, some overall approaches to organising recordings and data have also been influential in the development of this instrument. In the statistical approaches widely adopted in corpus linguistics and speech recognition applications, large numbers of recordings are organised as whole bodies – corpora – of analysed segments. By looking at the corpus as a whole, the relationships between its elements can more easily be explored. Such approaches have been applied successfully in digital musical instruments as well, as in the audio mosaicking and concatenative synthesis techniques developed in Diemo Schwarz’s “CataRT” instrument (Schwarz, Beller, Verbrugghe, & Britton, 2006).
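The core selection step of such corpus-based synthesis can be sketched very simply; the descriptor names and values below are illustrative assumptions, not the actual analysis schema of CataRT or of this project:

```python
import numpy as np

# Toy corpus: each analysed segment reduced to a descriptor vector
corpus = np.array([
    # pitch (Hz), loudness (dB), duration (s)
    [220.0, -20.0, 0.30],
    [330.0, -12.0, 0.15],
    [440.0, -18.0, 0.25],
    [196.0, -25.0, 0.40],
])

def select_unit(target, corpus, weights):
    """Concatenative selection: index of the corpus segment closest to
    the target descriptors, by weighted Euclidean distance."""
    diffs = (corpus - target) * weights
    return int(np.argmin((diffs ** 2).sum(axis=1)))

# Ask for a loud segment near 320 Hz; weight duration down as less important
target = np.array([320.0, -10.0, 0.20])
weights = np.array([1.0, 1.0, 0.1])
best = select_unit(target, corpus, weights)   # selects the 330 Hz segment
```

In a playable instrument, the target vector would be driven in real time by a controller or by live analysis of incoming sound, and the selected segments concatenated into the output stream.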

An extension of the database approach is the use of machine learning, typically found in automatic speech recognition. Machine learning is a huge field in itself, also for pattern recognition and generation in interactive music systems. It was never meant to be the main focus of this project, but as I have described in more detail in this blog post on machine learning, it has proved a useful influence, introducing improvisational elements like interactivity and the unknown response into the project.



Ablinger, P. (n.d.). Voices and Piano program note. Retrieved December 1, 2017, from

Atal, B. S., & Hanauer, S. L. (1971). Speech Analysis and Synthesis by Linear Prediction of the Speech Wave. Journal of the Acoustical Society of America, 50, 637–655.

Bakhtin, M. M. (1986). The Problem of Speech Genres. In Speech Genres and Other Late Essays (pp. 60–102). Austin: University of Texas Press.

Bartel, D. (1997). Musica poetica: Musical-rhetorical figures in German Baroque music. Lincoln: University of Nebraska Press.

Beller, G. (2009). Analyse et Modèle Génératif de l’Expressivité, Application à la Parole et à l’Interprétation Musicale. (Doctoral thesis). Université Paris VI, Paris.

Beller, G., Schwarz, D., Hueber, T., & Rodet, X. (2005). Hybrid Concatenative Synthesis On The Intersection of Music and Speech. In Journées d’Informatique Musicale (pp. 41–45).

Bogert, B. P., Healy, M. J. R., & Tukey, J. W. (1963). The quefrency alanysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking. In Proceedings of the symposium on time series analysis (Vol. 15, pp. 209–243).

Bonds, M. E. (1991). Wordless Rhetoric. Musical Form and the Metaphor of the Oration. Cambridge, Mass: Harvard University Press.

Brunson, W. (2009). Text-Sound Composition – The Second Generation. In Proc. of EMS-09 Conference on Electronic Music Studies.

Fant, G. (1960). Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations. The Hague, Netherlands: Mouton.

Harnoncourt, N. (1982). Musik als Klangrede: Wege zu einem neuen Musikverständnis. Salzburg: Residenz.

Lane, C. (2006). Voices from the Past: compositional approaches to using recorded speech. Organised Sound, 11(1), 3–11.

Lane, C. (Ed.). (2008). Playing with Words. London: CRiSAP.

Leeuwen, T. van. (1999). Speech, Music, Sound. London: Macmillan Press.

Malloch, S., & Trevarthen, C. (Eds.). (2009). Communicative musicality: Exploring the basis of human companionship. Oxford: Oxford University Press.

Mattheson, J. (1739). Der vollkommene Capellmeister. Hamburg: Christian Herold.

Mermelstein, P. (1976). Distance measures for speech recognition, psychological and instrumental. Pattern Recognition and Artificial Intelligence.

Moulines, E., & Charpentier, F. (1990). Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9(5–6), 453–467.

Roads, C. (1996). The computer music tutorial. Cambridge, Mass.: MIT Press.

Rodet, X., & Schwarz, D. (2007). Spectral Envelopes and Additive + Residual Analysis/Synthesis. In Analysis, Synthesis, and Perception of Musical Sounds (pp. 175–227).

Sandell, S. (2011). Music inside the Language [CD]. Steninge, Sweden: LJ Records.

Sandell, S. (2013). På insidan av tystnaden: en undersökning. (Doctoral thesis). Konstnärliga fakulteten, Göteborgs universitet, Göteborg.

Schwarz, D., Beller, G., Verbrugghe, B., & Britton, S. (2006). Real-time corpus-based concatenative synthesis with CataRT. In Proc. of the 9th Int. Conference on Digital Audio Effects (DAFx-06) (pp. 279–282). Montreal, Canada.

Smith III, J. O. (2014). Cross Synthesis Using Cepstral Smoothing or Linear Prediction for Spectral Envelopes. Retrieved December 4, 2017, from

Szczepek Reed, B. (2011). Analysing conversation: An introduction to prosody. Basingstoke: Palgrave Macmillan.

Vincent, M. (2010). Music & Language Interrelations. (Doctoral thesis). University of Toronto.

Wennerstrom, A. (2001). The Music of Everyday Speech: Prosody and Discourse analysis. Oxford: Oxford University Press.

Wishart, T. (1994). Audible design: A plain and easy introduction to practical sound composition. Orpheus the Pantomime.

Wishart, T. (1996). On sonic art. New York: Routledge.

Wishart, T. (2012). Sound Composition. Orpheus the Pantomime.
