Technical background

This appendix provides some additional technical background for the particular signal processing techniques implemented in the software instrument system, as described in the instrument overview.

Signal Processing

A wide range of approaches for managing and processing sound in general, and speech in particular, is relevant in this regard. They come primarily from the field of electronic music, but also from the speech signal processing used in telephony, speech recognition, speech modelling and linguistics. As much more detailed descriptions of common signal processing techniques can be found in other well-written sources, such as “The Computer Music Tutorial” by Curtis Roads, only brief descriptions are given here.


Particularly useful when dealing with a defined source material like speech is the analysis/resynthesis approach, which includes a range of different ways to analyse, process and then resynthesize sound from a given source (Roads, 1996).

Linear Predictive Coding

One such approach is the source-filter model of speech production, which has proved useful in speech signal processing (Fant, 1960). According to this model, speech is treated as a combination of a source signal (the pulse train of the vibrating vocal cords, or glottis) and an acoustic filter (the vocal tract, including throat, mouth, lips and nasal cavity). The filter part has typically been approximated with the linear predictive coding (LPC) analysis technique, which estimates the filter spectrum by exploiting the difference between the (relatively) slow movements of the filter and the much faster pulses of the glottis (Atal & Hanauer, 1971). This simplified model of the speech organs allows voicing to be analysed and treated separately from the formants – the peaks in the spectrum that characterise different vowels. Because of its compact representation, LPC has long been used to encode and synthesise speech digitally at a low bitrate, for instance in voice over IP (VoIP) telephony. The technique was also picked up early by composers working with speech and computers, such as Charles Dodge and Paul Lansky.
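As an illustration of the LPC idea, the following is a minimal sketch of the textbook autocorrelation method with the Levinson-Durbin recursion. It is not the implementation used in the instrument; the function names, the filter order and the synthetic test signal are assumptions chosen for the example.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Estimate the prediction-error filter A(z) = 1 + a1 z^-1 + ... + ap z^-p
    from one frame, using the autocorrelation method (Levinson-Durbin)."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation for lags 0..order
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]  # prediction error energy, updated at every order
    for i in range(1, order + 1):
        # Reflection coefficient for order i
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def lpc_envelope(a, n_fft):
    """Magnitude response 1/|A| of the all-pole filter, up to a gain factor."""
    return 1.0 / np.abs(np.fft.rfft(a, n=n_fft))

# Hypothetical test signal: white noise passed through a known two-pole filter.
rng = np.random.default_rng(0)
noise = rng.standard_normal(20000)
signal = np.zeros_like(noise)
for n in range(2, len(signal)):
    signal[n] = 1.3 * signal[n - 1] - 0.6 * signal[n - 2] + noise[n]

# coeffs should approximate [1, -1.3, 0.6], the filter that shaped the noise
coeffs, residual_energy = lpc_coefficients(signal, order=2)
envelope = lpc_envelope(coeffs, n_fft=512)
```

For speech, the order would be much higher than 2 (a rule of thumb is roughly the sampling rate in kHz plus a few coefficients), and the analysis would be repeated on short windowed frames so that the envelope can follow the movements of the vocal tract.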

Fourier Transform

An analysis/resynthesis technique more commonly used in musical applications is the short-time spectrum, typically obtained through the fast Fourier transform (FFT), which transforms short slices of sound waves (air pressure variations in the time domain) into frames of frequencies and amplitudes (amplitude differences in the frequency domain). This information can in turn be used for additive synthesis of individual partials, allowing a wide range of processing techniques that affect both partial frequencies and spectral shape (Roads, 1996).
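A short-time spectrum of this kind can be sketched in a few lines. The frame size, hop size, Hann window and the 1 kHz test tone below are assumptions made for the example, not settings taken from the instrument.

```python
import numpy as np

def stft_frames(signal, frame_size=1024, hop=256):
    """Slice a signal into overlapping windowed frames and return
    (magnitudes, phases), one row per frame, one column per frequency bin.
    Assumes len(signal) >= frame_size."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([
        signal[i * hop:i * hop + frame_size] * window
        for i in range(n_frames)
    ])
    spectra = np.fft.rfft(frames, axis=1)
    return np.abs(spectra), np.angle(spectra)

# Hypothetical input: one second of a 1 kHz sine at 16 kHz sampling rate.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

mags, phases = stft_frames(tone)
peak_bin = int(np.argmax(mags[0]))
peak_hz = peak_bin * sample_rate / 1024  # bin index converted to frequency
```

Each row of `mags` is one frame of the short-time spectrum. Peaks in such a frame are the candidates that a partial-tracking or additive resynthesis stage would work from.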


Another way of approaching the source-filter division is the technique of cepstral smoothing (Smith III, 2014). In this technique, another Fourier transform is performed on a log-scale representation of the short-time spectrum itself, resulting in a kind of spectrum of the spectrum. This notional domain has been dubbed the cepstrum, an anagram of the word ‘spectrum’ (Bogert, Healy, & Tukey, 1963). One can view the cepstrum as a description of the shape of the original spectrum, as if the spectrum were a signal frame in the time domain. Filtering out the higher bins (called quefrencies) of this cepstrum and transforming the result back to the spectral domain gives a smoothed spectrum (less jagged and with fewer peaks), which, like the LPC spectrum, can be used as a filter or for detecting formants.
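The smoothing step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the number of retained quefrency bins (`n_keep`) and the noise test frame are hypothetical, and the sketch uses the common convention of an inverse FFT into the cepstrum and a forward FFT back, which for this real, symmetric data differs from two forward transforms only by scaling.

```python
import numpy as np

def cepstral_envelope(frame, n_keep=30):
    """Return (log_magnitude, smoothed_log_magnitude) for one frame,
    keeping only the n_keep lowest quefrency bins of the real cepstrum."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    spectrum = np.fft.fft(frame * np.hanning(n))
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small offset avoids log(0)
    cepstrum = np.fft.ifft(log_mag).real        # "spectrum of the spectrum"
    # Low-pass "lifter": keep low quefrencies and their mirrored counterparts
    lifter = np.zeros(n)
    lifter[:n_keep] = 1.0
    lifter[n - n_keep + 1:] = 1.0
    smoothed = np.fft.fft(cepstrum * lifter).real
    half = n // 2 + 1                           # non-negative frequencies only
    return log_mag[:half], smoothed[:half]

# Hypothetical input: a frame of white noise, whose raw spectrum is very jagged.
rng = np.random.default_rng(1)
frame = rng.standard_normal(1024)
raw, smooth = cepstral_envelope(frame, n_keep=30)
```

A smaller `n_keep` gives a smoother envelope. For voiced speech, the cutoff is usually placed below the quefrency of the pitch period, so that the harmonic ripple is removed while the formant shape is retained.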

Another cepstrum-based technique that must be mentioned in relation to speech processing is that of mel-frequency cepstral coefficients, known under the acronym MFCC (Mermelstein, 1976). An MFCC representation is the cepstrum of the mel spectrum, which is a spectrum with an alternative frequency scale better suited to representing the formant regions most important for speech. It is a very robust and compact way of describing only those parts of the spectrum that are important for discerning phonemes, and is therefore very common in automatic speech recognition applications. The MFCC technique is very powerful as a spectral descriptor, but in the analysis/synthesis approach adopted in this project it has been used only tentatively, and mostly explored in relation to syllable segmentation.
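The chain from spectrum to MFCCs can be sketched as below. This is one common textbook formulation rather than the project's own code: the mel formula with the constants 2595 and 700, the triangular filterbank, the number of filters and coefficients, and all function names are assumptions for the example.

```python
import numpy as np

def hz_to_mel(hz):
    # One widely used mel formula (an assumption; other variants exist)
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges_hz = mel_to_hz(np.linspace(0.0, top, n_filters + 2))
    bin_hz = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    bank = np.zeros((n_filters, len(bin_hz)))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i:i + 3]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(frame, sample_rate, n_filters=26, n_coeffs=13):
    """Cepstrum of the log mel spectrum of one frame."""
    frame = np.asarray(frame, dtype=float)
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hanning(n_fft))) ** 2
    log_mel = np.log(mel_filterbank(n_filters, n_fft, sample_rate) @ power + 1e-12)
    # DCT-II of the log mel energies, written out as a matrix product
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n_filters)[None, :]
    return np.cos(np.pi * k * (m + 0.5) / n_filters) @ log_mel

# Hypothetical input: a 512-sample frame of a 220 Hz sine at 16 kHz.
frame = np.sin(2 * np.pi * 220 * np.arange(512) / 16000)
coeffs = mfcc(frame, sample_rate=16000)
```

The compactness mentioned above shows in the output: a 512-sample frame is reduced to 13 numbers, which is also why the representation is hard to invert and is better suited to description and segmentation than to resynthesis.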


Regarding the synthesis stage of this approach, resynthesizing the processed results from such analyses back into sound can be done in several ways as well. In particular, the overlap-add (OLA) technique of resynthesizing signal slices, obtained through the inverse FFT of spectral frames, has proved an efficient way of synthesizing large numbers of partials and noise components at the same time (Rodet & Schwarz, 2007). This technique, in addition to its pitch-synchronous variant (PSOLA) (Moulines & Charpentier, 1990), allows for a wide range of possible transformations and abstractions of the same input/output-chain and is one of the main synthesis techniques used in this system.
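A basic overlap-add resynthesis chain can be sketched as follows. It shows plain OLA of inverse-FFT frames only, not the pitch-synchronous variant, and it is a sketch under assumptions: the function names, the frame and hop sizes, the Hann window and the optional `process` callback are hypothetical.

```python
import numpy as np

def overlap_add(frames, hop):
    """Sum overlapping time-domain frames into one output signal."""
    n_frames, frame_size = frames.shape
    out = np.zeros((n_frames - 1) * hop + frame_size)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_size] += frame
    return out

def analyse_resynthesise(signal, frame_size=1024, hop=256, process=None):
    """FFT analysis, optional spectral processing, inverse FFT and overlap-add.
    Assumes len(signal) >= frame_size."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_size + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(signal) - frame_size) // hop
    out_frames = np.empty((n_frames, frame_size))
    for i in range(n_frames):
        spectrum = np.fft.rfft(signal[i * hop:i * hop + frame_size] * window)
        if process is not None:
            spectrum = process(spectrum)      # e.g. reshape the spectral envelope
        out_frames[i] = np.fft.irfft(spectrum, n=frame_size) * window
    # The window is applied twice, so normalise by the summed squared windows
    norm = overlap_add(np.tile(window ** 2, (n_frames, 1)), hop)
    return overlap_add(out_frames, hop) / np.maximum(norm, 1e-8)

# Hypothetical input: one second of noise; without processing, the interior
# of the output reproduces the input.
rng = np.random.default_rng(2)
source = rng.standard_normal(16000)
resynth = analyse_resynthesise(source)
```

Passing a function as `process`, for instance `lambda s: s * gain_curve` with a hypothetical per-bin `gain_curve`, turns the same chain into a spectral filter, which is the sense in which one input/output chain supports many transformations.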

Corpus approach and machine learning

Segments as coordinates in a space of mean pitch (vertical), speech rate (horizontal) and vocal effort (colour)


In addition to such signal processing techniques, some overall approaches for organising recordings and data have also been influential in the development of this instrument. In the statistical approaches widely adopted in corpus linguistics and speech recognition applications, large numbers of recordings are organised as whole bodies – corpora – of analysed segments. By looking at the corpus as a whole, the relationships between its elements can more easily be explored. Such approaches have been applied successfully in digital musical instruments as well, as in the audio mosaicking and concatenative synthesis techniques developed in Diemo Schwarz’ “CataRT” instrument (Schwarz, Beller, Verbrugghe, & Britton, 2006).

This opens up a much more musical way of using the material, with a variable degree of fragmentation and distance from the original speech structures. One possibility is to explore fragments that occupy the same area of the prosodic space, creating sequences that make more sense musically than lexically, and thus shifting the listening focus to their musical structures. This can involve repetition and progressive variation of shorter or longer segments, more in line with a typical musical exploration of such material.
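Selecting fragments from the same area of the prosodic space amounts to a nearest-neighbour search over the segment descriptors. The sketch below illustrates the principle only: the three descriptors follow the figure above, but the corpus values, their units, the normalisation and the function name are invented for the example.

```python
import numpy as np

# Hypothetical corpus: one row per segment, with columns
# mean pitch (Hz), speech rate (syllables per second), vocal effort (0..1).
segments = np.array([
    [110.0, 3.5, 0.2],
    [115.0, 3.7, 0.3],
    [220.0, 5.0, 0.8],
    [210.0, 4.8, 0.7],
    [150.0, 6.2, 0.5],
])

def nearest_segments(descriptors, target, count):
    """Indices of the `count` segments closest to `target`, nearest first.
    Each descriptor is standardised so that no single unit dominates."""
    descriptors = np.asarray(descriptors, dtype=float)
    mean = descriptors.mean(axis=0)
    spread = descriptors.std(axis=0)
    spread[spread == 0] = 1.0  # guard against constant descriptors
    scaled = (descriptors - mean) / spread
    scaled_target = (np.asarray(target, dtype=float) - mean) / spread
    distances = np.linalg.norm(scaled - scaled_target, axis=1)
    return np.argsort(distances)[:count]

# Pick the two segments nearest to a high-pitched, fairly fast, loud target.
sequence = nearest_segments(segments, target=[218.0, 5.0, 0.8], count=2)
```

Moving the target point gradually through the space and concatenating the selected segments is one way to obtain the progressive variation described above, in the spirit of corpus-based concatenative synthesis.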

Excerpt from database of analyzed segments


An extension of the database approach is the use of machine learning, as typically found in automatic speech recognition. Machine learning is a huge field in itself, and is also used for pattern recognition and generation in interactive music systems. Although it was never meant to be the main focus of this project, it has proved a useful influence for introducing improvisational elements such as interactivity and the unknown response.



Atal, B S, and Suzanne L Hanauer. 1971. “Speech Analysis and Synthesis by Linear Prediction of the Speech Wave.” Journal of the Acoustical Society of America 50: 637–55.

Bogert, Bruce P, Michael J R Healy, and John W Tukey. 1963. “The Quefrency Alanysis of Time Series for Echoes: Cepstrum, Pseudo-Autocovariance, Cross-Cepstrum and Saphe Cracking.” In Proceedings of the Symposium on Time Series Analysis, 209–43.

Fant, Gunnar. 1960. Acoustic Theory of Speech Production: With Calculations Based on X-Ray Studies of Russian Articulations. The Hague, Netherlands: Mouton.

Mermelstein, Paul. 1976. “Distance Measures for Speech Recognition, Psychological and Instrumental.” Pattern Recognition and Artificial Intelligence 116: 374–88.

Moulines, Eric, and Francis Charpentier. 1990. “Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones.” Speech Communication 9(5–6): 453–67.

Roads, Curtis. 1996. The Computer Music Tutorial. Cambridge, Mass.: MIT Press.

Rodet, Xavier, and Diemo Schwarz. 2007. “Spectral Envelopes and Additive + Residual Analysis/Synthesis.” In Analysis, Synthesis, and Perception of Musical Sounds, 175–227.

Schwarz, Diemo, Grégory Beller, Bruno Verbrugghe, and Sam Britton. 2006. “Real-Time Corpus-Based Concatenative Synthesis with CataRT.” In 9th Int. Conference on Digital Audio Effects (DAFx), Montreal, Canada, 279–82.

Smith III, Julius O. 2014. “Cross Synthesis Using Cepstral Smoothing or Linear Prediction for Spectral Envelopes.” Retrieved December 4, 2017, from
