Software Instrument

The “Orchestra of Speech” instrument system

Part of the research method in this project was to develop software tools for analysing, extracting and orchestrating musical features from speech.

In order to use such tools in an improvised musical setting, the software system needed to work in real time, and be playable like an instrument. I knew it was going to be based on analyses of speech and have some means for generating sound based on these analyses, but other than that I could not really foresee exactly which features would turn out to be interesting beforehand. In that sense, the development of this digital musical instrument was different from building a system where the functional requirements are known in advance. The software development process became an integral part of the musical exploration of different ideas and approaches to speech as source material for music making. This changing and evolving nature is common for many such complex digital performance instruments where new ideas are developed and tested continuously (Trifonova, Brandtsegg, & Jaccheri, 2008).


The software instrument has been developed in the popular Max graphical programming environment. The system has made heavy use of some external Max libraries of analysis and processing tools developed at the Institut de Recherche et de Coordination Acoustique/Musique (IRCAM), at first the FTM (“Faster Than Music”) library (Schnell, Borghesi, & Schwarz, 2005) and later including the MuBu (“multi-buffer”) external library (Schnell, Röbel, Schwarz, Peeters, & Borghesi, 2009).

An overview of the complete software instrument system is presented below. For detailed explanations of the actual implementation in Max, please refer to the commented Max patches available from the “downloads” page.

System Overview

A digital composed instrument such as this can be viewed as consisting of many processes that can be classified in terms of their function, such as analysis, transformation, and synthesis (Schnell & Battier, 2002). This is a useful division that has served as a guideline for my modular approach to building a playable digital instrument. In this instrument, speech recordings fulfil the function of the performer gestures traditionally used as instrument input. The performer in turn controls how these recordings are translated into musical expressions. A helpful concept in this regard is the metaphor of “orchestration”, describing the process of arranging and distributing the musical structures extracted from the speech material.

System demonstration #1 overview:

In the diagram below, a schematic overview of the whole system shows the different parts and their function.

Input: speech recordings

At the core of the system is a pair of buffers for loading or recording collections (corpora) of speech recordings. The recordings can be analysed and automatically segmented into syllables (or more precisely: vowels), stressed syllable motifs and breath length phrases. For all segments some basic musical descriptors are calculated: mean pitch, segment duration, amplitude, pitch slope and tempo. Collections of analysed recordings can be saved as corpora files that can be swapped on the fly during performance, eliminating the need to perform such segmentation every time the sound set is needed.
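As an illustration of what such per-segment descriptors might look like, here is a minimal Python sketch. It is not taken from the Max patches: the function and field names are hypothetical, the input is assumed to be a per-frame pitch and amplitude track for one voiced segment, and pitch slope is assumed to be a least-squares fit over time. Tempo is left out because the text does not say how it is derived.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SegmentDescriptors:
    mean_pitch_hz: float
    duration_s: float
    amplitude: float
    pitch_slope_hz_per_s: float


def describe_segment(pitch_hz, amplitude, frame_period_s):
    """Summarise one segment from per-frame pitch and amplitude tracks.

    Assumption: both tracks have one value per analysis frame and the
    segment is voiced throughout (no zero-pitch frames).
    """
    times = [i * frame_period_s for i in range(len(pitch_hz))]
    t_mean = mean(times)
    p_mean = mean(pitch_hz)
    # Least-squares slope of pitch over time (0.0 for a single frame).
    denom = sum((t - t_mean) ** 2 for t in times)
    numer = sum((t - t_mean) * (p - p_mean) for t, p in zip(times, pitch_hz))
    slope = numer / denom if denom else 0.0
    return SegmentDescriptors(
        mean_pitch_hz=p_mean,
        duration_s=len(pitch_hz) * frame_period_s,
        amplitude=mean(amplitude),
        pitch_slope_hz_per_s=slope,
    )
```

Storing only these few numbers per segment is what makes it cheap to save and swap whole corpora without re-running the segmentation.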

Machine Learning

The system includes a simple method for machine learning through the use of Markov models (for details, see this blog post on machine learning). When a corpus is loaded, the segments’ descriptors are used to generate first-order Markov models that describe the likelihood of any transition between two states in the corpus: between different pitches, durations, amplitudes, pitch slopes and tempi. These statistical models can be used for proposing likely transitions, generating alternative but statistically probable sequences of segments. This is described further below.
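The idea of a first-order Markov model over one descriptor can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Max implementation: the function names are hypothetical, and the continuous descriptor values are assumed to be quantised into discrete states first (here, pitch rounded to the nearest MIDI note number), a step the text does not specify.

```python
import math
import random
from collections import Counter, defaultdict


def hz_to_midi(frequency_hz):
    """Quantise a pitch in Hz to the nearest MIDI note number (assumed state space)."""
    return round(69 + 12 * math.log2(frequency_hz / 440.0))


def build_markov_model(states):
    """Count transitions between consecutive states and normalise to probabilities."""
    counts = defaultdict(Counter)
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    model = {}
    for state, followers in counts.items():
        total = sum(followers.values())
        model[state] = {nxt: n / total for nxt, n in followers.items()}
    return model


def propose_next(model, state, rng=random):
    """Draw a statistically probable next state, or None if the state was never left."""
    transitions = model.get(state)
    if not transitions:
        return None
    candidates = list(transitions)
    weights = [transitions[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

One such model would be built per descriptor (pitch, duration, amplitude, pitch slope, tempo), so that each can propose its own likely continuation.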


The segments of the corpus currently loaded into the main buffer can be played back by two identical player devices, which can be operated separately, or synchronized to create shadowing effects. Playback is handled as pitch-synchronous granular synthesis (Roads, 1996, p. 174), which means that it plays “sound grains” of one wave period at a time, at the frequency defined as the fundamental pitch. As a reference for this fundamental frequency it uses a pitch track generated and stored in the buffer as part of the initial offline analysis and segmentation. The pitch-synchronous granular method allows decoupling of playback speed and pitch, as the grain rate can be treated separately from the fundamental frequency indicated by the original playback position. This enables playback speed transformations like time stretching and compressing, independently of frequency transformations like pitch and spectrum transpositions. It also allows independent changes to the timing of segment onsets, making changes to the overall speech rate or tempo possible.
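How grain rate and read position can be decoupled may be easier to see in a small scheduling sketch. The Python below is a simplified illustration, not the actual player device: it only computes when each one-period grain would be emitted and where in the source it would be read, it assumes a voiced pitch track with one value per analysis frame, and the parameter names `speed` and `transpose` are hypothetical.

```python
def schedule_grains(pitch_track_hz, frame_period_s, speed=1.0, transpose=1.0):
    """Return a list of (output_time_s, source_time_s, grain_length_s) tuples.

    Assumptions: every frame is voiced (pitch > 0), speed > 0, transpose > 0.
    """
    source_duration = len(pitch_track_hz) * frame_period_s
    grains = []
    output_time = 0.0
    source_time = 0.0
    while source_time < source_duration:
        frame = min(int(source_time / frame_period_s), len(pitch_track_hz) - 1)
        f0 = pitch_track_hz[frame]
        # Each grain holds one wave period of the source at the read position.
        grains.append((output_time, source_time, 1.0 / f0))
        # The grain rate sets the perceived pitch ...
        output_period = 1.0 / (f0 * transpose)
        output_time += output_period
        # ... while the read position advances at its own rate.
        source_time += output_period * speed
    return grains
```

With `speed=0.5` the same source is covered by twice as many grains at an unchanged grain rate (time stretching), whereas `transpose=2.0` doubles the grain rate while the read position still moves through the source in real time (pitch shift without a change in duration).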

System demonstration #2 playback device:

Playback control

Segments can be navigated as dots in a graphical user interface, displayed in a scatter plot as a function of mean pitch (y), tempo/duration (x) and amplitude (colour). The number of voices available for polyphonic playback can be set from one to eight, and a quantity measure allows up to 10 of the closest segments to be triggered at the same time as a cluster or a sequence (depending on the number of available voices), optionally repeating in a semi-random order. If, however, the Markov chain mode is enabled, each triggered segment will query the Markov model for probable transitions for each of the segment’s descriptors. These probable transitions are in turn used to find and trigger the closest matching segment in the corpus (this approximation enables output even when no exactly matching segment is found, thus avoiding the dead ends that are typically a problem with Markov chains). In effect, this function generates Markov chains of alternative but statistically probable sequences of segments. The degree of freedom for these sequences can further be limited with an imposed measure of continuity, influencing which descriptors are prioritized when searching for the closest match. This measure ranges from 0% continuity, where all musical descriptors are weighted equally, through 50%, where only pitch and duration are weighted, towards a state at 75% where the file index is also weighted in, thus limiting segment choice to only those of the same recording. When the continuity measure is set fully to 100%, only the file and segment indices are weighted, forcing a perfectly continuous playback of segments from a single file in their original order. This way, by changing the continuity measure one can move gradually from a linear sequence of segments to a probabilistic sequence, as demonstrated from 1:17 in the video demonstration above.
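The closest-match search with descriptor weighting can be sketched as a weighted nearest-neighbour lookup. The Python below is only an illustration: the dictionary keys, the weight values and the heavy weight on the file index are assumptions, the descriptors are assumed to be normalised to the range 0 to 1, and the way the real system interpolates between the four named continuity settings is not covered.

```python
import math

# Hypothetical weight presets for the four continuity settings named above.
# Only descriptors with a non-zero weight are listed.
CONTINUITY_WEIGHTS = {
    0: {"pitch": 1, "duration": 1, "amplitude": 1, "slope": 1, "tempo": 1},
    50: {"pitch": 1, "duration": 1},
    # Assumed: a heavy file weight so that other recordings are never closer.
    75: {"pitch": 1, "duration": 1, "file": 10},
    100: {"file": 1, "index": 1},
}


def closest_segment(target, corpus, weights):
    """Return the corpus segment with the smallest weighted distance to target.

    target and each segment are dicts; only the keys named in weights are compared.
    """
    def distance(segment):
        return math.sqrt(sum(w * (segment[key] - target[key]) ** 2
                             for key, w in weights.items()))
    return min(corpus, key=distance)
```

Because the search always returns the nearest available segment, a proposed transition that has no exact match in the corpus still produces output. At 100% the target would simply be the current file index and the next segment index, which reduces the search to linear playback.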

Continuous analysis

The sound output from the playback device can be fed into an online analysis stage, performing real time segmentation and analyses of spectrum, fundamental pitch, vocal effort and tempo.

Devices overview

Synthesis devices

Due to the modular layout, the resulting stream of continuous analysis data can be used collectively by a number of synthesis devices, making it possible to create different simultaneous layers of orchestration from the same input stream. In the present setup that includes an additive synthesizer, a MIDI note output device and a hybrid audio/MIDI percussive device.

The modular layout also means that these devices have separate but similar controls for the transformation, synthesis and output of the same incoming continuous analysis stream as shown in the diagram above.

System demonstration #3 synthesis device:

System demonstration #4 MIDI device:

System demonstration #5 percussive device:


The sound of all playback and synthesis devices can either be mixed to a stereo output or routed to multiple output channels. In the current setup, 16 channels are used, connected to a set of stereo loudspeakers, two small radios, and 12 transducers attached to acoustic instruments resulting in a hybrid “electric/acoustic” sound. This is reflected in a panning interface where the sound position can be controlled, not as an exact position in the room, but sent to the different speakers and instruments as a kind of direct orchestration of the sound output. In line with the metaphor of orchestration, this “orchestra” of loudspeakers and acoustic instrument-speakers is organised into sections of different instrument classes: speakers, drums, strings and cymbals, with four instruments/speakers in each group. The interface includes one two-dimensional control parameter for panning between the four sections, and another for panning within sections. Thus, with these two interface controls one can gradually move the sound between any of the 16 outputs of the system (this interface is shown at 2:54 in the video demonstration of the playback device above).
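One way to picture the two-level panning is as two bilinear pans multiplied together. The following Python sketch is an assumption about how such a control could work, not a description of the actual Max patch: it places the four sections (and the four outputs within each section) at the corners of a unit square and uses linear rather than equal-power gains.

```python
SECTIONS = ["speakers", "drums", "strings", "cymbals"]


def corner_weights(x, y):
    """Bilinear weights for the four corners of a unit square; they sum to 1."""
    return [(1 - x) * (1 - y), x * (1 - y), (1 - x) * y, x * y]


def output_gains(section_xy, within_xy):
    """Return 16 gains, ordered section by section, four outputs per section."""
    section_w = corner_weights(*section_xy)
    within_w = corner_weights(*within_xy)
    return [s * w for s in section_w for w in within_w]


def gains_by_section(section_xy, within_xy):
    """Same gains, grouped under the section names for readability."""
    gains = output_gains(section_xy, within_xy)
    return {name: gains[i * 4:(i + 1) * 4] for i, name in enumerate(SECTIONS)}
```

For example, `output_gains((0, 0), (0, 0))` sends everything to the first output of the first section, while `output_gains((0.5, 0.5), (0.5, 0.5))` spreads the sound evenly over all 16 outputs.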

In addition, an “autopan” function can change the pan position automatically on every segment onset if its range is set to more than 0%. Up to 50% it will limit the new position to within the same instrument group, while over 50% the autopan function will increasingly pan across all outputs.
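The behaviour of the autopan range could be sketched roughly as follows. This Python example is a guess at one possible rule, not the implemented one: it assumes that outputs are numbered 0 to 15 in groups of four, and that above 50% the probability of leaving the current group grows linearly up to 100%.

```python
import random

GROUP_SIZE = 4
NUM_OUTPUTS = 16


def autopan(current_output, range_percent, rng=random):
    """Pick the output for the next segment onset (hypothetical rule)."""
    if range_percent <= 0:
        return current_output  # autopan is off
    group_start = (current_output // GROUP_SIZE) * GROUP_SIZE
    # Assumed: no chance of leaving the group up to 50%, certain at 100%.
    leave_probability = max(0.0, (range_percent - 50) / 50)
    if rng.random() < leave_probability:
        return rng.randrange(NUM_OUTPUTS)
    return group_start + rng.randrange(GROUP_SIZE)
```

Calling `autopan` once per segment onset keeps the sound within, say, the strings section at low range settings and lets it wander across the whole “orchestra” at high settings.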

MIDI channel routing is handled in the same way, sending MIDI notes to a backend of 16 software instruments (for convenience hosted in the popular music production software Ableton Live). The audio output from these software instruments is routed to the same multichannel setup of transducers and speakers, resulting in a coherent layout for controlling multichannel pan position for all devices in the system.

Live audio input

In addition to the signal flow described above, it is also possible to feed live audio input into the system. This input can be used in three different ways. Sound can be recorded in one of the buffers and instantly used for playback. It can also be used to trigger playback of segments that are already in the buffers, either by finding the most similar segment (mirror mode), or by triggering a response by querying the Markov model for a statistically probable continuation of the input segment (query mode). Alternatively, audio input can be routed directly into the online analysis stage for instant live orchestration by the synthesis devices.

Control and automation

The system can be controlled through a conventional graphical user interface (GUI) with dials, sliders and switches. Physical MIDI controllers can be connected and mapped to these control parameters by an automatic learning function. When activated, this function listens for MIDI input and maps the active incoming controller number to the last active GUI parameter. The same centralized parameter subsystem is used for storing and recalling presets both globally and for each device in the system. Through accessing this subsystem it is also possible to enable complete automation of any parameter in the system by scripting cue files, a feature that was implemented in order to use the system in a complete self-playing mode as a sound installation.
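The MIDI learning function and the shared parameter store can be illustrated with a small Python sketch. The class and method names are hypothetical, and parameter values are assumed to be normalised to the range 0 to 1; the actual subsystem in Max may be organised differently.

```python
class ParameterRegistry:
    """Central store for parameter values, MIDI mappings and presets (sketch)."""

    def __init__(self):
        self.values = {}          # parameter name -> normalised value (0..1)
        self.cc_map = {}          # MIDI controller number -> parameter name
        self.last_touched = None  # last parameter changed from the GUI
        self.learning = False

    def set_from_gui(self, name, value):
        self.values[name] = value
        self.last_touched = name

    def handle_cc(self, controller, value):
        """Process an incoming MIDI control change (value 0..127)."""
        if self.learning and self.last_touched is not None:
            # Learn mode: bind this controller to the last active GUI parameter.
            self.cc_map[controller] = self.last_touched
            self.learning = False
        name = self.cc_map.get(controller)
        if name is not None:
            self.values[name] = value / 127.0
        return name

    def store_preset(self):
        return dict(self.values)

    def recall_preset(self, preset):
        self.values.update(preset)
```

Because GUI changes, MIDI input and preset recall all go through the same store, a cue file for automation would only need to replay timed calls to it.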

← Previous page: Results Next page: Music


Roads, C. (1996). The computer music tutorial. Cambridge, Mass.: MIT Press.

Schnell, N., & Battier, M. (2002). Introducing Composed Instruments, Technical and Musicological Implications. In Proceedings of the international conference on new interfaces for musical expression (pp. 156–160). Dublin, Ireland.

Schnell, N., Borghesi, R., & Schwarz, D. (2005). FTM: Complex Data Structures for Max/MSP. In Proceedings of International Computer Music Conference (ICMC).

Schnell, N., Röbel, A., Schwarz, D., Peeters, G., & Borghesi, R. (2009). MUBU & friends – Assembling tools for content based real-time interactive audio processing in MAX/MSP. In Proceedings of International Computer Music Conference (pp. 423–426).

Trifonova, A., Brandtsegg, Ø., & Jaccheri, L. (2008). Software engineering for and with artists: a case study. In Proceedings of the 3rd ACM International Conference on Digital Interactive Media in Entertainment and Arts (DIMEA’08) (Vol. 349, pp. 190–197). Athens, Greece.
