Instrument and performance

The Orchestra of Speech software system works as a playable musical instrument, but can also be thought of as an instrument in the scientific sense of the word, like a prism or microscope that can zoom in and extract different musical structures from speech.

It is not meant to represent the kind of ground-breaking technical innovation that, say, a new cutting-edge synthesis technique or control paradigm would, but should rather be viewed as a case study of how to put together – how to compose – a selection of already available techniques in order to realise a particular artistic vision. The innovation lies in the personal combination of ideas and system, and in the musical outcomes made possible by this particular combination.

For the same reason, this instrument is not meant to be a general-purpose musical instrument that anyone can pick up and play. Personal aesthetic notions underpin every decision in the design and development, which is why such compound instruments can often be hard to separate from the music-making processes of which they are part.

As described in the overview of the software instrument, the system was developed from scratch with few initial specifications given. The features and the ideas about the instrument changed repeatedly in response to testing and performances, making any functional requirements a moving target. Nevertheless, over time it has stabilised into a set of features that seems to fulfil the most important artistic needs identified during development. These include the functions for creating several different layers from the same speech source, the functions for indeterminacy, and the functions for using live sound input to provoke responses and engage in interplay.

Technical improvements are still possible, both in the particular features and in how they are implemented. Within the scope of this project, however, I chose to prioritise other issues once the instrument worked satisfactorily. The software system was, after all, not the end goal of this project, but was developed as an important tool for the overall purpose of exploring speech in an improvised musical setting.

Though fully functional, a first design such as this will still have the character of a prototype. In future work it might very well be developed further, or even rebuilt from scratch based on the functional requirements identified in this work.

Performance issues

The issues that have concerned me regarding performance have often been of a technical character – for example how to integrate the different functions into a manageable whole, and how to control the instrument through an intuitive interface. Achieving what Sidney Fels calls control intimacy (Fels, 2004) with software instruments is very much about embodiment, and about using physical interfaces instead of the intangible virtual interfaces presented by computer screens. One of my early blog posts about developing interfaces, and a later one about building custom controllers, show more of my thoughts on this subject.

But in addition to these technical aspects of instrument design, one overall concern has been how to integrate the new approaches to music making presented by this project into a performance practice honed over many years as an improvising keyboard player. During early system testing I was essentially just playing back individual recordings of speech from beginning to end, trying as best I could to keep up and make an interesting musical ‘translation’ on the fly. This felt insufficient. I was anxious that this whole approach to live “orchestration” was flawed, and that it was going to be more like superficial “remixing” than the creative process of exploratory improvisation I was used to as a musician.

When I started to use transducers attached to acoustic instruments as speakers, I found that the electronic computer instrument came closer to the sonic realm of physical acoustic instruments. It became much easier to relate to my role as a performer, and I could even engage in ‘dialogues’ on the piano. Another advancement was the ability to create several musical layers derived from the same speech stream, as in an organ study for three parts played back on different registers of a MIDI-controlled pipe organ.


But the system was still somewhat slow and impractical to operate, and not intuitive enough for pursuing new musical ideas as they appeared in the moment. To quickly change musical character I had to know exactly which recording to use. To speed up changes I started using precompiled lists of selected recordings that had the kind of character I wanted. Although this made navigation quicker, it also narrowed the options considerably during performances.

A turning point came when I started to organise recordings into corpora, databases with collections of analysed and segmented recordings. It then became possible to sort and find recordings based on musical criteria like tempo and register, and much easier to navigate large collections of many different recordings. It was also much easier to juxtapose several recordings and make collages of similar segments on the fly. However, the most significant change with this corpus approach was the added ability to build statistical models. Such a model can be used to create alternative sequences that are statistically likely based on the original sequences, and that therefore share some of the same overall characteristics.
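As a rough illustration of what finding segments by musical criteria can look like, here is a minimal Python sketch. It is not the actual implementation of the instrument: the `Segment` fields, their units, the example values and the `find` helper are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    recording: str  # hypothetical name of the source recording
    tempo: float    # assumed unit: syllables per second
    pitch: float    # assumed unit: mean pitch as a MIDI note number


# A tiny made-up corpus; a real one would hold many analysed segments.
corpus = [
    Segment("interview", 5.2, 57.0),
    Segment("lecture", 3.1, 50.5),
    Segment("poem", 2.4, 62.0),
    Segment("phone_call", 6.0, 64.5),
]


def find(corpus, min_tempo=0.0, max_tempo=float("inf"),
         min_pitch=0.0, max_pitch=127.0):
    """Return segments inside the tempo and register ranges, slowest first."""
    hits = [
        s for s in corpus
        if min_tempo <= s.tempo <= max_tempo
        and min_pitch <= s.pitch <= max_pitch
    ]
    return sorted(hits, key=lambda s: s.tempo)


# Ask for slow speech in a high register.
slow_and_high = find(corpus, max_tempo=4.0, min_pitch=60.0)
print([s.recording for s in slow_and_high])
```

Once the descriptors are stored with each segment, the same kind of query can serve both navigation and the assembling of collages of similar segments.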

Initially I had reservations about fragmenting the speech sounds, as I consider the particular timing of how events combine into phrases to be an important characteristic of speech gestures. I had, after all, wanted to investigate the musical structures implied by the original speech structures, and not just dress up my old musical ideas in fragmented speech sounds. But with these statistically probable sequences, the overall characteristics of the original timing and intonation patterns are actually preserved, even when the gesture itself is rearranged and consists of many shorter fragments. This is because no transition between any pitch, duration or other feature is used that does not already occur in the original sequences, and how often each transition occurs is determined by how widespread it is in the original material. This way, typical successions of long and short syllables, of high and low pitches and so on are carried on even when new sequences are generated.
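The property described above is that of a first-order Markov chain, and a minimal Python sketch may make it concrete. This is an illustration under stated assumptions, not the model actually used in the instrument: the coarse event labels, the example phrases and the function names `build_model` and `generate` are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical analysed phrases: each event is a (register, duration) label.
phrases = [
    [("low", "long"), ("high", "short"), ("high", "short"), ("mid", "long")],
    [("high", "short"), ("mid", "long"), ("low", "long"), ("high", "short")],
]


def build_model(sequences):
    """Collect, for every event, the events that follow it in the originals.

    Followers are stored with repetition, so a transition that occurs
    twice in the material is twice as likely to be chosen later.
    """
    transitions = defaultdict(list)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            transitions[current].append(following)
    return transitions


def generate(transitions, start, length, rng=random):
    """Generate a new sequence using only transitions seen in the originals."""
    out = [start]
    while len(out) < length:
        options = transitions.get(out[-1])
        if not options:  # no observed continuation: stop early
            break
        out.append(rng.choice(options))
    return out


model = build_model(phrases)
print(generate(model, phrases[0][0], 8))
```

Because `generate` can only pick followers that were observed, every adjacent pair in the output already exists somewhere in the source phrases, even though the sequence as a whole is new.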

If an alternative organisation more typical of music is adopted, such as cycles of repeated segments or collages of similar sounds, then a gradual transition between such music-associated organisation and plain speech organisation is still possible. This way such formal distinctions can be dealt with musically and reflected upon in the music itself.

Another important feature made possible by using analysed corpora and statistical models was the ability to trigger segments with live sound input, using speech, song or even a musical instrument as a kind of acoustic “controller”. In mirror-mode, incoming analysed sound segments can be used to trigger the most similar speech segments in the corpus, while in query-mode, live input segments can be used to query a statistical model instead, returning the most likely continuation based on sequences already in the corpus. In effect this creates a kind of rudimentary, dadaistic “speech recognition” system, listening and producing musically probable (though otherwise nonsensical) utterances in response to live sound input.
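The difference between the two modes can be sketched in a few lines of Python. This is a simplified illustration, not the instrument's actual code: the two-dimensional feature vectors, the Euclidean distance measure, the example values and the names `nearest`, `mirror` and `query` are all assumptions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical corpus in original order: (pitch as MIDI note, duration in s).
corpus = [
    (60.0, 0.20), (64.0, 0.10), (60.0, 0.20),
    (55.0, 0.40), (60.0, 0.20), (64.0, 0.10),
]

# Count which segment follows which in the original sequence.
counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    counts[current][following] += 1


def nearest(features, corpus):
    """Return the corpus segment closest to the input features.

    Plain Euclidean distance is an assumption; in practice the features
    would probably need to be scaled to comparable ranges first.
    """
    return min(corpus, key=lambda seg: math.dist(features, seg))


def mirror(features, corpus):
    """Mirror-mode: answer with the most similar segment itself."""
    return nearest(features, corpus)


def query(features, corpus, counts):
    """Query-mode: answer with the most likely continuation of the match."""
    followers = counts.get(nearest(features, corpus))
    if not followers:
        return None  # the matched segment never had a continuation
    return followers.most_common(1)[0][0]


live_input = (61.0, 0.25)  # an analysed segment of incoming sound
print(mirror(live_input, corpus))
print(query(live_input, corpus, counts))
```

With this input, mirror-mode returns the segment closest to what was played, while query-mode returns what most often came next after that segment in the corpus, which is what gives the response its conversational character.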

With these extended features the system became much more responsive and intuitive to use as a performance instrument, reducing the conceptual gap between controlling a computer program and playing an acoustic instrument. This meant that I could integrate both software instrument and piano into the same performance setup, and treat them as extensions of each other.

Improvisation as dialogue

The situation that occurs when musicians improvise together can result in a multitude of different musical outcomes, music that does not by any means have to resemble a conversation. All the semiotic potential of any sound structure is available for the musicians to create music that can sound like anything and “be about” anything. Yet the situation itself, improvising something together with others by means of sound, is fundamentally the same as in any spontaneous conversation, even if cause and expression are totally different. Improvising musicians make utterances with sound, and one way or another have to relate to their own and the other musicians’ utterances (and not relating is also an act of communication in that regard). The same applies to the structural aspects of improvisation, including the continuous negotiation of what it is about and in which direction it goes, with past experiences influencing the expectations of what can happen and how it will develop. The difference, of course, is that with music there is an aesthetic purpose and a public framing that allow for all kinds of roles and modes of interaction. Nevertheless, it is still a social situation where musicians have to relate to their fellow musicians’ sounding utterances to interpret intentions and ideas, and therein, I think, lies the similarity to spoken conversation and the possibility of experiencing improvised music as dialectically meaningful.

This is one reason why I think it is interesting to juxtapose musical imprints of spoken conversation with improvised musical interaction, highlighting these similarities and the implicit discourse apparent in improvisation.


Fels, S. (2004). Designing for Intimacy: Creating New Interfaces for Musical Expression. Proceedings of the IEEE, 92(4), 672–685.
