Prominent syllables highlight new or important information and are an important rhythmical feature of speech. Prominence is determined by a combination of factors, including vowel length, amplitude stress, high pitch accents, vocal quality and degree of vowel articulation (unstressed vowels tend to be reduced, meaning they are pronounced closer to the relaxed mid-central vowel ə, or schwa). Any combination of these can cause a syllable to be perceived as more prominent, and the cues are sometimes very subtle. Sometimes prominence is probably just inferred from syntax and context, and even native speakers agree only about 85% of the time when manually marking syllables as prominent or not. This means that making a simple, 100% reliable prominence detector is very hard, something also reflected in the linguistic literature, although some approaches come close to human performance when combining many factors (Tamburini, 2002; Obin, Rodet, & Lacheret-Dujour, 2009). For my purpose, though, it might be enough to detect only clearly stressed syllables, as I am more concerned with the resulting rhythm of stress patterns than with making a linguistically correct prominence analysis.
For real-time analysis I would ideally determine prominence at the onset, but that is actually impossible, given that one cannot know at the onset whether the syllable is going to be more prominent than the last. The closest thing would be to measure pitch and amplitude early in the syllable, but that would rule out the most important feature of prominence: syllable duration (Obin, Rodet, & Lacheret-Dujour, 2008). As duration is only evident once the syllable has finished, including it would delay detection in a real-time analysis. While that could also work, I first tried looking at amplitude stress at syllable peaks, which typically occur a short time into the vowel and are thus almost as good as the onset. But measuring amplitudes raises the problem of a reference point, since both overall pitch and energy tend to fall naturally over the course of a breath as air pressure decreases. This means that stress is relative to the surrounding syllables, and a short-term memory of about 7 syllables has been proposed in this regard (Martin, 2010). Other, offline analyses typically look at the syllables before and after to determine local prominence. But as in real life, a real-time analysis cannot know what will follow, so instead I am using a short moving average as the reference. This setup seems to work for getting a rough rhythm of prominent syllables, using the amplitude of the formant region (500-4800 Hz) as a stress indicator for now. I will, however, look more into combining descriptors, including duration and pitch, and see if I can make that work even with the delay introduced by including duration in the analysis.
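As a rough illustration of the moving-average reference described above, here is a minimal Python sketch. It is not my actual implementation: the function name `mark_prominent`, the window of 7 syllables (borrowed from the memory span mentioned above) and the threshold ratio of 1.25 are all assumptions chosen for the example. The input is assumed to be one peak amplitude per syllable, already measured in the formant band.

```python
from collections import deque


def mark_prominent(peak_amplitudes, window=7, ratio=1.25):
    """Flag syllables whose peak amplitude exceeds a causal moving average.

    peak_amplitudes: one amplitude value per syllable peak, in time order.
    window: number of preceding syllables in the reference (assumed value).
    ratio: how far above the reference a peak must be (assumed value).
    """
    history = deque(maxlen=window)  # only past syllables, so it is causal
    flags = []
    for amp in peak_amplitudes:
        if history:
            reference = sum(history) / len(history)
            flags.append(amp > ratio * reference)
        else:
            # No reference yet for the very first syllable.
            flags.append(False)
        history.append(amp)
    return flags


if __name__ == "__main__":
    peaks = [1.0, 1.0, 2.0, 1.0, 0.9, 1.6]
    print(mark_prominent(peaks))
```

Because the reference only uses syllables that have already been heard, each decision can be made at the syllable peak without waiting for what follows, and the reference drifts downward along with the natural fall in energy over a breath.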
Tamburini, F. (2002). Automatic detection of prosodic prominence in continuous speech. In Proceedings of the 3rd International Conference on Language Resources and Evaluation 2002 LREC 02 (pp. 301–306). Retrieved from https://corpora.dslo.unibo.it/People/Tamburini/Pubs/LREC2002_ProsodicProminence.pdf
Obin, N., Rodet, X., & Lacheret-Dujour, A. (2009). A Syllable-Based Prominence Detection Model Based on Discriminant Analysis and Context-Dependency. In Speech and Computer (pp. 97–100). Russia. Retrieved from https://halshs.archives-ouvertes.fr/halshs-00636518
Obin, N., Rodet, X., & Lacheret-Dujour, A. (2008). French prominence: A probabilistic framework. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 3993–3996). Las Vegas, NV. https://doi.org/10.1109/ICASSP.2008.4518529
Martin, P. (2010). Prominence detection without syllabic segmentation. In Speech Prosody 2010-Fifth International Conference.