Abstracts
Plenary Talks | SPS1 | SPS2 | SPS3 | SPS4 | SPS5 | OS1 | OS2 | OS3 | OS4 | OS5 | PS1 | PS2 | PS3 | PS4 | PS5 | PS6 | PS7 | PS8 | Vitrine

Plenary Talks

Keynote Speaker 1 - Tuesday, May 2, 10:00 - 10:45 1 of 3

Julia Hirschberg, Columbia University

Recognizing and Conveying Speaker State Prosodically

Extended Abstract:
A speaker's mental state is often conveyed by acoustic and prosodic factors, as well as the words they choose and the gestures they use. Considerable research has been done in recent years to detect emotional state in IVR systems, so that angry or frustrated users can be directed to a human agent. Other research has sought to identify a wider variety of emotions and intentions in recorded meetings, again from acoustic and prosodic cues. From the perspective of speech generation, the problem of conveying emotional state has emerged as a critical topic in the continuing effort to make TTS systems sound more like real human beings. Computer game designers as well as IVR system developers all cite the limits of prosodic and emotional 'naturalness' as a barrier to using current systems.

In this talk I will describe ongoing research in the speech group at Columbia, designed to expand the variety of speaker states which may be identified and produced by acoustic and prosodic variation. I will describe recent work in the detection of confidence and uncertainty in a physics tutoring system (joint work with the University of Pittsburgh), work to identify the acoustic and prosodic characteristics of 'charismatic' speech across cultures, and research into the acoustic and prosodic indicators of deceptive speech (joint work with the University of Colorado and SRI International). I will also describe recent progress in the automatic detection of prosodic features which should make both recognition and generation of the prosodic characteristics of speaker state more accurate.

 
Keynote Speaker 2 - Wednesday, May 3, 09:00 - 09:45 2 of 3

Hartmut R. Pfitzinger, University of Munich

Five Dimensions of Prosody: Intensity, Intonation, Timing, Voice Quality, and Degree of Reduction

Extended Abstract:
This talk gives an overview of methods for analysis, modification, and synthesis of the prosodic properties of speech. The term prosodic properties is supposed to cover all phenomena that are not segmental and that are described on several tiers parallel to the segmental tier. Firth [1] called them prosodies and according to him e. g. distant assimilation (such as Turkish vowel harmony) is also covered by his term. This very broad meaning of the term prosodies is much closer to my view than the very common habit of saying prosody and meaning only intonation. One of the main purposes of this talk is to demonstrate that the manipulation of intonation or timing alone can sometimes produce prosodically contradictory stimuli which in turn inconsistently degrade perception results.

I entitled the talk "Five Dimensions of Prosody" because the first three dimensions intensity, intonation, and timing are very well known, and Campbell and Mokhtari [2] named voice quality the fourth prosodic dimension. Obviously, I have added another dimension, one which I would like to name the degree of reduction. The remainder of the talk is concerned with describing the analysis, modification, and synthesis of each of these five prosodies.

Intensity can be measured easily and reliably by means of root-mean-square, or rectifying and averaging, or, more precisely, by smoothing the instantaneous amplitude obtained via the Hilbert transform. It is often argued that intensity, whether short-term or long-term, has only minor communicative functions, as can be seen from any broadcast where the short-term amplitude is generally strongly manipulated to maximize loudness of speech without any noticeable impact on the meaning of the speech. But intensity should be taken into account when naturalness is important (e. g. in speech synthesis) or e. g. when perception stimuli with shifted word accents are necessary. In this case the shift of the intonation peak should be accompanied by a shift of an intensity peak.
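As an illustration of the simplest of the measures named above, the following is a minimal sketch of a short-term root-mean-square intensity contour. The function names, the frame and hop parameters, and the decibel reference are assumptions made for this example; they are not taken from the talk.

```python
import math

def rms_contour(samples, frame_len, hop):
    """Return one RMS value per frame of `frame_len` samples, stepping by `hop`.

    `samples` is a plain sequence of amplitude values (hypothetical input format).
    Frames that would run past the end of the signal are dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    contour = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        contour.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return contour

def to_db(rms, ref=1.0, floor=1e-12):
    """Convert an RMS value to decibels relative to `ref` (assumed reference)."""
    return 20.0 * math.log10(max(rms, floor) / ref)
```

Shifting an intensity peak together with an intonation peak, as suggested above, would operate on such a contour; the Hilbert-envelope variant would replace the per-frame RMS with a smoothed instantaneous amplitude.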

Intonation is more complicated: although there exist countless F0 or glottal epoch detection algorithms (Hess [3] gives an overview), none of them is absolutely reliable and many of them work only on high-quality speech recordings, or on a limited range of F0 values, or use an inferior voiced/unvoiced-detection method, or are sensitive to amplitude variations, or suffer from other shortcomings. However, it turned out that error rates of modern algorithms are sufficiently low and that the remaining errors can often be removed by post-processing. Subsequent smoothing, extrapolation, and parameterization are necessary to make intonation accessible to meaningful modification, each of these methods with its own algorithmic problems. It turned out that the command-response model of Fujisaki allows for a very powerful parameterization of intonation contours in many languages [4]. Finally, pitch-synchronous overlap and add (PSOLA) is a very effective way to synthesize the new signal but it has problems especially with strong F0-changes and high-pitched female voices.

Timing is even more complicated: In 1998 [5] I invented a model to estimate perceptual local speech rate (PLSR), a prosodic contour similar to F0 contours, easy to interpret and to modify. It is based on a linear combination of the local syllable rate and the local phone rate, both of which are estimated from manual segmentations of phones and syllable centers, and it produces a mean deviation of 10 percent of the perceptual speech rate, which is precise enough for phonetic studies and speech synthesis. Other methods such as Z-score-based duration contours [6] suffer from probable inconsistencies in their essential knowledge bases, i. e. prototypical mean durations and standard deviations of all speech segments, and from nonlinear elasticity of the segments. For simple copy-synthesis, dynamic time warping (DTW) is more appropriate since it needs no segmentation of either speech signal. Speech pauses, and especially hesitations and repairs, further complicate the timing structure of speech. Their positions and durations are difficult to predict [7].
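To make the DTW remark concrete, here is a minimal sketch of textbook dynamic time warping on two one-dimensional feature sequences. It is not the implementation used in the work described; the function name `dtw`, the absolute-difference local cost, and the scalar inputs are assumptions for illustration, whereas a real copy-synthesis system would typically align frame-wise spectral vectors.

```python
def dtw(a, b):
    """Align sequences `a` and `b`; return (total_cost, warping_path).

    The path is a list of (index_in_a, index_in_b) pairs from start to end.
    Local cost is the absolute difference of the two values (assumed metric).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path
```

The resulting path maps every frame of one utterance onto a frame of the other, which is why no phone or syllable segmentation of either signal is required.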

Voice quality is difficult to measure, modify, and synthesize. Convincing approaches are based on epoch detection methods and inverse filtering techniques, two significant sources of error. The goal is to obtain and parameterize the glottal-flow waveform which is supposed to carry all voice quality properties. While the paper of Campbell and Mokhtari [2] is based only on the normalized amplitude quotient (NAQ), which represents a continuum from breathy to modal or even pressed voice quality, a more holistic approach of Mokhtari, Pfitzinger, and Ishi [8] consisted in applying a principal components analysis (PCA) to a database of glottal-flow waveforms for the purpose of later reconstructing and interpolating all underlying glottal-flow waveforms from just a few principal components (PCs). A starting point to cover a wide range of laryngeal variations was the typology of phonation by Laver [9] and his recordings. It turned out that the first PC mainly accounted for F0 variations which raises the question as to whether the prosodic dimension intonation is better subsumed under the fourth dimension since variations of F0 also influence voice quality.

The degree of reduction is hardly ever interpreted as a prosody, and for good reason: in order to estimate the degree of reduction a huge amount of phonological knowledge is necessary. That is, the canonical form must be known for any utterance to count the number of elisions and insertions (effects in the time domain), and the target formant frequencies (or articulatory target positions) must be known to estimate the segmental undershoot (frequency domain effect). This is closely related to Lindblom's HH theory of phonetic variation [10]. One problem is that there is not only purely mechanical coarticulation, constrained by inherent properties of the speech organs, but also coarticulation rules learned during language acquisition. Even though this prosody is very difficult to estimate, first approaches are highly desirable since it is very important when manipulating speech. E. g. shifting the word accent from one syllable to another is a real problem because the former unstressed syllable usually is produced in a strongly reduced way (in English often as a Schwa) while the stressed syllable generally has a non-central vowel quality and a longer duration. Thus, the target syllable should become de-reduced and the source syllable reduced.

It should be clear that each of the above-mentioned prosodies has its segmental and supra-segmental manifestation. Actually, from my point of view the terms low-frequency components and high-frequency components of prosodies describe the speech facts in a better way. In this view, even the articulatory movements, and thus every detail that constitutes speech, could become prosodies. At the end of the talk two applications of prosodic modifications are demonstrated: one is speech morphing between two utterances of different speakers, which means estimating equally-spaced intermediate utterances with all prosodic properties changing in equal steps from one speaker to another. And the other application is in the field of computer-aided language learning (CALL). Here, we try to show that an automatic prosodic correction of the speech signal of a language learner and its auditory feedback help the learner to acquire a foreign language faster than by hearing the corrections spoken with the teacher's voice [11].

References

[1] Firth, J. R. (1948). Sounds and prosodies. Transactions of the Philological Society, pp. 127-152.

[2] Campbell, N.; Mokhtari, P. (2003). Voice quality: the 4th prosodic dimension. In Proc. of the XVth Int. Congress of Phonetic Sciences, vol. 3, pp. 2417-2420, Barcelona.

[3] Hess, W. (1983). Pitch determination of speech signals: Algorithms and devices. Springer-Verlag, Berlin, Heidelberg, New York.

[4] Fujisaki, H. (2004). Information, prosody, and modeling with emphasis on tonal features of speech. In Proc. of the 2nd Int. Conf. on Speech Prosody, pp. 1-10, Nara, Japan.

[5] Pfitzinger, H. R. (1998). Local speech rate as a combination of syllable and phone rate. In Proc. of ICSLP '98, vol. 3, pp. 1087-1090, Sydney.

[6] Campbell, W. N. (2000). Timing in speech: A multi-level process. In Horne, M., ed., Prosody: Theory and experiment. Studies presented to Gösta Bruce, pp. 281-334. Kluwer Academic Publishers, Dordrecht.

[7] Pfitzinger, H. R.; Reichel, U. D. (2006). Text-based and signal-based prediction of break indices and pause durations. In Proc. of the 3rd Int. Conf. on Speech Prosody, Dresden, Germany.

[8] Mokhtari, P.; Pfitzinger, H. R.; Ishi, C. T. (2003). Principal components of glottal waveforms: Towards parameterisation and manipulation of laryngeal voice quality. In Proc. of the ISCA Tutorial and Research Workshop on Voice Quality: Functions, Analysis and Synthesis (Voqual'03), pp. 133-138, Geneva.

[9] Laver, J. (1980). The phonetic description of voice quality. Cambridge University Press, Cambridge.

[10] Lindblom, B. E. F. (1990). Explaining phonetic variation: A sketch of the HH theory. In Hardcastle, W. J.; Marchal, A., eds., Speech production and speech modelling, Nr. 55 in Nato ASI series D: Behavioural and social sciences, pp. 403-439. Kluwer Academic Publishers, Dordrecht, Boston, London.

[11] Bissiri, M. P.; Pfitzinger, H. R.; Tillmann, H. G. (2006). Lexical stress training of German compounds through resynthesis and emphasis. Accepted for Proc. of InSTIL Workshop at CALICO, Hawaii.

 
Keynote Speaker 3 - Thursday, May 4, 09:00 - 09:45 3 of 3

Chiu-yu Tseng; Institute of Linguistics, Academia Sinica, Taiwan

Fluent Speech Prosody and Discourse Organization: Evidence of Top-down Governing and Implications to Speech Technology

Extended Abstract:
Both linguists and engineers ask questions about language and speech, but their concerns differ. Although both communities look for what makes up communication, linguists look for what constitutes the abstract linguistic system in the human mind and brain, while engineers look for ways to model and simulate speech for technology implementation. What if the question addressed is fluent speech of Mandarin Chinese, and the answers are to satisfy both linguists and engineers? Paraphrased, the question then becomes what there is to be studied in addition to lexical tones and intonation for the linguists, and how fluent speech prosody could be simulated in addition to adding up tones and intonations for the engineers. Trying boldly to bring answers to both communities, we decided first to adopt a corpus approach to phonetic studies, an attempt to remedy the traditional phonetic approach by looking at more samples. To ensure the corpora contain fluent prosody information, we collected narratives of read discourses rather than canonical phrases. A total of 9 sets of speech corpora with different prosodic features were recorded over a decade (http://www.myet.com/COSPRO). We then designed a perceptually based annotation system that emphasized boundary information and boundary breaks, and manually labeled the corpora. The annotation consistently identified multiple-phrase speech paragraphs and various kinds of prosodic units within them. We studied the acoustic phonetic correlates of the annotated paragraphs, units and boundary breaks in detail, and through quantitative analyses, found systematic cross-phrase patterns in every acoustic parameter for each unit identified. That is, F0 contours, syllable duration patterns, intensity distribution patterns, and on top of these, systematic boundary information and boundary breaks are found across phrases. These patterns hold not only across speakers but also across speaking rates.
It became obvious that what constitutes fluency is neither in the tonal realization of each syllable, nor in the individual phrase intonation, but rather, in the association between and among intonation phrases (IP). The association came from higher up governing from the discourse. What these associations or associative prosodic relationships reflect is mainly governing from top-down. A framework of the multi-phrase hierarchy is subsequently constructed to account for fluent speech prosody. The term Prosodic Phrase Grouping (PG) was proposed for the framework to denote how intonation phrases (IP) were grouped to form a higher and larger prosodic unit; a unit that roughly corresponds to speech paragraphs in narratives or spoken discourses. Central to the framework is the notion that individual phrasal intonations are subjacent sister constituents subject to higher level constraints that specify layered modifications at each prosodic level; while ultimate output fluent prosody is achieved by adding up contributions from each prosodic layer. From our data analyses, we were able to show just how cumulative modifications account for the overall patterns in fluent speech, in particular, syllable duration as well as boundary pause patterns (Tseng et al., 2005). Subsequently, we were able to derive acoustic templates for each prosodic unit in the framework, namely, templates for global F0 contours, syllable durations and intensity distribution. These templates facilitated constructing a modular model of multiple-phrase grouping with 4 corresponding acoustic modules for speech synthesis applications.

By the same logic, we also view spoken discourse prosody as yet another higher node that groups PGs into sister constituents. Our more recent work aims to establish discourse prosody organization from the PG upward. Again looking at the larger picture, we studied relative F0 range narrowing vs. widening as well as F0 resets across PGs and boundaries. So far we have found two types of prosodic links that involve F0 narrowing and subsequent F0 reset. One type of F0 narrowing is duration triggered and redundant, which we term Prosodic Fillers (PF); another is lexically and/or syntactically triggered and obligatory, which we term Discourse Markers (DM). These two links appear to be a major source of melodic and rhythmic variation in output prosody. They also turned out to be predictable from text analyses.

In summary, what the prosodic specifications discussed above revealed is essentially the global overall relative prosodic relationships across phrases in fluent speech; what they reflected is top-down governing of semantic constraints from the discourse and cognitive constraints from the speaker. All of them are crucial to on-line speech planning and processing of discourse information. We argue that any prosody framework of fluent speech should include top-down information, specify how intonation phrases are formed, and take into consideration perceptual effects on on-line processing. Moreover, how discourse prosody is organized deserves further attention. Technology developments could serve as the best testing ground for these findings. As for a tone language such as Mandarin Chinese, in addition to syllable tones and phrase intonations, there also exists a cross-phrase melody, rhythm and loudness pattern necessary to form its fluent speech prosody. We believe these non-tonal aspects not only bear cross-linguistic significance, but also merit more attention in studies of tone languages in general.

 
Special Session 1 (SPS 1) Prosody and Affective Computing
Organizers: Noam Amir, Nick Campbell and Jianhua Tao
Tuesday, May 2, 11:10 - 13:10
Special Session 1: Prosody and Affective Computing 1 of 6

The Prosody of Pet Robot Directed Speech: Evidence from Children

AUTHOR(S):
Batliner, Anton; Chair for Pattern Recognition, University of Erlangen-Nuremberg
Biersack, Sonja; Department of Psychology, University of Stirling
Steidl, Stefan; Chair for Pattern Recognition, University of Erlangen-Nuremberg

Abstract:
In this paper, we present a database with emotional children's speech in a human-robot scenario: the children were giving instructions to Sony's pet robot dog AIBO, with AIBO showing both obedient and disobedient behaviour. In such a scenario, a specific type of partner-centered interaction can be observed. We aimed at finding prosodic correlates of children's emotional speech and were interested to see which speech registers children use when talking to AIBO. For interpretation, we left the weighting and categorization of prosodic features to a statistical classifier. The parameters found to be most important were word duration, average energy, variation in pitch and energy, and harmonics-to-noise ratio. The data moreover suggest that the children used a register that resembled mostly child-directed and pet-directed speech and to some extent computer-directed speech.

 
Special Session 1: Prosody and Affective Computing 2 of 6

Modelling personality features by changing prosody in synthetic speech

AUTHOR(S):
Trouvain, Jürgen; Phonetik-Büro Trouvain, Saarbrücken & Institute of Phonetics, Saarland University
Schmidt, Sarah; Institute of Computer Science, Saarland University
Schröder, Marc; DFKI GmbH, Saarbrücken
Schmitz, Michael; Institute of Computer Science, Saarland University
Barry, William J.; Institute of Phonetics, Saarland University

Abstract:
This study explores how features of brand personalities can be modelled with the prosodic parameters pitch level, pitch range, articulation rate and loudness. Experiments with parametrical diphone synthesis showed that listeners rated the prosodically changed versions better than a baseline version for the dimensions "sincerity", "competence", "sophistication", "excitement" and "ruggedness". The contributions of prosodic features such as lower pitch and an enlarged pitch range are analyzed and discussed.

 
Special Session 1: Prosody and Affective Computing 3 of 6

Modeling Emotion Expression and Perception Behavior in Auditive Emotion Evaluation

AUTHOR(S):
Grimm, Michael; Universität Karlsruhe (TH), Karlsruhe
Kroschel, Kristian; Universität Karlsruhe (TH), Karlsruhe
Narayanan, Shrikanth; University of Southern California (USC), Los Angeles

Abstract:
In this paper, we consider both speaker dependent and listener dependent aspects in the assessment of emotions in speech. We model the speaker dependencies in emotional speech production by two parameters, Emotion Expression Bias and Emotion Expression Amplification. Similarly, we model the listener's emotion perception behavior by a simple parametric model, the correlation with the mean value of all evaluators. These models form a basis for improving current automatic emotion recognition schemes. An emotional speech database of the four emotion categories angry, happy, neutral, and sad was evaluated on three emotion primitives, valence, activation, and dominance. The assessment results were used to analyze the variations of the class centroids in the 3D emotion space as a function of speaker and listener. We found that the models are simple and efficient for describing individual emotion expression styles and emotion perception behavior in speech.

 
Special Session 1: Prosody and Affective Computing 4 of 6

Perception of Non-Verbal Emotional Listener Feedback

AUTHOR(S):
Schröder, Marc; DFKI GmbH
Heylen, Dirk; University of Twente
Poggi, Isabella; University of Rome

Abstract:
This paper reports on a listening test assessing the perception of short non-verbal emotional vocalisations emitted by a listener as feedback to the speaker. We clarify the concepts of backchannel and feedback, and investigate the use of affect bursts as a means of giving emotional feedback via the backchannel. Experiments with German and Dutch subjects confirm that the recognition of emotion from affect bursts in a dialogical context is similar to their perception in isolation. We also investigate the acceptability of affect bursts when used as listener feedback. Acceptability appears to be linked to display rules for emotion expression. While many ratings were similar between Dutch and German listeners, a number of clear differences was found, suggesting language-specific affect bursts.

 
Special Session 1: Prosody and Affective Computing 5 of 6

Expressive Speech Synthesis: Evaluation of a Voice Quality Centered Coder on the Different Acoustic Dimensions

AUTHOR(S):
Audibert, Nicolas; ICP
Vincent, Damien; France Telecom, R&D Division
Aubergé, Véronique; ICP
Rosec, Olivier; France Telecom, R&D Division

Abstract:
Expressive speech is intrinsically multi-dimensional. Each acoustic dimension has specific weights depending on the nature of the expressed affects. The quantity of information carried by each dimension separately, as well as the processing implied to carry it, has been perceptively measured for a set of natural mono-syllabic utterances. It has been shown that no parameter alone is able to carry the whole emotion information. These stimuli (anxiety, disappointment, disgust, disquiet, joy, resignation, sadness) were resynthesized with an LF-ARX algorithm, and evaluated in the same perceptive protocol extended to the VQ parameters (source, filter and residue). The comparison of results between natural, TD-PSOLA resynthesized and LF-ARX resynthesized stimuli (1) globally confirms the relative weights of each dimension, (2) diagnoses local minor artifacts of resynthesis, (3) validates the efficiency of the LF-ARX algorithm, and (4) measures the relative importance of each of the LF-ARX parameters.

 
Special Session 1: Prosody and Affective Computing 6 of 6

On the Structure of Spoken Language

AUTHOR(S):
Campbell, Nick; Advanced Telecommunications Research Institute, Kyoto

Abstract:
The special structure of spoken language is often described as "ill-formed" but this paper shows that it is ideally suited to the simultaneous expression of (a) propositional content (i. e., linguistic information) and (b) speaker-state, discourse management cues, and speaker-listener-relationships (i. e., affective information). This paper shows that by the frequent insertion of so-called "fillers" and other repetitive fragments, the speaker provides the listener with constant reference points for evaluating affective states as displayed by voice-quality information.

 
Poster Session 1 (PS 1) Prosody and Speech Perception
Tuesday, May 2, 14:30 - 16:00
Chair: Anne Cutler
Poster Session 1: Prosody and Speech Perception 1 of 24

Timing in News and Weather Forecasts: Implications for Perception

AUTHOR(S):
Shevchenko, Tatiana; Moscow State Linguistic University
Uglova, Natalia; Moscow State Linguistic University

Abstract:
This paper addresses the problem of prosodic text organization in the situation of severe time limits for TV information programs. It is a search for techniques used to make a compromise between temporal constraints and the demand for distinctiveness of speech targeted at a mass audience. Tempo and pitch characteristics of American TV news and weather forecasts (9 items from 4 stations, total time 10 min) are explored with reference to genre, region and gender of newsreaders. Combinations of features account for native speakers' perception of speech as 'fast' or 'too fast'. The diagnostic parameters are: length of uninterrupted speech units, types of pauses, number of accents per unit, accented and unaccented syllable length, F0 max and F0 intervals in key words and units. The data obtained, when compared to previous research results on interviews, reading, public speaking and spontaneous talk, revealed phonation/pause time ratio to be most relevant.

 
Poster Session 1: Prosody and Speech Perception 2 of 24

Identification of language and accent through visual speech

AUTHOR(S):
Irwin, Amy; Institute of Hearing Research
Thomas, Sharon; Institute of Hearing Research

Abstract:
Facial movements can be utilised in the processing of visual speech and form the basis of speechreading. However, the production of speech by different talkers can be variable; physiology, accent and speech rate can all change the appearance of the visual signal. The focus of this report is an investigation into the effects of language and accent variation on speechreading, an area previously lacking in systematic research. Results from two experiments indicate, firstly, that the visual differences between French and English (both accent and language) can be discriminated through visual speech. Secondly, in a comparison of speechreading performance, sentences produced using a French accent were found to be significantly more difficult to speechread by English observers than those produced in an English accent. This research indicates the importance of further study into the effects of accent on speechreading.

 
Poster Session 1: Prosody and Speech Perception 3 of 24

Dialect identification through prosodic information: an experimental approach

AUTHOR(S):
Dimou, Athanassia; Université Paris 7
Chalamandaris, Aimilios; ILSP

Abstract:
The purpose of this paper is to investigate whether native Greek adults can identify their mother tongue from synthesized stimuli which contain only prosodic - melodic and rhythmic - information. More specifically, we are trying to investigate whether Greek native speakers are able to discriminate their mother dialect from another dialect, also from Greece, on the basis of prosodic information only. In the first section we present the main idea behind our work, in the second section we present the procedure we followed in order to complete this pilot study, while in the two final sections one can find the results and the conclusions of our experiments.

 
Poster Session 1: Prosody and Speech Perception 4 of 24

Fake geminates in French: a production and perception study

AUTHOR(S):
Meisenburg, Trudel; University of Osnabrück

Abstract:
This paper examines the role of consonantal quantity from Latin to the Romance languages, concentrating on the situation in contemporary French, where fake or apparent geminates quite frequently arise in morpheme concatenation, often as a consequence of schwa deletion. A series of production and perception experiments shows that the required surface contrasts are neither represented nor identified consistently; rather, speakers show a tendency to delete geminates in favor of a simplified syllable structure but at the cost of morpheme identity.

 
Poster Session 1: Prosody and Speech Perception 5 of 24

Interpretation - Perception - Analysis

AUTHOR(S):
Dohalská-Zichová, Marie; Institute of Phonetics, Charles University in Prague
Škardová, Radka; Institute of Phonetics, Charles University in Prague

Abstract:
The aim of this experiment was to determine, via perception tests, how two phonetic groups (i.e. the French and Czechs with proficient knowledge of French) and two non-phonetic control groups of listeners perceive the differences in the individual prosodic demonstration of two types of artistic interpretations of the poem "Mon rêve familier". At the same time the aim was to compare and contrast subjective perceptual levels with objective measurements of F0, intensity and time values. If we take into account the fractional representation and the importance of individual values for accent perception, then we can conclude that both the French and the Czechs consider the T value as the crucial value; the second place in terms of importance differs: for the Czechs it is intensity followed by frequency (T-I-F0), whereas for the French it is frequency followed by intensity (T-F0-I).

 
Poster Session 1: Prosody and Speech Perception 6 of 24

Perception of Anger in French as Foreign Language: Experimental Protocol and Preliminary Results

AUTHOR(S):
Mathon, Catherine; EA333 "Atelier de Recherches sur la Parole"
de Abreu, Sophie; EA333 "Atelier de Recherches sur la Parole"
Perekopska, Daniela; EA333 "Atelier de Recherches sur la Parole"

Abstract:
Learners of a foreign language need to perceive the emotions of their interlocutor. They also need to be able to reproduce an emotion in a satisfactory prosodic pattern in the foreign language. Otherwise, the communication will fail. Our project deals with 3 main questions: How are emotions perceived in a foreign language? Will a learner be able to reproduce such an emotion and how? How will these (re)productions be recognized by native speakers? We first concentrated on the study of the emotion called Anger. This paper aims to show whether prosody provides enough information to allow students of French as a Foreign Language (FFL) to recognize this emotion. The perceptual test presented here is original because of the use of a spontaneous corpus of French containing real emotions. We focus here on the first stage of our research: the results of the perception of anger by Czech and Portuguese speakers. We emphasize the methodology as well as the experimental protocol of our work.

 
Poster Session 1: Prosody and Speech Perception 7 of 24

Exploring Expressive Speech Space in an Audio-book

AUTHOR(S):
Wang, Lijuan; Dept. of Electronic Engineering, Tsinghua University, Beijing
Zhao, Yong; Microsoft Research Asia, Beijing
Chu, Min; Microsoft Research Asia, Beijing
Chen, Yining; Microsoft Research Asia, Beijing
Soong, Frank; Microsoft Research Asia, Beijing
Cao, Zhigang; Dept. of Electronic Engineering, Tsinghua University, Beijing

Abstract:
In this paper, an audio-book, in which a professional voice talent performs multiple characters, is exploited to investigate the expressiveness of speech. The expressive speech space of the sole speaker is explored by finding the distances between acoustic models of multiple characters and the perceived proximity between their speech utterances. Using the speech of ten characters as test data, the character confusion is evaluated in both acoustic and perceptual spaces. We find that the average precision to differentiate one character from the others is 81.7 % in the acoustic space and 72.6 % in the perceptual space. It is interesting that the objective measure outperforms the subjective measure. Furthermore, the acoustic distance measured by normalized Kullback-Leibler divergence (NKLD) between two characters is highly correlated with the perceptual distance with correlation coefficient 0.814. Therefore, NKLD can objectively measure the perceptual similarity between groups of utterances.
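The abstract does not specify how the Kullback-Leibler divergence is normalized or what form the acoustic models take. Purely as an illustration of the underlying distance, the sketch below computes the closed-form KL divergence between two univariate Gaussians and a symmetric variant; the function names and the single-Gaussian model are assumptions for this example and should not be read as the authors' NKLD.

```python
import math

def kl_gauss(mu_p, sd_p, mu_q, sd_q):
    """KL(p || q) for univariate Gaussians p = N(mu_p, sd_p^2), q = N(mu_q, sd_q^2)."""
    if sd_p <= 0 or sd_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (math.log(sd_q / sd_p)
            + (sd_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sd_q ** 2)
            - 0.5)

def symmetric_kl(mu_p, sd_p, mu_q, sd_q):
    """Sum of both directed divergences, so the result does not depend on argument order."""
    return kl_gauss(mu_p, sd_p, mu_q, sd_q) + kl_gauss(mu_q, sd_q, mu_p, sd_p)
```

A symmetric form is the natural choice when the result is compared with a perceptual distance, since listeners' similarity judgements between two characters have no inherent direction.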

 
Poster Session 1: Prosody and Speech Perception 8 of 24

A Comparative Study of Sentential Stress Distribution in Mandarin Multi-Style Speeches

AUTHOR(S):
Bao, Mingzhen; University of Florida
Chu, Min; Microsoft Research Asia

Abstract:
This paper compares the distribution of sentential stresses among three speaking styles: Lyric, Critical, and Explanatory; and extends our previous study at the base phrase level to the sentence construction level and the prosodic word level. The results show that 1) the distributions of both rhythmic and semantic stresses behave the same across styles within prosodic words; 2) at the sentence construction level, the distribution of rhythmic stress is quite similar across the three styles, while semantic stress presents more diversity among speaking styles. The Explanatory style shares a similar tendency with the Neutral style. The Lyric and Critical styles differ from the Neutral style in subject-predicate, predicate-object, adjunct-subject, and adjunct-object constructions. Generally, speaking styles have less effect on rhythmic stress distribution than on semantic stress. Such effects are more obvious at the sentence construction and base phrase levels than at the prosodic word level.

 
Poster Session 1: Prosody and Speech Perception 9 of 24

Reliable Prominence Identification in English Spontaneous Speech

AUTHOR(S):
Tamburini, Fabio; DSLO - University of Bologna

Abstract:
This paper presents a follow up of a study on the automatic detection of prosodic prominence in spontaneous speech. Prosodic prominence involves two different prosodic features, pitch accent and stress, that are typically based on four acoustic parameters: fundamental frequency (F0) movements, overall syllable energy, syllable nuclei duration and mid-to-high-frequency emphasis. A careful measurement of these acoustic parameters makes it possible to build an automatic system capable of identifying prominent syllables in utterances with performance comparable with the inter-human agreement reported in the literature even when tested on spontaneous speech.

 
Poster Session 1: Prosody and Speech Perception 10 of 24

Form and Function of Falling Pitch Contours in English

AUTHOR(S):
Kleber, Felicitas; Institute of Phonetics and Digital Speech Processing (IPDS), Christian-Albrechts-University at Kiel

Abstract:
This paper presents the results of a set of perception experiments concerning the phonological status of early, medial and late F0 peak synchronization in English and the nature of the contrast between these categories. By means of one identification and two discrimination tasks, it has been shown that subjects perceive a categorical-like change when the F0 maximum of a peak is shifted into the stressed vowel and a gradual change when the F0 maximum is moved into the following unstressed vowel. Therefore, we conclude that the early peak constitutes a phonological category as opposed to medial peaks; late peaks form a phonetic continuum.

 
Poster Session 1: Prosody and Speech Perception 11 of 24

Relevance of F0 peak shape and alignment for the perception of a functional contrast in Russian

AUTHOR(S):
Rathcke, Tamara; Institute of Phonetics and Digital Speech Processing

Abstract:
This paper reports a perception experiment carried out to investigate the perceptually relevant properties of yes/no-questions and contrastive emphasis in modern Russian spoken by young people in Kaliningrad. Only melodic cues were involved in the test stimuli such as alignment and shape of F0 peaks as well as presence of a peak plateau. A semantic congruity test was performed to investigate these form-function relations. Results indicate that peak alignment is the strongest cue for the perceptual distinction of the investigated categories. Contour shape (including plateau property) serves as a secondary cue, whereas the effect of a plateau seems to be very small. Results are discussed in terms of phonological modeling of Russian intonation based on an experimental approach including the investigation of intonational forms in relation to linguistic functions.

 
Poster Session 1: Prosody and Speech Perception 12 of 24

Categorical Perception of intonational contrasts in European Portuguese

AUTHOR(S):
Falé, Isabel; Onset - CEL, Lab. Psicoling, DLGR, FLUL; Universidade Aberta
Faria, Isabel Hub; Onset - CEL, Lab. Psicoling, DLGR, FLUL

Abstract:
The European Portuguese (EP) intonational contrast between statement and question contours was tested in a paradigm based on Categorical Perception. From 2 natural sentences, one produced by a male speaker and the other by a female speaker, one multi-step continuum per sentence was created, from declarative to question contour, through acoustic manipulation (PSOLA). The continua were presented to 20 EP listeners, who performed two tasks: an identification task and a discrimination task. For the identification test, subjects had to categorize each presented stimulus. In addition to response data, reaction times of the identification task were also collected. Experimental design and procedures were developed with E-Prime. Identification results confirmed that the contrast is indeed categorical. However, identification reaction time measurements point to continuous rather than categorical perception. The absence of a consistent peak of discrimination at the crossover between categories supports the continuous perception view.

 
Poster Session 1: Prosody and Speech Perception 13 of 24

Secondary stress in Brazilian Portuguese: the interplay between production and perception studies

AUTHOR(S):
Arantes, Pablo; State University of Campinas
Barbosa, Plinio; State University of Campinas

Abstract:
This paper reports experiments on speech production showing that secondary stress in Brazilian Portuguese (BP) can be best described as phrase-initial prominence cued by greater duration and pitch accent excursion in initial position. It also reports a perception experiment in which clicks were associated with consecutive V-to-V positions in stress groups. Mean click detection RTs are gradient, but show no influence of initial lengthening. RTs near the phrasally stressed position are shorter and almost 60 % of RT variance can be accounted for by produced timing patterns.

 
Poster Session 1: Prosody and Speech Perception 14 of 24

Perception of Cantonese level tones influenced by context position

AUTHOR(S):
Zheng, Hongying; City University of Hong Kong
Peng, Gang; The Chinese University of Hong Kong
Tsang, Peter W-M.; City University of Hong Kong
Wang, William S-Y.; The Chinese University of Hong Kong

Abstract:
When humans perceive speech sounds, they categorize the sounds into one or another phoneme category. Perception of speech sound depends on context. Previous studies on categorical perception of lexical tones were mainly done in an absolute manner without context. In these experiments we explore the influence of context on the categorical perception of lexical tones. In particular, we ask whether the position of the context with respect to the target syllable influences the categoricalness of the perception. Two experiments on natural and synthesized speech both show that categorical boundaries of identification curves are sharper when the context is to the right of the target syllable than when the context is to the left of the target syllable. Moreover, steeper peaks are obtained in the discrimination curve from right context continuum. They agree with and enhance the identification results. Explanations of the phenomenon are suggested in the paper.

 
Poster Session 1: Prosody and Speech Perception 15 of 24

Perception of Isolated Tone2 words in Mandarin Chinese

AUTHOR(S):
Xu, Lei; Linguistics, Ohio State University, Columbus
Speer, Shari R.; Linguistics, Ohio State University, Columbus

Abstract:
Many tone3 words in Mandarin undergo "third tone sandhi" - a phonological rule that changes the first tone3 word in a tone3+tone3 sequence to a tone2 word. Spoken tone2 words that have tone3 counterparts are thus ambiguous. A cross-modal priming experiment examined lexical tone processing during word recognition. Participants saw Chinese characters of 4 kinds: identical, different-only-in-tone, irrelevant to the auditory word, or nonword. Visual targets were preceded by auditory primes of 4 types: tone2 word with tone3 counterpart, tone2 word w/out tone3 counterpart, tone3 word with tone2 counterpart, or tone3 word w/out tone2 counterpart. RTs were longer for tone2 words with tone3 counterparts than for tone2 words w/out tone3 counterparts, while RTs to tone3 words with or w/out tone2 counterparts did not differ. Results suggest integration of tonal and segmental information during word recognition, without recourse to a separable "tonal level".

 
Poster Session 1: Prosody and Speech Perception 16 of 24

Perception of L2 Tones: L1 Lexical Tone Experience May Not Help

AUTHOR(S):
Wang, Xinchun; California State University, Fresno

Abstract:
This study investigates whether adult L2 learners' experience with lexical tones and pitch accent in their first language facilitates the acquisition of L2 lexical tones. Three groups of beginning learners of Mandarin with different L1 prosodic experience participated as listeners in a perception test on the four Mandarin tones: native speakers of Hmong (a tone language), Japanese (a pitch accent language), and English (a non-tone, non-pitch accent language). Results showed that native English listeners performed as well as native Japanese listeners, but native Hmong speakers performed significantly worse than both groups in perceptual accuracy of Mandarin tones. The findings suggest that experience with lexical tones and pitch accent may not always facilitate learning. The lack of exact mapping of L2 tones onto L1 tones may interfere with the acquisition of nonnative tones, especially at the initial stage of learning.

 
Poster Session 1: Prosody and Speech Perception 17 of 24

Lexical Accent Status Affects Perceived Prominence of Intonational Peaks in Japanese

AUTHOR(S):
Shinya, Takahito; University of Massachusetts, Amherst

Abstract:
This study shows that lexical accent status affects perceived prominence of fundamental frequency (F0) peaks in Japanese. In Japanese, word accent type can be identified from two different sources: lexical accent status and phonetic F0 contour shape. This study examines whether listeners compensate for the accentual boost of an accented word based only on the word's lexical accent status, when no F0 contour information is available. A perceptual experiment was conducted in which participants judged the relative prominence between two F0 peaks. The experiment showed that, for a given second F0 peak height, the first F0 peak had to be higher when the first word was lexically accented than when it was lexically unaccented in order for the two words to be equal in perceived prominence. This suggests that the accentual boost of an accented word is subtracted in perception. It is concluded that lexical accent status as phonological knowledge affects perceived prominence of F0 peaks.

 
Poster Session 1: Prosody and Speech Perception 18 of 24

The recognition of Japanese-accented and unaccented English words by Japanese listeners

AUTHOR(S):
Yoneyama, Kiyoko; Daito Bunka University

Abstract:
This study investigated whether Japanese listeners learning English employ two types of lexical information (word frequency and neighborhood density) when they recognize English words. English words recorded by a native speaker of English and a native speaker of Japanese were presented to Japanese university students in a noise condition. The word recognition scores showed that Japanese listeners employed both lexical and pre-lexical levels of information in English word recognition. They were sensitive to both probabilistic phonotactics (bottom-up acoustic information) and word frequency (lexical information). A strong correlation between probabilistic phonotactics and neighborhood density still predicts that Japanese listeners are influenced by neighborhood density in English word recognition.

 
Poster Session 1: Prosody and Speech Perception 19 of 24

The contribution of silent pauses to the perception of prosodic boundaries in Korean read speech.

AUTHOR(S):
Hirst, Daniel; CNRS, UMR 6057

Abstract:
This paper discusses the importance of silent pauses in the perception of prosodic boundaries in Korean speech. It is suggested that in speech in general, and in particular in spontaneous speech, silent pauses are neither necessary nor sufficient for the perception of prosodic boundaries. In read speech, however, there is a high correlation between the presence of a pause and the perception of a boundary. An experiment was carried out to determine whether removing the silent boundary from an extract of speech had a significant effect on the perception of boundaries in Korean read speech. Results suggest that while the presence of a silent boundary slightly reinforces the perception of a prosodic boundary, subjects are in general capable of perceiving the boundary without the silent pause.

 
Poster Session 1: Prosody and Speech Perception 20 of 24

The perception of intended speech rate in English, French, and German by French listeners

AUTHOR(S):
Dellwo, Volker; Dept. of Phonetics and Linguistics, University College London
Ferragne, Emmanuel; Laboratoire Dynamique du Langage, Univ. Lyon 2
Pellegrino, Francois; Laboratoire Dynamique du Langage, Univ. Lyon 2

Abstract:
Speakers are able to produce speech at different intended rates when prompted to do so. The question addressed in the present research is to what degree different intended rate categories are perceptually relevant when objective measures of speech rate (e.g. syllables/second) are variable and to what degree listeners are able to identify intended speech rates in languages other than their native language. Initial results from an experiment with French listeners rating speech rates in French, German, and English show that, despite varying objective speech rates, listeners are well able to identify intended speech rate across different languages.

 
Poster Session 1: Prosody and Speech Perception 21 of 24

Comparing Perceptual Local Speech Rate of German and Japanese Speech

AUTHOR(S):
Pfitzinger, Hartmut R.; Institute of Phonetics and Speech Communication, University of Munich
Tamashima, Miyuki; Institute of Phonetics and Speech Communication, University of Munich

Abstract:
Possibly everybody who listens to people talking to each other in an unknown language gains the impression that they are speaking very fast. To test the effect of language background on perceptual local speech rate (PLSR) we conducted a fully symmetrical perception experiment in which two groups with different language backgrounds judge the speech rates of stimuli from both languages. 160 short German and Japanese speech stimuli are judged by 40 German and Japanese subjects. Japanese listeners overshoot German speech rate by 7.5 % on a PLSR scale and German listeners overshoot Japanese speech rate by 9.1 %. An explanation is that unknown languages appear to be spoken faster because listeners are unable to identify and attenuate redundant features of the unknown speech and, at the same time, they unconsciously insert additional phonetic items to reduce the mismatch between the large number of recognized phonetic items and the phonotactic structure of their native languages.

 
Poster Session 1: Prosody and Speech Perception 22 of 24

The Role of the Accented-Vowel Onset in the Perception of German Early and Medial Peaks

AUTHOR(S):
Niebuhr, Oliver; Institut für Phonetik und digitale Sprachverarbeitung, Christian-Albrechts-Universität Kiel

Abstract:
Starting from a series of speech stimuli representing an F0 peak shift continuum from German early to medial peak, a series of non-speech stimuli is created. These non-speech stimuli show the F0 and intensity courses of the original speech stimuli, but with a constant formant structure. The results of a perception experiment reveal that the organisation of the peak shift continuum found for the identification of early and medial peaks in the speech stimuli can be replicated by the non-speech stimuli, indicating that early and medial peaks are signalled by an interplay of the F0 and intensity courses without reference to the spectral change at the accented-vowel onset.

 
Poster Session 1: Prosody and Speech Perception 23 of 24

Clause position within a sentence: human vs. machine recognition

AUTHOR(S):
Palková, Zdena; Institute of Phonetics, Charles University in Prague
Volín, Jan; Institute of Phonetics, Charles University in Prague

Abstract:
The paper presents a combined experiment in which recognition of a prosodic phrase position within a larger syntactic structure by human listeners is confronted with recognition by artificial neural networks. Apart from the success rate we are predominantly interested in similarities in the error pattern of the two recognition modes. The results suggest that the automatic recognition could help to determine which of the selected parameters are relevant for human listeners, since it provides linguistically interpretable outcome.

 
Poster Session 1: Prosody and Speech Perception 24 of 24

Lateralized processing in human auditory cortex during the perception of emotional prosody

AUTHOR(S):
Wendt, Beate; Leibniz-Institute for Neurobiology
Brechmann, André; Leibniz-Institute for Neurobiology
Gaschler-Markefski, Birgit; Leibniz-Institute for Neurobiology
Scheich, Henning; Leibniz-Institute for Neurobiology
Ackermann, Hermann; University of Tübingen

Abstract:
The aim of the present fMRI study was to investigate the influence of different word prosodies on the activation of the auditory cortex. Pseudowords and semantically neutral words were presented with neutral prosody in experiment I and with emotional prosodies in experiment II. In both studies there was a left-lateralized activation for speech perception on the planum temporale. In our experiments the emotional information was task-irrelevant and even distracted from the lexical task. The performance in the detection of words and pseudowords was significantly better in the prosodically neutral condition. Thus, the current results contribute to the clarification of the controversial issue of whether prosodies lateralize brain activation to the right, i.e. if lexical rather than prosodic information is in the focus of a task involving prosodic speech material, a right hemisphere dominance cannot be expected. Future experiments with prosody identification tasks will extend these findings.

 
Poster Session 2 (PS 2) Analysis and Formulation of Prosody
Tuesday, May 2, 14:30 - 16:00
Chair: Daniel Hirst
Poster Session 2: Analysis and Formulation of Prosody 1 of 22

Phonetics vs. phonology in Tamil wh-questions

AUTHOR(S):
Keane, Elinor; Christ Church, Oxford & Oxford University Phonetics Laboratory

Abstract:
Wh-questions in Tamil are not distinguished from declarative utterances by either pitch accent type or boundary tone. Acoustic analysis of data from 18 speakers comparing wh-questions with corresponding declaratives revealed that the lexical marking of interrogativity is nevertheless accompanied by differences in intonation. The most consistent result was raising of f0 peaks in question words and in a majority of speakers, including all the females, sentence offset f0 was significantly higher in questions. This tended to be accompanied by lowering of f0 peaks following question words, resulting in some compression of the pitch register. In marking interrogativity Tamil thus manipulates gradient phonetic parameters, adding further fuel to the debate about whether such parameters can directly signal linguistic information or are mediated via some elaborated phonological representation.

 
Poster Session 2: Analysis and Formulation of Prosody 2 of 22

Empirical Validation of Hand-labelled Nuclear Accent Patterns

AUTHOR(S):
Grabe, Esther; Phonetics Laboratory, University of Oxford
Kochanski, Greg; Phonetics Laboratory, University of Oxford
Coleman, John; Phonetics Laboratory, University of Oxford

Abstract:
In this paper, we explore the interface between intonational phonology and speech technology, in search of bridges between the disciplines. In a corpus containing speech data from seven dialects of English, we hand-labelled over 700 nuclear accents and identified seven accent types. Then we used four-term mathematical models to describe the fundamental frequency patterns associated with the accents. A statistical analysis showed that the models for six of the seven accents differed significantly from each other. Our hand-labels were associated with consistently different f0 patterns. Our approach bridges the gap between intonational phonology and speech technology. It provides quantitative, empirically testable models of intonation labels that can be implemented in applications.

 
Poster Session 2: Analysis and Formulation of Prosody 3 of 22

Phonologies and Phonetics of French Prosody

AUTHOR(S):
Martin, Philippe; Université Paris 7 Denis Diderot

Abstract:
Studies on French intonation are quite diversified, to the point where, looking at the descriptive results, one might wonder if all researchers did analyze the same language. Remarkable prosodic characteristics found in one study are not retrieved in another, and different theoretical approaches give very different insights on data, despite very similar experimental material. In this paper we attempt to highlight some converging aspects of two types of intonation linguistic description on French, developed one in the Autosegmental-Metrical framework and the other with a phonosyntactic point of view. In particular, the contrast of melodic slope may be totally hidden with one approach, and appear as the main characteristic of French intonation with the other.

 
Poster Session 2: Analysis and Formulation of Prosody 4 of 22

Text-based and Signal-based Prediction of Break Indices and Pause Durations

AUTHOR(S):
Pfitzinger, Hartmut R.; Institute of Phonetics and Speech Communication, University of Munich
Reichel, Uwe D.; Institute of Phonetics and Speech Communication, University of Munich

Abstract:
The relation between symbolic and signal features of prosodic boundaries is experimentally studied using prediction methods. Text-based break index prediction turns out to be fairly good, but signal-based prediction and pause duration prediction perform worse. A possible reason is that random signal feature variations, as usually produced by humans, are hard to predict.

 
Poster Session 2: Analysis and Formulation of Prosody 5 of 22

Analysis of Polish Segmental Duration with CART

AUTHOR(S):
Breuer, Stefan; Institute of Communication Sciences, University of Bonn
Francuzik, Katarzyna; Institute of Linguistics, Adam Mickiewicz University Poznań
Demenko, Grażyna; Institute of Linguistics, Adam Mickiewicz University Poznań

Abstract:
Segmental duration was investigated in a database of Polish read speech (from one male speaker). The material was labeled automatically and then manually verified. The dependence of phone duration on a set of features was examined with the CART algorithm. The duration phenomena were analyzed in relation to syllable, foot and phrase structure. The results showed the need for segmental as well as suprasegmental modeling in the analysis of segmental duration.

 
Poster Session 2: Analysis and Formulation of Prosody 6 of 22

The stylization of intonation contours

AUTHOR(S):
Demenko, Grażyna; Institute of Linguistics, Adam Mickiewicz University Poznań
Wagner, Agnieszka; Institute of Linguistics, Adam Mickiewicz University Poznań

Abstract:
This paper presents the stylization of intonation contours and clustering of F0 movements on accented and post-accented syllables based on annotated speech corpora. Special software - PitchLine - has been developed to enable the flexible quasi-automatic segmentation and parametrization of intonation curves. The experimental material obtained from a 15 min passage read by a male speaker included more than 1200 annotated accents and several hundred phrase boundaries. The accuracy of the stylization method was evaluated by measuring NMSE error between original and stylized F0 contours and in a perception study. Stylized F0 contours which were perceived as very different from the original ones required further analysis and re-stylization. Finally, 640 mono-tonal accents formed 6 clusters and 580 bi-tonal accents formed another 6 clusters. The results of clustering confirmed the correctness of the stylization rules.

 
Poster Session 2: Analysis and Formulation of Prosody 7 of 22

Automatic Pitch Stylization Enhanced with Top-Down Processing

AUTHOR(S):
Wypych, Mikolaj; IFTR, Polish Academy of Sciences

Abstract:
In the article an original method of pitch stylization from speech waveform and its orthographic transcript is presented. In addition to bottom-up data processing, the top-down step is employed. The top-down step allows for the reduction of contextual variability of intonational structure constituents. Software implementation of the stylization method for the Polish language is described. The design takes advantage of components borrowed from an existing automatic intonation recognizer. Fundamental frequency extraction in the design is performed using a comb filter. In a subsequent stage, a syllable-wise pitch stylization is performed, followed by contextual pitch tracking. Intonational structure is recognized by an intonational parser based on Hidden Markov Models. The intonation model conveying an annotation system is taken from the recent intonation grammar for Polish by Jassem. Components of the design were developed in parallel which allowed for the coordination of tradeoffs between the modules. Training set and exemplary results are presented together with a discussion of future improvements.

 
Poster Session 2: Analysis and Formulation of Prosody 8 of 22

Evaluation of Pitch Detection Algorithms in Adverse Conditions

AUTHOR(S):
Kotnik, Bojan; University of Maribor
Hoege, Harald; Siemens AG
Kacic, Zdravko; University of Maribor

Abstract:
Robust fundamental frequency estimation in adverse conditions is important in various speech processing applications. In this paper a new pitch detection algorithm (PDA) based on the autocorrelation of the Hilbert envelope of the LP residual is compared to another well-established algorithm by Goncharoff. A set of evaluation criteria is collected on which the two PDAs are compared. In order to evaluate the algorithms in adverse conditions, a suitable reference database was constructed. This reference database consists of parts of the Spanish SPEECON speech database, from which recordings of 60 speakers were selected and manually pitch marked. The recordings cover several adverse conditions, such as noise in the car cabin and reverberation in office rooms. The evaluation highlights the good performance of the new algorithm in comparison, but shows that low SNR conditions and strong reverberation are still a demanding challenge for future pitch detection algorithms.
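For readers unfamiliar with autocorrelation-based pitch detection, the following is a minimal sketch of the basic idea only. It applies a plain autocorrelation to the waveform itself; it does not compute the LP residual or its Hilbert envelope, and it is neither the new PDA nor Goncharoff's algorithm. The function name, search range and synthetic test signal are assumptions made for illustration.

```python
import math

def autocorr_f0(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """Estimate F0 (Hz) of one frame from the lag of the autocorrelation peak."""
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalised autocorrelation at this lag.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Synthetic test frame: a clean 200 Hz sinusoid, 100 ms at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(800)]
f0_estimate = autocorr_f0(frame, sample_rate)
```

On clean periodic input the peak lag matches the period; noise and reverberation of the kind covered by the reference database are precisely what make this naive peak picking unreliable.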

 
Poster Session 2: Analysis and Formulation of Prosody 9 of 22

A General Approach for Automatic Extraction of Tone Commands in the Command-Response Model for Tone Languages

AUTHOR(S):
Gu, Wentao; The University of Tokyo
Hirose, Keikichi; The University of Tokyo
Fujisaki, Hiroya; The University of Tokyo

Abstract:
Although the command-response model for the process of F0 contour generation has been successfully applied to many languages, the inverse problem, viz., automatic derivation of the model parameters from an observed F0 contour, is more challenging, especially for tone languages, which have tone commands of both polarities. Since the polarity of tone commands cannot be inferred directly from the F0 contour itself, information on tone identity and timing needs to be incorporated. The current study gives a general approach for the first-order estimation of tone command parameters for tone languages, taking Mandarin and Cantonese as two examples. After a rule-based recognition of the tone command patterns within each syllable, the timing and amplitude of tone commands are deduced. The experiments show that the method gives good analysis results for both dialects.
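The forward direction of the model, which the inverse problem above has to undo, can be sketched as follows. This assumes the commonly cited formulation in which log F0 is a baseline plus the summed responses to phrase commands (impulses) and tone commands (positive or negative pedestals). The constants alpha, beta and gamma and all command values are illustrative assumptions, not values from the paper.

```python
import math

def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def tone_response(t, beta=20.0, gamma=0.9):
    """Step response of the tone control mechanism, with ceiling gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def f0_contour(t, base_hz, phrase_cmds, tone_cmds):
    """F0 (Hz) at time t; phrase_cmds = [(T0, Ap)], tone_cmds = [(T1, T2, At)]."""
    log_f0 = math.log(base_hz)
    for onset, amp in phrase_cmds:
        log_f0 += amp * phrase_response(t - onset)
    for start, end, amp in tone_cmds:
        log_f0 += amp * (tone_response(t - start) - tone_response(t - end))
    return math.exp(log_f0)

# Hypothetical commands: one phrase command, then a positive and a negative tone command.
phrase_cmds = [(0.0, 0.4)]
tone_cmds = [(0.20, 0.40, 0.3), (0.45, 0.65, -0.3)]
with_tones = [f0_contour(t, 100.0, phrase_cmds, tone_cmds) for t in (0.35, 0.60)]
without_tones = [f0_contour(t, 100.0, phrase_cmds, []) for t in (0.35, 0.60)]
```

Because a positive and a negative command can produce locally similar F0 movements, the sign of the amplitude is not recoverable from the contour alone, which is why tone identity and timing are brought in.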

 
Poster Session 2: Analysis and Formulation of Prosody 10 of 22

Comparison of Tonal Co-articulation between Intra- and Inter-word Disyllables in Mandarin

AUTHOR(S):
Wang, Xiaodong; Department of Electronic Engineering
Gu, Wentao; Department of Information and Communication Engineering
Hirose, Keikichi; Department of Information and Communication Engineering
Sun, Qinghua; Department of Electronic Engineering
Minematsu, Nobuaki; Department of Frontier Informatics

Abstract:
Features of tonal co-articulation in Mandarin speech are studied. Though several previous works investigated how prosodic features of syllables are affected by surrounding syllables, most of them selected nonsense syllable sequences as speech material without specific consideration of word boundaries. In the present study, however, tonal co-articulation is compared between intra-word and inter-word cases. The speech material is designed so that, in each pair of sentences, the target disyllables share exactly the same tonal context but differ in the position of the word boundary, which falls either at the beginning or in the middle of the target. Mean F0 and F0 range are adopted as prosodic features of each syllable, and the differences in mean F0 between the second and the first syllables of the target are calculated and compared for sentence pairs. Analysis of 16 disyllabic tone combinations shows that the effect of word boundary location on tonal co-articulation differs depending on the tone combination.

 
Poster Session 2: Analysis and Formulation of Prosody 11 of 22

Alignment of Medial and Late Peaks in German Spontaneous Speech

AUTHOR(S):
Niebuhr, Oliver; Institute of Phonetics and Digital Speech Processing, University of Kiel
Ambrazaitis, Gilbert; Center for Languages and Literature, Lund University

Abstract:
Starting from a corpus of German spontaneous speech, the phonetic realisations of the two KIM categories medial and late peak were investigated in prenuclear position. The results show that, for both categories, the onset of the rising F0 movement (L) is comparably aligned around the accented-syllable onset, whereas the F0 maximum (H) is independently aligned and predominantly located before the accented-syllable offset or after the onset of the following unaccented syllable, respectively. The data further suggest that also from the AM point of view the two prenuclear rises are different at the phonological level. Finally, the possibility is pointed out that the alignment patterns found for prenuclear rises in other studies are to some extent due to a combination of categories like the medial and late peak.

 
Poster Session 2: Analysis and Formulation of Prosody 12 of 22

Emotional, linguistic or just cute? The function of pitch contours in infant- and foreigner-directed speech

AUTHOR(S):
Knoll, Monja; University of Portsmouth
Uther, Maria; University of Portsmouth
MacLeod, Norman; The Natural History Museum
O'Neill, Mark; The Natural History Museum
Walsh, Stig; The Natural History Museum

Abstract:
Infant-directed speech (IDS) is characterised by acoustic modifications to adult-directed speech (ADS) including increased pitch, emotional affect and pitch contour exaggeration. Pitch contour function in IDS has not been determined, but it may be important for emotional expression or for gaining attention, or may have a linguistic role. Here two algorithmic approaches (DAISY and Eigenshape analysis) were used to analyse pitch contour shape in three speech recipient groups, with human raters as a qualitative comparison. Speech samples of target words from ten mothers were recorded while they talked to their infants, to a British adult confederate (control) and to a foreign adult confederate (linguistic condition). 167 pitch contours were extracted and converted to a standard format for the three approaches. Results indicate that IDS mostly contains exaggerated contours, while foreigner-directed speech (FDS) and ADS possess mainly flat curves. These results suggest an attentional-emotional role for the IDS pitch contours.

 
Poster Session 2: Analysis and Formulation of Prosody 13 of 22

Tone Ratios Combined with F0 Register in Cantonese as Speaker-dependent Characteristic

AUTHOR(S):
Li, Yujia; The Chinese University of Hong Kong

Abstract:
F0 is considered to provide speaker-specific information to some extent. Based on the wide agreement that extrinsic F0 is helpful for speaker identity, this paper investigates the possibility of using both extrinsic and intrinsic features of the Cantonese tone system as speaker-dependent characteristics. Considering the special characteristics of the Cantonese tone system, relative tone ratios and F0 register are proposed to model the tone systems generated by different speakers. The investigation is carried out through both recognition and analysis. The results primarily show the potential of applying such features to speaker characterization.

 
Poster Session 2: Analysis and Formulation of Prosody 14 of 22

Functional-oriented articulatory modeling of tones and intonations

AUTHOR(S):
Prom-on, Santitham; Department of Computer Engineering, King Mongkut's University of Technology Thonburi, Thailand
Xu, Yi; Department of Phonetics and Linguistics, University College London, UK
Thipakorn, Bundit; Department of Computer Engineering, King Mongkut's University of Technology Thonburi, Thailand

Abstract:
In this paper we report results of applying the quantitative target approximation model (qTA) to simulate function-specific F0 contours in Mandarin. The qTA model is based on a set of assumptions about the biophysical and neural control mechanisms of pitch production. To simulate F0 contours for tone and focus, we extracted qTA parameters that are tone-specific and adjustment parameters that are focus-specific. The accuracy and effectiveness of this approach were tested through a series of synthesis experiments. In the baseline case, the results were fair with just tonal specifications. Further experiments showed additional improvements when the parameters became more function-specific.

 
Poster Session 2: Analysis and Formulation of Prosody 15 of 22

Analysis and Modelling of Question Intonation in American English

AUTHOR(S):
Sityaev, Dmitry; Toshiba Research Europe Ltd
Burrows, Tina; Toshiba Research Europe Ltd
Jackson, Peter; Toshiba Research Europe Ltd
Knill, Katherine; Toshiba Research Europe Ltd

Abstract:
This paper addresses the modelling in text-to-speech of the rising intonation pattern in American English which is often found in yes-no questions. A small corpus containing yes-no questions was recorded and analysed. F0 was then modelled using an automatic procedure. The paper also reports on the stability of alignment of F0 targets in rising intonation patterns.

 
Poster Session 2: Analysis and Formulation of Prosody 16 of 22

A Method for Decomposing and Modeling Jitter in Expressive Speech in Chinese

AUTHOR(S):
Wang, Lei; Dept. Computer Science, Tianjin University
Li, Aijun; Institute of Linguistics, Chinese Academy of Social Sciences
Fang, Qiang; Institute of Linguistics, Chinese Academy of Social Sciences

Abstract:
Jitter is considered one of the most crucial factors in synthesizing natural emotional speech. Unlike traditional methods of measuring jitter in emotional speech, this paper proposes that the jitter in speech can be decomposed into two parts, namely deterministic jitter and random jitter. Deterministic jitter is associated with certain causes, possibly the effect of emotional state, while random jitter results from random events that have nothing to do with emotion. In addition, two different methods of modeling the jitter distribution are described: jitter decomposition is based on the fact that the mixed jitter can be divided into a deterministic part and a random part, while the GMM-based algorithm tries to simulate the shape of the histogram of the jitter distribution. The results provide a qualitative analysis of the two methods. Much work remains to be done in the future to carry out a more detailed and quantitative analysis of them.

 
Poster Session 2: Analysis and Formulation of Prosody 17 of 22

Intensity as a macroprosodic variable in Czech

AUTHOR(S):
Duběda, Tomáš; Institute of Phonetics, Charles University in Prague

Abstract:
The present paper provides an acoustic description of macrointensity patterns of stress units (prosodic words) in read Czech, as reflected by the intensity of syllable nuclei. Normalized intensity values show that there is a gradual macrodynamic decrease over the inter-pause group, followed typically by a significant intensity reset. Local intensity drops occur between the last two syllables of stress units; in addition, there is a major intensity drop before the pause. Syllables bearing perceived accents do not show intensity peaks.

 
Poster Session 2: Analysis and Formulation of Prosody 18 of 22

How far can prosodic cues help in word segmentation?

AUTHOR(S):
Bartkova, Katarina; France Telecom

Abstract:
Prosodic cues are of great importance in parsing the speech signal into prosodic and lexical units. Automatic speech recognition systems try to use prosodic parameters to detect boundaries of prosodic units and thus help the acoustic decoding process. Although the automatic detection of major prosodic boundaries is reliable most of the time, minor boundary detections are prone to error. A deeper understanding of the prosodic parameters in spontaneous speech would improve their modeling and their use by automatic systems. This study analyses filled and silent pause occurrences and two prosodic parameters, duration of pauses and vowels and F0 slopes, measured on a spontaneous speech corpus in French. The results of the analysis revealed that a simple local comparison of the parameter values with the values measured in the vicinity of the segment under consideration can provide valuable information on the lexical boundaries as well as on prosodic patterns of the lexical units.

 
Poster Session 2: Analysis and Formulation of Prosody 19 of 22

Acoustic Features of Japanese Vowel-Vowel Hiatus at Prosodic Boundaries

AUTHOR(S):
Kitazawa, Shigeyoshi; Shizuoka University

Abstract:
We investigated V-V hiatus through J-ToBI labeling and listening to whole phrases to estimate the degree of discontinuity and, if possible, to determine the exact boundary between two phrases. Appropriate boundaries were found in most cases as the maximum perceptual score. Using electroglottography (EGG) open quotients (OQ), pitch marks and spectrograms, the acoustic-phonological features of these V-V hiatuses were found to be phrase-initial glottalization and phrase-final nasalization, as well as phrase-final lengthening and phrase-initial shortening of the morae. A small F0 dip observable at the boundary of V-V hiatus was found to be a universal indication of glottalization. The test materials are taken from the "Japanese MULTEXT" and consist of particle - vowel (36), adjective - vowel (5), and word - word (4) sequences.

 
Poster Session 2: Analysis and Formulation of Prosody 20 of 22

Secondary Association of Tones in Castilian Spanish

AUTHOR(S):
Face, Timothy; University of Minnesota

Abstract:
This paper considers the role of secondary association of tones in Castilian Spanish. Recent studies have shown that Castilian Spanish has three contrasting bitonal rising pitch accents, posing a problem for standard Autosegmental-Metrical theory, which allows only a binary distinction between L*+H and L+H*. It is argued that this three-way contrast can be accounted for in a principled and constrained way if pitch accent tones can have secondary associations to metrical units much the same way that edge tones have been proposed to have secondary associations in several languages. In this way, primary association results in the association of the strong tone (or head) of the pitch accent with the tone-bearing unit, while secondary association more directly affects phonetic alignment. It is argued that secondary association of edge tones also exists in Castilian Spanish and is able to explain two pitch range effects that have been observed, but not explained, in previous analyses.

 
Poster Session 2: Analysis and Formulation of Prosody 21 of 22

L-tone affixation: Evidence from German dialects

AUTHOR(S):
Kügler, Frank; Institut für Linguistik

Abstract:
In a comparison of the tonal grammars of two German dialects, Swabian and Upper Saxon German, we observe a particular type of intonation contour that is similar in surface form, yet differs phonologically. Phonetically, the contour's shape is rising-falling; phonologically, the Swabian contour reads as L*H +L 0%, and the one of Upper Saxon as L+ H*L 0%. Both contours are marked ones, and arise through a process that we call L-affixation, which is indicated by the '+' diacritic. Both contours share a similar semantico-pragmatic meaning, i.e. they express narrow focus. An alternative interpretation of the postnuclear low tone in Swabian as a phrase accent is rejected.

 
Poster Session 2: Analysis and Formulation of Prosody 22 of 22

Rhythmic factors in weak-syllable insertion: An internet corpus study

AUTHOR(S):
Quené, Hugo; Utrecht University

Abstract:
Dutch language users often insert an inflectional schwa after an adverb, in certain grammatical constructions. The main hypothesis here is that this insertion, which is often ungrammatical, is driven by speakers' tendency towards regular speech rhythm, which overrides the fine grammatical nuances conveyed by absence of inflection. This rhythmicity hypothesis was investigated in a huge text corpus, viz. all web pages written in Dutch. The proportion of weak-syllable insertion was obtained for a sample of test phrases, varying in rhythmic context around the insertion point. Logistic regression of these proportions shows large and significant effects of rhythmic context on the odds of weak-syllable insertion. Hence, this insertion may well be due to rhythmical factors in speech production, in addition to lexical-grammatical factors.
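
To make the notion of "odds of weak-syllable insertion" concrete, the following Python sketch converts insertion counts into odds and compares two rhythmic contexts through a log odds ratio, the quantity that a logistic regression coefficient for a two-level factor estimates. The helper names and any counts passed to them are hypothetical and are not taken from the study.

```python
import math


def odds(inserted: int, total: int) -> float:
    """Odds of insertion: inserted cases divided by non-inserted cases."""
    not_inserted = total - inserted
    if inserted < 0 or not_inserted <= 0:
        raise ValueError("need 0 <= inserted < total")
    return inserted / not_inserted


def log_odds_ratio(ins_a: int, tot_a: int, ins_b: int, tot_b: int) -> float:
    """Log odds ratio of context A relative to context B.

    A positive value means insertion is more likely in context A.
    """
    return math.log(odds(ins_a, tot_a) / odds(ins_b, tot_b))


if __name__ == "__main__":
    # Invented counts: 30 of 100 phrases with insertion in context A,
    # 10 of 100 in context B.
    print(log_odds_ratio(30, 100, 10, 100))
```

A value of zero would indicate that rhythmic context makes no difference to the odds of insertion.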

 
Special Session 2 (SPS 2) Audio-Visual Prosody Processing
Organizers: Marc Swerts, Denis Burnham and Sascha Fagel
Tuesday, May 2, 16:00 - 18:00
Special Session 2: Auditory-Visual Prosody Processing 1 of 6

Measuring and modeling audiovisual prosody for animated agents

AUTHOR(S):
Granström, Björn; Center for Speech Technology, KTH
House, David; Center for Speech Technology, KTH

Abstract:
Understanding the interactions between visual expressions, dialogue functions and the acoustics of the corresponding speech presents a substantial challenge. The context of much of our work in this area is to create an animated talking agent capable of displaying realistic communicative behavior and suitable for use in conversational spoken language systems, e.g. a virtual language teacher. In this presentation we will give some examples of recent work, primarily at KTH, involving the collection and analysis of a database for audiovisual prosody. We will report on methods for the acquisition and modeling of visual and acoustic data, and provide some examples of analysis of head nods and eyebrow settings.

 
Special Session 2: Auditory-Visual Prosody Processing 2 of 6

Hearing and Seeing Beats: The influence of visual beats on the production and perception of prominence

AUTHOR(S):
Krahmer, Emiel; Tilburg University
Swerts, Marc; Tilburg University

Abstract:
Speakers can employ a variety of means to indicate that a word is important, including pitch accents and visual cues such as manual gestures, head nods and eyebrow movements (collectively referred to as visual beats). In this paper, we look at the relation between visual and auditory cues for prominence, based on data collected with an original experimental paradigm in which speakers were instructed to realize a particular target sentence with different distributions of auditory and visual cues. The first experiment revealed that visual beats have a significant effect on the spoken realization of the target words. When a speaker produces a visual beat, the word uttered simultaneously is produced with relatively more spoken emphasis, irrespective of the position of the auditory accent. The second experiment showed that when participants see a speaker realize one of these beat gestures on a word, they perceive this word as more prominent than when they do not see the beat gesture.

Special Session 2: Auditory-Visual Prosody Processing 3 of 6

Manipulating Uncertainty: The contribution of different audiovisual prosodic cues to the perception of confidence

AUTHOR(S):
Dijkstra, Christel; Tilburg University
Krahmer, Emiel; Tilburg University
Swerts, Marc; Tilburg University

Abstract:
When answering factual questions, speakers can signal whether they are uncertain about the correctness of their answer using prosodic cues such as fillers ("uh"), a rising intonation contour or a marked facial expression. It has been shown that on the basis of such cues, observers can make adequate estimates about the speaker's level of confidence, but it is unclear which of these cues has the largest impact on perception. To find the relative strength of the three aforementioned cues, a novel perception experiment was performed in which answers were artificially manipulated in such a way that all possible combinations of the cues of interest could be judged by participants. Results showed that while all three factors had a significant influence on the perception results, this effect was by far the largest for facial expressions.

 
Special Session 2: Auditory-Visual Prosody Processing 4 of 6

Visual Correlates of Prosodic Contrastive Focus in French: Description and Inter-Speaker Variability

AUTHOR(S):
Dohen, Marion; Institut de la Communication Parlée / Human Information Science Research Labs - ATR
Lœvenbruck, Hélène; Institut de la Communication Parlée
Hill, Harold; Human Information Science Research Labs - ATR

Abstract:
This study is a follow-up of previous studies we conducted on the visible articulatory correlates of French prosodic contrastive focus. A two-speaker analysis using an automatic lip-tracking device had shown that these correlates existed and were used in visual perception. However, the articulatory strategies depended on the speaker. The purpose of this study was thus to extend the analysis to other speakers, examine the similarities and variabilities and try to identify global tendencies. We recorded five speakers of French with a 3D optical tracker using a 13-sentence (subject-verb-object) corpus and four focus conditions (S, V, O or neutral). An articulatory analysis confirmed that visible articulatory correlates exist for all the speakers. The strategies used are mainly of two types: absolute and differential. An analysis of other facial movements showed that an eyebrow raise and/or a head nod can signal focus. This association is, however, highly inter- and intra-speaker dependent.

 
Special Session 2: Auditory-Visual Prosody Processing 5 of 6

Audio and Audio-visual Effects of a Short English Emotional Sentence on Japanese L2's and English L1's Cognition, and Physio-acoustic Correlate

AUTHOR(S):
Isei-Jaakkola, Toshiko; The University of Tokyo
Sun, Qinghua; The University of Tokyo
Hirose, Keikichi; The University of Tokyo

Abstract:
The cognition test results of audio (A) and audio-visual (AV) effects on nine English emotions in a short sentence were compared to the physio-acoustic features of the sound used for the cognition tests. Two groups of Japanese learners of English (JL2) and one group of English speakers (EL1) participated in these A and AV cognition tests. In the physio-acoustic analyses we used F0 and intensity contours and calculated the area of sentential patterns and three forms of distance: area-, average-, and pattern-distance for each emotion. It was found that the order of the correct answer ratios using dialogues, a short statement, and a word, was: dialogues > short statement > word in A, and word > short statement in AV. The relationships between these cognition tests and physio-acoustic analyses confirmed that, although there was no high correlation between them, intensity seems to be more strongly correlated than F0 with the audio cognition test results of both JL2 and EL1.

 
Special Session 2: Auditory-Visual Prosody Processing 6 of 6

Emotional McGurk Effect

AUTHOR(S):
Fagel, Sascha; Technical University Berlin

Abstract:
Speaking is a physiological process that manifests in the acoustic and in the optic domain and hence is audible and visible. These two modalities influence each other in perception. Under normal circumstances the speech information in both channels is coherent and complementary and is integrated into a single percept. But if the information is conflicting and nevertheless integrated, then the percept in one of the modalities might be changed by the other modality. The experiment described here shows that when the video of an utterance spoken in one emotion is dubbed with the audio of the utterance spoken in another emotion, the perceived emotion might be a third one, present in neither the auditory nor the visual modality.

 
Oral Session 1 (OS 1) Prosodic Variability
Wednesday, May 3, 09:45 - 11:25
Chair: Gösta Bruce
Oral Session 1: Prosodic Variability 1 of 5

Pronunciation Variant Selection for Spontaneous Speech Synthesis - A Summary of Experimental Results

AUTHOR(S):
Werner, Steffen; Dresden University of Technology
Hoffmann, Rüdiger; Dresden University of Technology

Abstract:
To make synthesized speech more natural and colloquial, the regularity of synthesized speech has to be overcome and spontaneous speech effects have to be integrated into the synthesis process. In a first step towards spontaneous speech we introduced different duration control methods in speech synthesis. In this paper we summarize the results of previous work on changing the speaking rate indirectly by controlling the grapheme-to-phoneme conversion through different pronunciation variant selection algorithms. The presented results of listening experiments show a significant improvement in the category colloquial impression. To evaluate the quality of the most outstanding variant selection approach compared to the canonical synthesis, we performed a new listening test on longer speech samples. The variant synthesis applying a pronunciation variant sequence model achieved a significantly lower listening effort and a higher overall rating (MOS) compared to the canonical synthesis.

 
Oral Session 1: Prosodic Variability 2 of 5

Explaining cross-linguistic differences in effects of lexical stress on spoken-word recognition

AUTHOR(S):
Cutler, Anne; Max Planck Institute for Psycholinguistics
Pasveer, Dennis; Max Planck Institute for Psycholinguistics

Abstract:
Experiments have revealed cross-language differences in listeners' use of stress information in recognising spoken words. Previous comparisons of the Spanish and English vocabularies suggested that the differences might reflect the extent to which considering stress in spoken-word recognition allows rejection of unwanted competition from embedded words. This hypothesis was tested on the vocabularies of Dutch and German, for which word recognition results resemble those from Spanish more than those from English. The vocabulary statistics likewise revealed that in each language, the reduction of embeddings resulting from consideration of stress more closely resembles the reduction achieved in Spanish than in English.

 
Oral Session 1: Prosodic Variability 3 of 5

Dialect Alignment Signatures

AUTHOR(S):
Ní Chasaide, Ailbhe; Phonetics and Speech Laboratory, Trinity College Dublin
Dalton, Martha; Phonetics and Speech Laboratory, Trinity College Dublin

Abstract:
This paper considers the hypothesis that dialects may have characteristic patterns in the alignment of the melodic contour with the segmental or syllabic tiers. Peak alignment was measured in initial prenuclear accented syllables for three dialects of Connaught Irish: Cois Fharraige, Inis Oirr and Mayo. The size of the anacrusis varied between two (PN2), one (PN1) and no (PN0) unstressed syllables before the accented one. Results support the hypothesis and indicate that the fine timing of peak alignment does differ systematically among the three dialects. In the first, Cois Fharraige, peaks remain fixed across anacrusis conditions, being aligned to the right edge of the accented syllable. The two other dialects reveal more variable peak timing: Inis Oirr is moderately variable, showing a tendency for the peak to fall within the stressed vowel but shifting rightwards to the syllable boundary when there is no anacrusis (PN0). The Mayo dialect is extremely variable across the prenuclear conditions. It is argued that such fine time alignment differences may be important to the differentiation of even closely related dialects.

 
Oral Session 1: Prosodic Variability 4 of 5

Emotional Prosody - Does Culture Make A Difference?

AUTHOR(S):
Burkhardt, Felix; T-Systems Enterprise Services
Audibert, Nicolas; ICP - University of Stendhal, Grenoble
Malatesta, Lori; IVML - Technical University of Athens
Türk, Oytun; R&D Dept., Sestek Inc., Istanbul
Arslan, Levent; R&D Dept., Sestek Inc., Istanbul
Aubergé, Véronique; ICP - University of Stendhal, Grenoble

Abstract:
We report on a multilingual comparison study on the effects of prosodic changes on emotional speech. The study was conducted in France, Germany, Greece and Turkey. Semantically identical sentences expressing emotionally relevant content were translated into the target languages and were manipulated systematically with respect to pitch range, duration model, and jitter simulation. Perception experiments in the participating countries showed relevant effects irrespective of language. Nonetheless, some effects of language are also reported.

 
Oral Session 1: Prosodic Variability 5 of 5

Estonian and English rhythm: a two-dimensional quantification based on syllables and feet

AUTHOR(S):
Asu, Eva Liina; Institute of the Estonian Language
Nolan, Francis; University of Cambridge

Abstract:
This paper expands a recent pilot experiment on Estonian rhythm within the quantificational approach to the study of rhythm, using the Pairwise Variability Index (PVI). The PVI expresses the average difference between adjacent phonological units such as vowels, consonantal intervals or syllables. It is argued here that confining the application of the PVI to the level of the syllable (or its components) misses the essence of Estonian rhythm and indeed of phonetic rhythm in general, and the first experiment reported in this paper quantifies Estonian rhythm in terms of the durational PVI of both the syllable and (innovatively) the foot. In the second experiment, results are compared with the same measures for another language with strong stress, English. Both languages have a similar, relatively low foot PVI, but English has a considerably higher syllable PVI reflecting its radical reduction of unstressed syllables in polysyllabic feet.
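
As a rough illustration of the index described above, the following Python sketch computes a raw and a normalised PVI over a sequence of unit durations (syllables or feet). The abstract does not state which variant was used, so both are shown; the function names are assumptions made only for this example.

```python
from typing import Sequence


def raw_pvi(durations: Sequence[float]) -> float:
    """Mean absolute difference between adjacent durations."""
    if len(durations) < 2:
        raise ValueError("need at least two durations")
    diffs = [abs(a - b) for a, b in zip(durations, durations[1:])]
    return sum(diffs) / len(diffs)


def normalised_pvi(durations: Sequence[float]) -> float:
    """Mean adjacent difference, each scaled by the pair's mean, times 100.

    Scaling each difference by the local mean duration reduces the
    influence of overall speaking rate.
    """
    if len(durations) < 2:
        raise ValueError("need at least two durations")
    terms = [
        abs(a - b) / ((a + b) / 2)
        for a, b in zip(durations, durations[1:])
    ]
    return 100 * sum(terms) / len(terms)
```

Applied once to syllable durations and once to foot durations, the same functions would yield the two dimensions named in the title; equal adjacent durations give a PVI of zero, and larger contrasts between neighbours give larger values.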

 
Oral Session 2 (OS 2) Prosody in Dialogue Speech
Wednesday, May 3, 11:50 - 13:10
Chair: Mark Hasegawa-Johnson
Oral Session 2: Prosody in Dialogue Speech 1 of 4

Intonational variation in adolescent conversational speech: rural versus urban patterns

AUTHOR(S):
Fletcher, Janet; University of Melbourne
Loakes, Deborah; University of Melbourne

Abstract:
The conversational speech of ten female adolescents was analyzed intonationally with a view to determining whether there is variation between rural and urban varieties in Australian English. The data revealed that urban females use marginally more 'uptalk' than their rural counterparts, as well as more sustained, level tunes. These differences and other aspects of intonational variation are presented in terms of a prevailing intonational model of English, and discourse annotation schema.

 
Oral Session 2: Prosody in Dialogue Speech 2 of 4

The Friendliness Perception of Dialogue Speech

AUTHOR(S):
Tao, Jianhua; Institute of Automation, Chinese Academy of Sciences, Beijing
Huang, Lixing; Institute of Automation, Chinese Academy of Sciences, Beijing
Kang, Yongguo; Institute of Automation, Chinese Academy of Sciences, Beijing
Yu, Jian; Institute of Automation, Chinese Academy of Sciences, Beijing

Abstract:
The paper focuses on the analysis and perception of friendliness in dialogue speech. To this end, it uses the concept of a "perception vector", which contains information on emotions and softness. In creating the "perception vector", and to simulate perceptual ambiguity, listeners are allowed to label the speech with multiple emotions, which are aligned into "one choice", "first choice" and "second choice". A correlation analysis between friendliness and the "perception vectors" then shows that friendliness is positively correlated with "softness", "happiness" and "anger". Finally, a classification tree model is trained to predict the degree of friendliness from acoustic features. With the classification tree model, we obtain ranking scores for the importance of the acoustic parameters for perceptually synthesized speech. Results show that the F0 mean plays the most important role in emotion perception, and that Ee is the most important voice-quality parameter for the perception model.

 
Oral Session 2: Prosody in Dialogue Speech 3 of 4

Immediate effects of intonational prominence in a visual search task

AUTHOR(S):
Ito, Kiwako; Linguistics, Ohio State University
Speer, Shari R.; Linguistics, Ohio State University

Abstract:
Studies of spontaneous speech show that speakers consistently mark contrastive words using pitch accent. To investigate how listeners process contrastive accentual prominence, eye-movements were monitored as participants listened to directions and searched for ornaments to decorate holiday trees. Eye movements to target ornament cells were earlier when intonation felicitously marked contrast on a color adjective (e.g. First, hang the green drum → Next, hang the ORANGE drum) than when it did not (→ orange DRUM). Felicitous emphatic accent placement induced earlier fixations to the target compared to lack of emphasis (→ orange drum). In addition, infelicitous use of accent on the modifier (e.g. green drum → ORANGE ball) led to incorrect initial fixations to the preceding cell (e.g. drum) before the noun itself was processed. These results demonstrate immediate processing of accentual information on a modifier leading to a strong expectation about the upcoming discourse entity.

 
Oral Session 2: Prosody in Dialogue Speech 4 of 4

Spoken Dialogue System Using Recognition of User's Feedback for Rhythmic Dialogue

AUTHOR(S):
Fujie, Shinya; Waseda University
Miyake, Riho; Waseda University
Kobayashi, Tetsunori; Waseda University

Abstract:
A method for recognizing the user's feedback during the system's utterance is proposed, and its application to a spoken dialogue system is discussed. In human conversation, we can infer the dialogue partner's internal state from such feedback. Our research topics are (1) developing a feedback recognizer based on prosodic information and (2) appropriately controlling the timing of the system's utterances in accordance with the user's feedback. The implemented recognizer can distinguish between back-channel and ask-back feedback word-independently, using features based on prosodic information and a statistical recognition method. Experiments with a spoken dialogue system with this function reveal when it should generate the next utterance after receiving the user's feedback.

 
Poster Session 3 (PS 3) Prosody and Speech Production
Wednesday, May 3, 14:30 - 16:00
Chair: Grażyna Demenko
Poster Session 3: Prosody and Speech Production 1 of 24

Register in Mah Meri: A preliminary phonetic analysis

AUTHOR(S):
Stevens, Mary; University of Melbourne
Kruspe, Nicole; University of Melbourne
Hajek, John; University of Melbourne

Abstract:
This paper presents the results of a first phonetic investigation of register in Mah Meri, a Southern Aslian language spoken in Peninsular Malaysia, and part of the larger Austroasiatic family spread throughout South and Southeast Asia. Voice register, a complex of laryngeal and supralaryngeal properties, is a common areal feature amongst members of the Austroasiatic family (particularly the Mon-Khmer group) but has never previously been reported to occur in an Aslian language. We consider general spectral appearance, duration and f0 in order to see how well they correlate with perceived differences in register.

 
Poster Session 3: Prosody and Speech Production 2 of 24

Prosody As Marker of Discourse Segmentation in Suyá

AUTHOR(S):
Oliveira, Jr., Miguel; University of Manchester

Abstract:
The present study investigates whether, as in several well-documented languages, prosody plays a role in the signaling of discourse segmentation in Suyá, an Amazonian language of the Jê group. Inspired by the literature, the following prosodic variables were selected for analysis: pause, pitch reset and boundary tones.

 
Poster Session 3: Prosody and Speech Production 3 of 24

Quantitative analysis of intonation patterns in statements and questions in Cantonese

AUTHOR(S):
Ma, Joan K.-Y.; University of Hong Kong
Ciocca, Valter; University of Hong Kong
Whitehill, Tara L.; University of Hong Kong

Abstract:
The aim of this study was to investigate intonation patterns in Cantonese using a quantitative approach. The command-response model was employed to explore the differences between intonations, and the effects of lexical tone on fundamental frequency contours of intonation. Two intonation types, with six tonal contrasts embedded at the final position, were collected from twelve native Cantonese speakers. Results showed that F0 in questions was raised for the entire utterance, which was mainly associated with baseline frequency changes. An additional positive boundary tone command occurred towards the end of the final syllable of questions, which denoted the final-rise in F0 in questions. A lengthened duration of the tone command towards the end of questions was also observed. The amplitude of the final-rise in the contours of questions was affected by the tone of the final syllable, with significantly higher amplitude noted for the boundary tone command of tones 25 and 21.
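
For readers unfamiliar with the command-response model, the Python sketch below generates an F0 value from a baseline frequency plus phrase and tone commands, using the commonly cited formulation in which commands are summed in the log-F0 domain. The constants (ALPHA, BETA, GAMMA), function names and command layout are assumptions for illustration; the abstract does not give the exact variant or parameter values used in the study.

```python
import math
from typing import Sequence, Tuple

# Commonly quoted default constants; assumed here, not taken from the study.
ALPHA, BETA, GAMMA = 3.0, 20.0, 0.9


def phrase_response(t: float, alpha: float = ALPHA) -> float:
    """Impulse response of the phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def tone_response(t: float, beta: float = BETA, gamma: float = GAMMA) -> float:
    """Step response of the tone (accent) control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0(
    t: float,
    baseline_hz: float,
    phrase_cmds: Sequence[Tuple[float, float]],       # (amplitude, onset)
    tone_cmds: Sequence[Tuple[float, float, float]],  # (amplitude, onset, offset)
) -> float:
    """F0 in Hz at time t: baseline plus command responses in the log domain."""
    log_f0 = math.log(baseline_hz)
    for amp, onset in phrase_cmds:
        log_f0 += amp * phrase_response(t - onset)
    for amp, onset, offset in tone_cmds:
        log_f0 += amp * (tone_response(t - onset) - tone_response(t - offset))
    return math.exp(log_f0)
```

In this formulation, raising the baseline shifts the whole contour upwards, while a positive tone command placed near the end of the utterance produces a local final rise; tone commands may also take negative amplitudes in tone-language applications.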

 
Poster Session 3: Prosody and Speech Production 4 of 24

Interaction between the Scottish English System of Prominence and Vowel Length

AUTHOR(S):
Gordeeva, Olga; Speech Science Research Centre, Queen Margaret University College, Edinburgh

Abstract:
This study looks into the interaction between the quasi-phonemic vowel length contrast in Scottish English and its word-prosodic system. We show that under the same phrasal accent the phonetically short vowels of the morphologically conditioned quasi-phonological contrast are produced with significantly more laryngeal effort (spectral balance) than the long ones, while the vowels do not differ in quality, overall intensity or fundamental frequency. This difference is explained by employing the concept of "functional load". Duration must be kept short to mark the short vowel length, while both word-stress and phrasal accent require lengthening. Therefore, the additional laryngeal effort in the short vowels serves a prominence-enhancing function. This finding supports the hypothesis proposed by Beckman that phonological categories of word-prosodic systems featuring "stress-accent" are not necessarily phonetically uniform language-internally.

 
Poster Session 3: Prosody and Speech Production 5 of 24

Preliminary Results of Prosodic Effects on Domain-initial Segments in Hamkyeong Korean

AUTHOR(S):
Kim, Sung-A; Dong-A University, Korea

Abstract:
This paper investigates domain-initial strengthening in English and Hamkyeong Korean, a pitch accent dialect spoken in the northern part of North Korea. The question addressed in the present study is whether the domain-initial strengthening effect is observed in domain-initial vowels as well as domain-initial consonants. In the experiment, durations of initial-syllable vowels in various prosodic domains were compared to those of second vowels in real-word tokens for both languages. Hamkyeong Korean, like English, turned out to strengthen the domain-initial consonants. With regard to vowel durations, we found no significant prosodic effect in English. Yet, Hamkyeong Korean showed significant differences between durations of initial and non-initial vowels in the prosodic domains. The findings in the study are theoretically important as they show that the potentially universal phenomenon of initial strengthening is subject to language-specific variations in its implementation.

 
Poster Session 3: Prosody and Speech Production 6 of 24

Syntax and syllable count as predictors of French tonal groups: Drawing links to memory for prosody

AUTHOR(S):
Gilbert, Annie C.; Université de Montréal
Boucher, Victor J.; Université de Montréal

Abstract:
While the role and origin of prosodic structures remain unclear, there is evidence that prosody bears an intriguing relationship with serial memory processes and grouping effects. This link is seen in the fact that the recall of presented prosodic patterns and their production in speech are both restricted in terms of syllable count. The present experiment complements previous studies by examining the effects of syntactic structure as opposed to constituent length on produced tonal groups. Forty subjects produced, in quasi-spontaneous conditions, given utterances with differing NP and VP structures or differing lengths. The results show that constituent length is the major predictor, whereas syntactic structure appears as a secondary factor.

 
Poster Session 3: Prosody and Speech Production 7 of 24

Articulatory Strengthening and Prosodic Hierarchy

AUTHOR(S):
Cao, Jianfeng; Institute of Linguistics, Chinese Academy of Social Sciences
Zheng, Yuling; Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences

Abstract:
This paper reports a set of results based on spectral and EPG measurements of read speech corpora in Mandarin Chinese, aimed at observing the relationship between articulatory strengthening and the prosodic hierarchy. The data obtained from both acoustic and physiological measurements indicate that the articulatory manifestation of any segment in real speech is closely related to its prosodic position or status in connected speech. It therefore becomes possible to predict the hierarchical organization of speech prosody from the degree of such articulatory strengthening. At the same time, this evidence further reveals the existence of anticipatory planning in speech production. Consequently, our finding should not only benefit Chinese speech processing but also provide a new angle from which to understand the mechanism of speech production in general.

 
Poster Session 3: Prosody and Speech Production 8 of 24

Articulatory and acoustic correlates of prenuclear and nuclear accents

AUTHOR(S):
Mücke, Doris; IfL Phonetik, University of Cologne
Grice, Martine; IfL Phonetik, University of Cologne
Becker, Johannes; IfL Phonetik, University of Cologne
Hermes, Anne; IfL Phonetik, University of Cologne
Baumann, Stefan; IfL Phonetik, University of Cologne

Abstract:
We investigate acoustic and articulatory anchors for F0 targets corresponding to prenuclear and nuclear accent peaks in German, both across two different articulation rates and across two different syllable structures. We found that the alignment of turning points in the F0 signal with minima and maxima in the kinematic signal was more stable than with segment boundaries in the acoustic signal. Whereas in Dutch the H peak of a rising prenuclear (L*+H) accent has been shown to occur at the edge of the accented syllable, in German the peak occurs during the vowel in the postaccented syllable. In articulatory terms, the peak aligns with articulatory gestures corresponding to the vowel. As in English and Dutch, nuclear peaks in German are aligned earlier in the acoustic signal than prenuclear ones. The alignment of F0 peaks with the kinematic signal was highly systematic, and can be interpreted as a shift from a gesture corresponding to a vowel to a gesture corresponding to a consonant.

 
Poster Session 3: Prosody and Speech Production 9 of 24

Prosodic Marking of Focus Domains - Categorical or Gradient?

AUTHOR(S):
Baumann, Stefan; IfL-Phonetik, University of Cologne
Grice, Martine; IfL-Phonetik, University of Cologne
Steindamm, Susanne; IfL-Phonetik, University of Cologne

Abstract:
This paper reports on a production experiment in German eliciting focus domains of various sizes, ranging from broad to narrow focus, as well as contrastive focus. Results show that speakers use categorical as well as gradient prosodic means to indicate different focus structures, with an increase of prominence-lending cues as the focus domain narrows. Contrast is shown to enhance certain differences between narrow and broad focus. There is a clear indication that speakers differ considerably as to the combination of strategies they employ for marking focus structure.

 
Poster Session 3: Prosody and Speech Production 10 of 24

L tone downtrends in Korean across utterance types

AUTHOR(S):
Kim, Kyung-hee; IfL-Phonetik, University of Cologne

Abstract:
Research on global pitch trends has shown that statements and different types of questions in Dutch all display distinct patterns, and suggests that these may be influenced by the presence of accentual prominence on wh-words and whether syntactic cues to interrogativity are present. This implies that there would be different pitch trends in a language such as Korean which lacks accentual prominence and which does not have to have an interrogative syntax in unmarked yes-no questions. We test this implication by comparing the results in [11] with similar statements and question types in Korean, concentrating in this paper on the scaling of L tones. Further, we differentiate between the pitch trends towards the end of the utterances and those in the rest of the utterance, so as to investigate the contribution of final lowering to the shape of global trends.

 
Poster Session 3: Prosody and Speech Production 11 of 24

The domain of realization of the L-phrase tone in American English

AUTHOR(S):
Barnes, Jonathan; Boston University
Shattuck-Hufnagel, Stefanie; MIT
Brugos, Alejna; Boston University
Veilleux, Nanette; Simmons College

Abstract:
The phonetic realization of intonational targets in the f0 contour is not always straightforwardly predicted by their affiliations in the segmental string, and the phrase tones of American English are a type of target for which several hypotheses about the domain of realization have been advanced. By varying the metrical structure of target words at the end of a phrase produced with the H* L- H% 'surprised dismay' contour, we determined that a) the right edge of the L-, signaled by the beginning of the rise for the H%, occurs close to the right edge of the phrase, b) the left edge of the L-, signaled by the end of the fall from the H*, stretches leftward to seek a prominent syllable, and c) there is significant variation in the resolution of the various factors that influence these two inflection point locations.

 
Poster Session 3: Prosody and Speech Production 12 of 24

Prosodic Encoding of Topic and Focus in Mandarin

AUTHOR(S):
Wang, Bei; University of Potsdam
Xu, Yi; University College London & Haskins Laboratories

Abstract:
In this study, we investigate whether and how focus and topic can be separately encoded in Mandarin. A total of 60 sentences with three lengths and five tone combinations were recorded in four topic-focus conditions: initial focus, new topic, implicit topic and given topic, by six speakers. The results of acoustic analysis show that new topic is encoded with a raised pitch range on the initial word. Focus, in contrast, is encoded with an expanded pitch range on the focused word and a suppressed pitch range on the subsequent words.

 
Poster Session 3: Prosody and Speech Production 13 of 24

Contextual Tonal Variations and Pitch Targets in Cantonese

AUTHOR(S):
Wong, Ying Wai; The Chinese University of Hong Kong

Abstract:
With Cantonese as the target language, this study investigates the phonetic details of contextual tonal variations in disyllabic tonal sequences. It was found that the main source of F0 (fundamental frequency) contour deviation from the canonical form is the carryover effect, which is assimilatory in nature. Furthermore, based on the Target Approximation (TA) model, an optimization problem was formulated in an attempt to derive mathematically the pitch targets of the six lexical tones in Cantonese. Finally, implications of our results for tone production and perception are discussed.

 
Poster Session 3: Prosody and Speech Production 14 of 24

Realization of Cantonese Rising Tones under Different Speaking Rates

AUTHOR(S):
Wong, Ying Wai; The Chinese University of Hong Kong

Abstract:
The two Cantonese rising tones, the high-rising and the low/mid-low rising tone, were found to maintain their distinct slopes of F0 (fundamental frequency) rise and their offset F0 under different speaking rates. This suggests the two as possible acoustic cues for rising tone discrimination. The rising contours, under whichever speaking rate, reside in an area temporally near the syllable offset. Furthermore, through tests with different alignment methods, the rising contours were found to show the most significant overlap when aligned with the offset of the host syllable. Finally, the characterization of rising tones within the Target Approximation (TA) model is discussed.

 
Poster Session 3: Prosody and Speech Production 15 of 24

Thai tonal contrast under changes in speech rate and stress

AUTHOR(S):
Nitisaroj, Rattima; Georgetown University

Abstract:
This study investigates how the five lexical tones in Thai are realized on primary-stressed, secondary-stressed, and unstressed syllables produced at fast, normal and slow rates. The results revealed that 1) speech rate does not have any significant effect on F0 height, excursion size and F0 peak and valley location of Thai tones, 2) tones on primary-stressed syllables have a larger excursion size than those on secondary-stressed and unstressed syllables, and 3) the five-way tonal contrast in the language is maintained regardless of changes in speech rate and stress.

 
Poster Session 3: Prosody and Speech Production 16 of 24

Rate sensitivity of syllable in French: a perceptual illusion?

AUTHOR(S):
Pasdeloup, Valérie; Université de Rennes 2 & LPL, UMR 6057 CNRS, Université d'Aix-en-Provence
Espesser, Robert; LPL, UMR 6057 CNRS, Université d'Aix-en-Provence
Faraj, Malika

Abstract:
This study takes place within the framework of Gestalt theory. The aim of this work is to determine the way the prosodic scene reorganises itself according to the variation of speech rate. How do the forms constituted by stressed syllables interact with the ground of unstressed syllables? We present a study of the temporal structure of a one thousand word speech corpus. The corpus was produced at three different rates (normal, fast and slow) by one speaker with two repetitions. The goal is to constrain the rhythmical structure of speech in order to observe how rhythmic patterns depend on the variation of speech rate. Results show that rhythm is not elastic. When speech rate changes, syllabic duration does not vary in the same way for stressed and for unstressed syllables. Unstressed syllables have very little elasticity compared with stressed syllables. This result supports the hypothesis that the unstressed syllable is an anchor point in the rhythmic structure of French.

 
Poster Session 3: Prosody and Speech Production 17 of 24

Production of word stress in German: Children and adults

AUTHOR(S):
Schneider, Katrin; Institute for Natural Language Processing, University of Stuttgart
Möbius, Bernd; Institute for Natural Language Processing, University of Stuttgart

Abstract:
This study investigates the acoustic correlates of contrastive word stress in bisyllabic and trisyllabic German words, produced by children and their parents. Results of the acoustic analysis of speech data are reported that were collected from three children aged 2;3 to 6;1 and their mothers during a period of two years. The results suggest that German children between 2 and 6 years of age are able to produce contrastive word stress but differ in their choice and usage of the parameters that mark stress. We found that, for German, vowel duration is the most reliable correlate of word stress in the utterances produced by all three children as well as their mothers. Adult-like usage of fundamental frequency, intensity, and several voice quality parameters appears to be acquired later than that of duration; this observation may be confounded by the finding that these parameters appear to be used less consistently than duration to mark stress even by the mothers.

 
Poster Session 3: Prosody and Speech Production 18 of 24

Stress and Accent in Catalan and Spanish: Patterns of duration, vowel quality, overall intensity, and spectral balance

AUTHOR(S):
Prieto, Pilar; ICREA-UAB
Ortega-Llebaria, Marta; University of Texas-Austin

Abstract:
This article is concerned with the acoustic correlates that characterize stress and accent in Catalan and Spanish. We analyzed four acoustic correlates of stress (syllable duration, vowel quality, overall intensity, and spectral balance) in stressed and unstressed syllables in both accented and unaccented positions. Given that Spanish and Catalan differ greatly in their use of vowel reduction to mark stressed positions, we test whether they will also differ in the way they use the other acoustic correlates to signal the presence of stress and accent. In line with the findings of Sluijter & collaborators (1996, 1997) and Campbell & Beckman (1997) on Dutch and English, Catalan and Spanish reveal systematic differences in the acoustic characterization of stress and accent. Specifically, while syllable duration, vowel quality, and spectral tilt are reliable acoustic correlates of the stress difference in both languages, accentual differences are acoustically marked by overall intensity cues.

 
Poster Session 3: Prosody and Speech Production 19 of 24

Acoustic Cues of Stress and Accent in Catalan

AUTHOR(S):
Astruc-Aguilera, Lluïsa; University of Cambridge (from Feb 2006, Associate Lecturer, The Open University)
Prieto, Pilar; Universitat Autònoma de Barcelona

Abstract:
This paper examines the phonetic correlates of stress and accent in Catalan. We analyzed five acoustic correlates of stress (syllable duration, spectral balance, vowel quality, vowel pitch, and vowel intensity) in two stress conditions and in two accent conditions, which is to say, in stressed and unstressed syllables in both accented and unaccented environments (that is, appositions in sentences such as Vol la vela, la vella '(S)he wants the sail, the old sail' vs. right-dislocated subjects in Vol la vela, la vella '(S)he wants the sail, the old lady'). In line with the findings of Sluijter & collaborators and Campbell & Beckman on Dutch and on English, Catalan reveals systematic differences in the acoustic characterization along the accent and stress dimensions. Syllable duration, spectral balance, and vowel quality are reliable acoustic correlates of the stress differences, while accentual differences are acoustically marked by intensity and pitch cues.

 
Poster Session 3: Prosody and Speech Production 20 of 24

Boundaries and Tonal articulation in Taiwanese Min

AUTHOR(S):
Pan, Ho-hsien; National Chiao Tung University
Tai, Yi-hsin; National Chiao Tung University

Abstract:
This study investigated the effect of the boundary on Taiwanese falling tones at the domain-final and domain-initial positions across the intonational phrase (IP), tone group (TW), word (WRD) and syllable (SYL) boundaries. The boundaries were placed at the same position within sentences produced with broad focus. The results showed that in domain-final position, the f0 of falling tones decreased at a slower rate before IP and TW boundaries than before WRD and SYL boundaries. By contrast, in domain-initial position, the ranking for the rate of f0 decrease was IP, then TW, then SYL, and finally WRD. It is proposed that the rate of f0 decrease, reflecting vocal fold vibration, varies as a function of approaching and receding boundaries. At supra-segmental levels, the velocity of f0 decrease slows down as the approaching boundary weakens, whereas it speeds up as the receding boundary strengthens.

 
Poster Session 3: Prosody and Speech Production 21 of 24

Declination and supra-laryngeal articulation in Cantonese - EPG study

AUTHOR(S):
Yuen, Ivan; Queen Margaret University College

Abstract:
Supra-laryngeal declination was reported in Italian and English. Such findings suggest that declination is not confined to the laryngeal sub-system and its acoustic output - F0. This paper intended to examine the supra-laryngeal articulation and declination in Hong Kong Cantonese (a tone language) and tested whether declination also affects supra-laryngeal articulation. In light of recent findings in the effect of prosodic positions on articulation, it is the second goal of this paper to investigate any interaction of prosodic positions and declination on supra-laryngeal articulation. Results showed no supra-laryngeal declination; however, declination interacts with prosodic positions in F0 scaling.

 
Poster Session 3: Prosody and Speech Production 22 of 24

Effects of stress on intonational structure in Greek

AUTHOR(S):
Baltazani, Mary; University of Ioannina

Abstract:
This paper presents the results of a production experiment that examines the effects of stress on the realization of tonal events in the intonation of Greek. Words in three different stress categories (final, penultimate and antepenultimate stress) were examined in two different prosodic positions: at the edge of an intermediate phrase and in phrase-medial position. The results show that stress position affects the alignment and scaling of tones at the edge of an intermediate phrase but not in phrase-medial position. Moreover, a phrase-final word showed considerably longer duration than the same word in phrase-medial position.

 
Poster Session 3: Prosody and Speech Production 23 of 24

Time-domain Noise Subtraction Applied in the Analysis of Lombard Speech

AUTHOR(S):
Mixdorff, Hansjörg; TFH Berlin University of Applied Sciences
Grauwinkel, Katja; TFH Berlin University of Applied Sciences
Vainio, Martti; University of Helsinki

Abstract:
This paper presents results of a comparison between speech produced in silence and speech produced in noise, also known as Lombard speech. A temporal filtering algorithm was developed which successfully removes the ambient noise from recordings of Lombard speech by locating and subtracting a recording of the noise made in the same environment. The filtering algorithm yields overall noise attenuation between 15 and 30 dB without distorting the speech signal as spectral filtering approaches do. In the subsequent acoustic analyses we examined the effect of varying levels of noise on vowel formants, glottal spectra and intensity. For most vowels we found significant rises in F1 and F2, but little variation in formant bandwidth. The overall rise in intensity between the silent and 80 dB babble noise conditions was found to be 9 dB. With growing effort, higher harmonics are boosted by up to 6 dB, whereas the average speech rate drops by only 5 %.
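
The "locate and subtract" idea can be illustrated with a minimal sketch. It assumes, hypothetically, that the noise reference is located by maximizing a brute-force cross-correlation and that the noise in the recording is a sample-exact copy of a segment of the reference; the abstract does not say how the authors locate the noise, and real recordings would also need gain and room-response matching. All names and the synthetic signals below are illustrative only.

```python
import math
import random

def locate_noise(recording, noise_ref):
    """Return the offset into noise_ref whose segment correlates best with recording."""
    n = len(recording)
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(noise_ref) - n + 1):
        # Unnormalized cross-correlation at this lag (assumed locating criterion).
        score = sum(r * noise_ref[lag + i] for i, r in enumerate(recording))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def subtract_noise(recording, noise_ref):
    """Align the noise reference to the recording and subtract it sample by sample."""
    lag = locate_noise(recording, noise_ref)
    cleaned = [r - noise_ref[lag + i] for i, r in enumerate(recording)]
    return lag, cleaned

# Synthetic demonstration: a sine wave stands in for speech, uniform noise for babble.
rng = random.Random(0)
noise_ref = [rng.uniform(-1.0, 1.0) for _ in range(400)]
speech = [0.5 * math.sin(2 * math.pi * i / 25) for i in range(200)]
true_lag = 137  # hypothetical position of the noise segment within the reference
recording = [s + noise_ref[true_lag + i] for i, s in enumerate(speech)]

lag, cleaned = subtract_noise(recording, noise_ref)
```

Because the subtraction happens in the time domain, the speech samples themselves are left untouched in this idealized case, which is the property the abstract contrasts with spectral filtering.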

 
Poster Session 3: Prosody and Speech Production 24 of 24

Lombard speech: Auditory (A), Visual (V) and AV effects

AUTHOR(S):
Davis, Chris; Department of Psychology, The University of Melbourne
Kim, Jeesun; Department of Psychology, The University of Melbourne & Graduate School of Education, Sejong University
Grauwinkel, Katja; TFH Berlin University of Applied Sciences
Mixdorff, Hansjörg; TFH Berlin University of Applied Sciences

Abstract:
This study examined Auditory (A) and Visual (V) speech (speech-related head and face movement) as a function of noise environment. Measures of AV speech were recorded for 3 males and 1 female for 10 sentences spoken in quiet as well as in four styles of background noise (Lombard speech). Auditory speech was analyzed in terms of overall intensity, duration, spectral tilt and prosodic parameters employing Fujisaki model based parameterizations of F0 contours. Visual speech was analyzed in terms of Principal Components (PC) of head and face movement. Compared to speech in quiet, Lombard speech was louder, of longer duration, had more energy at higher frequencies (particularly with babble speech) and had accent and phrase commands of greater mean amplitude.

 
Poster Session 4 (PS 4): Syntax, Semantics, Pragmatics and Prosody
Wednesday, May 3, 14:30 - 16:00
Chair: Zdena Palková
Poster Session 4: Syntax, Semantics, Pragmatics 1 of 21

A dynamical model for generating prosodic structure

AUTHOR(S):
Barbosa, Plinio; IEL/State University of Campinas

Abstract:
The performance of the Monnin-Grosjean (MG) algorithm for predicting prosodic structure is compared with that of a system of dependency-grammar-based local markers (the DG system). Analyses of Brazilian Portuguese paragraphs read by five speakers reveal that the MG algorithm performs as well as the DG system when V-to-V normalised durations at word and phrase stress boundaries are used as indexes of prominence. These two procedures, however, have proved unsuccessful in dealing with individual variability. To overcome this limitation, a dynamical model is proposed. The main advantage of the model, which couples syntactic and regularity constraints, is its plausible simulation of speaker variability. Seven simulations were carried out by changing three model parameters: coupling strength, conditional probability of phrase stress placement, and V-to-V duration mean.

 
Poster Session 4: Syntax, Semantics, Pragmatics 2 of 21

An automatic method for revising ill-formed sentences based on N-grams

AUTHOR(S):
Athanaselis, Theologos; Institute for Language and Speech Processing
Bakamides, Stelios; Institute for Language and Speech Processing
Dologlou, Ioannis; Institute for Language and Speech Processing

Abstract:
A good indicator of whether a person really knows the context of a language is the ability to use the appropriate words in the correct order in a sentence. "Scrambled" words produce meaningless and ill-formed sentences. Since the language model is extracted from a large text corpus, it encodes the local dependencies of words. Word order errors usually violate the syntactic rules locally, and therefore N-grams can be used to fix ill-formed sentences. This paper presents an approach for repairing word order errors in text by reordering the words in a sentence and choosing the version that maximizes the number of trigram hits according to a language model. The novelty of this method lies in the use of an efficient confusion matrix technique for reordering the words. The comparative advantage of this method is that it works with a large set of words and avoids the laborious and costly process of collecting word order errors for creating error patterns.
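
The selection criterion described above (pick the reordering with the most trigram hits) can be sketched as follows. This is only an illustration under stated assumptions: the "language model" is a plain set of trigrams from a toy one-sentence corpus, and candidate orders are enumerated by brute-force permutation, which is not the confusion matrix technique the abstract refers to and does not scale beyond short sentences. All function names are hypothetical.

```python
from itertools import permutations

def trigrams(tokens):
    """Return the consecutive word triples of a token sequence."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def build_trigram_set(corpus_sentences):
    """Collect every trigram seen in a (toy) training corpus."""
    known = set()
    for sentence in corpus_sentences:
        known.update(trigrams(sentence.split()))
    return known

def count_hits(tokens, known):
    """Count how many trigrams of the candidate order occur in the model."""
    return sum(1 for t in trigrams(tokens) if t in known)

def best_reordering(scrambled, known):
    """Brute-force search over all word orders (assumed stand-in for the paper's method)."""
    tokens = scrambled.split()
    best = max(permutations(tokens), key=lambda p: count_hits(p, known))
    return " ".join(best)

# Toy model and an ill-formed input sentence.
model = build_trigram_set(["the cat sat on the mat"])
repaired = best_reordering("cat the sat on mat the", model)
```

With this toy model, only one ordering of the six words reaches the maximum of four trigram hits, so the search recovers the original sentence; with a realistic model, ties and unseen trigrams would need a more careful scoring scheme.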

 
Poster Session 4: Syntax, Semantics, Pragmatics 3 of 21

Focal Pitch Accents and Subject Positions in Spanish: Comparing Close-to-Standard Varieties and Argentinean Porteño

AUTHOR(S):
Gabriel, Christoph; University of Osnabrück, FB 7

Abstract:
In Spanish, focus can be signaled by both prosodic and syntactic means. However, it remains controversial how these two components depend on one another. Based on the analysis of experimental data I argue that in Spanish focus is primarily expressed by intonation. Unlike most Spanish dialects, Argentinean Porteño allows for a tonal distinction between neutral and contrastive focus in IP-final position; in other positions focus is signaled through increased F0 values and/or syllable-internal early peak alignment. In addition, reordering of constituents can apply. Movement as a facultative strategy of focus marking is avoided in sentences with a full DP object, but strongly preferred with a clitic object. The variation encountered in the data is accounted for by combining Minimalist phrase structure building with the insights of the optimality-theoretic model of overlapping constraints.

 
Poster Session 4: Syntax, Semantics, Pragmatics 4 of 21

Prosodic Realization of Information Structure Categories in Standard Chinese

AUTHOR(S):
Chen, Yiya; Radboud University Nijmegen
Braun, Bettina; Max Planck Institute for Psycholinguistics

Abstract:
This paper investigates the prosodic realization of information structure categories in Standard Chinese. A number of proper names with different tonal combinations were elicited as a grammatical subject in five pragmatic contexts. Results show that both duration and F0 range of the tonal realizations were adjusted to signal the information structure categories (i.e. theme vs. rheme and background vs. focus). Rhemes consistently induced a longer duration and a more expanded F0 range than themes. Focus, compared to background, generally induced lengthening and F0 range expansion (the presence and magnitude of which, however, are dependent on the tonal structure of the proper names). Within the rheme focus condition, corrective rheme focus induced more expanded F0 range than normal rheme focus.

 
Poster Session 4: Syntax, Semantics, Pragmatics 5 of 21

Emphasis, Syllable Duration, and Tonal Realization in Standard Chinese

AUTHOR(S):
Chen, Yiya; Radboud University Nijmegen

Abstract:
This study examines the durational and F0 adjustments employed to convey degrees of emphasis in Standard Chinese (SC). Three speakers produced four lexical tones with varied preceding and following tones. Corrective focus, with two degrees of emphasis on the target syllable (i.e. Emphasis and More-Emphasis), was elicited, in addition to a No-Emphasis condition (as the baseline for comparison). Results showed a gradual increase of syllable duration: The magnitude of increase from the No-Emphasis to the Emphasis condition and that from the Emphasis to the More-Emphasis condition were comparable. F0 range expansion, however, was non-gradual. While there was a robust increase of F0 range from the No-Emphasis to the Emphasis condition, the expansion from the Emphasis to the More-Emphasis condition was reduced. The F0 contours of the individual tones suggest that when emphasized, tones were realized with distinctive F0 patterns, adapting to the tonal contexts and the increase of duration.

 
Poster Session 4: Syntax, Semantics, Pragmatics 6 of 21

Tonal Constituents and Meanings of Yes-No Questions

AUTHOR(S):
Hedberg, Nancy; Simon Fraser University
Sosa, Juan; Simon Fraser University
Fadden, Lorna; Simon Fraser University

Abstract:
We analyzed the different meanings associated with the tonal contours of 104 positive yes-no questions from the CallHome Corpus of American English. We take into consideration such broad constituents as the head, nucleus and tail of intonational phrases, as well as ToBI sequences of pitch accents, phrase accents and boundary tones. The meaning of a question as unmarked or marked in a variety of ways is shown to depend upon the intonational contours associated with these broad constituents, and even with the contour associated with the question as a whole.

 
Poster Session 4: Syntax, Semantics, Pragmatics 7 of 21

Prosodic Properties of Constituents Associated with Stressed 'auch' in German

AUTHOR(S):
Sudhoff, Stefan; Universität Leipzig, Institut für Linguistik
Lenertová, Denisa; Universität Leipzig, Institut für Linguistik

Abstract:
We report a production experiment and two perception studies examining the prosodic characteristics of constituents associated with the stressed variant of the German particle 'auch' (also) in potentially ambiguous constructions. The results show that these elements are marked by perceptually relevant rising pitch accents, but that there is no 1:1 mapping between the prosodic realization and the status of being associated with 'auch'.

 
Poster Session 4: Syntax, Semantics, Pragmatics 8 of 21

Russian personal pronouns in Syntax and Phonology

AUTHOR(S):
Mleinek, Ina; University of Leipzig
Werkmann, Valja; University of Leipzig

Abstract:
The question we will address is how far the syntactic positions of Russian personal pronouns affect their phonological properties. To this aim we examined their phonological behaviour in three structural slots within the sentence (first experiment) and then in the right-peripheral position associated with sentence stress (second experiment). Probing Rappaport's 1988 idea of the verb as the prosodic host for de-stressed Russian personal pronouns we wanted to know whether prosodic cliticization (indicated by times of silence/pauses and steps up/down of F0 values) is rather determined (1) by focus; (2) by morphosyntactic categories; or (3) by direction. The result is a combination of all three possibilities, and according to our second experiment, Russian personal pronouns are functional words in the broad focus condition while in conditions with contrastive and minimal foci, Russian personal pronouns can receive sentence stress and thus, behave like lexical or content words.

 
Poster Session 4: Syntax, Semantics, Pragmatics 9 of 21

Can prosodic cues and function words guide syntactic processing and acquisition?

AUTHOR(S):
Millotte, Séverine; Laboratoire de Sciences Cognitives et Psycholinguistique (EHESS-ENS-CNRS)
Wales, Roger; Faculty of Humanities and Social Sciences, La Trobe University
Dupoux, Emmanuel; Laboratoire de Sciences Cognitives et Psycholinguistique (EHESS-ENS-CNRS)
Christophe, Anne; Laboratoire de Sciences Cognitives et Psycholinguistique (EHESS-ENS-CNRS)

Abstract:
We studied the use of phonological phrase boundaries and function words in syntactic processing. French adults performed an abstract word detection task on jabberwocky sentences. We created two conditions:
- "with function word" condition: targets were directly preceded by a function word, as in "[une bamoule] [dri se froliter]" ("bamoule" is a noun), and "[tu bamoules] [saman ti]" ("bamoule" is a verb);
- "without function word" condition: targets were not directly preceded by a function word; sentence beginnings differed by their prosodic and syntactic structures, as in "[une cramona bamoule] [camiche dabou]" (noun target) vs "[une cramona] [bamoule muche] [le mirtou]" (verb target).
Function words and prosodic cues allow listeners to start building a syntactic structure. Adults were able to use phonological phrase boundaries to define syntactic boundaries, and function words to label these constituents.

 
Poster Session 4: Syntax, Semantics, Pragmatics 10 of 21

Acoustic prominence and reference accessibility in language production

AUTHOR(S):
Watson, Duane; University of Illinois Urbana-Champaign
Arnold, Jennifer; University of North Carolina, Chapel Hill
Tanenhaus, Michael; University of Rochester

Abstract:
Two experiments explored discourse and communicative factors that contribute to the perceived prominence of a word in an utterance, and how that prominence is realized acoustically. In Experiment 1 two hypotheses were tested: (1) acoustic prominence is a product of the given-new status of a word and (2) acoustic prominence depends on the degree to which a referent is accessible, where greater acoustic prominence is used for less accessible entities. In a referential communication task, speakers used acoustic prominence to indicate referent accessibility change, independent of given-new status. In Experiment 2 a variant of Tic Tac Toe was used to investigate whether effects of accessibility are driven by a need to signal the importance of a word or to indicate the word's predictability. The results indicate that both importance and predictability contribute to the prominence of a word, but in different ways.

 
Poster Session 4: Syntax, Semantics, Pragmatics 11 of 21

More than pointing with the prosodic focus: The Valence-Intensity-Domain (VID) model

AUTHOR(S):
Aubergé, Véronique; ICP CNRS
Rilliard, Albert; ICP CNRS

Abstract:
This paper summarizes several perception experiments showing that the morphology of the prosodic focus conveys more than the information of the deixis function: (1) the binary valence (yes/no focus), which is perceptually quite categorical (a magnet effect is clear on the basis of an identification and a discrimination experiment); (2) the intensity information, used by the speaker to express a preference between two focused elements; and (3) the focus domain information, that is, segmentation cues about the focused element (phonological unit or word unit), which are perceptually identified by listeners. The morphological cues revealing Valence-Intensity-Domain are observed in particular in a morphing procedure that makes clear the thresholds of these quite categorical behaviors.

 
Poster Session 4: Syntax, Semantics, Pragmatics 12 of 21

Focus-related pitch range manipulation (and peak alignment effects) in Egyptian Arabic

AUTHOR(S):
Hellmuth, Sam; Department of Linguistics, SOAS, University of London & Institut für Linguistik, Universität Potsdam

Abstract:
This paper explores focus-related effects on pitch range and on peak alignment in Egyptian Arabic (EA), and the interaction between them. Qualitative analysis of elicited focus data shows that even when post-focal and 'given', EA words bear a pitch accent. Quantitative analysis reveals gradient effects of focus in the form of pitch range manipulation, which, however, reflects identificational/contrastive focus, not information focus. Peak alignment shows an indirect effect of post-focal F0 compression. It is argued that pitch range manipulation is used in EA to express identificational/contrastive focus only (not given-new/information focus); the effect is argued to be phonologically gradient because the effects emerge not only on focused items, as F0 expansion, but also on post-focal items, in the form of F0 compression and earlier peak alignment.

 
Poster Session 4: Syntax, Semantics, Pragmatics 13 of 21

An Experimental Study on the Assignment of Focus Accent in Mandarin

AUTHOR(S):
Wang, Yunjia; Peking University
Chu, Min; Microsoft Research Asia

Abstract:
This paper investigates the distribution of focus-related accents in the broad focus domain in Chinese Mandarin through 300 natural sentences. The results show that focus-related accent tends to be assigned to the predicate in a subject-predicate structure, to the object in a predicate-object structure, and to the head in an adjunct-head structure unless the head is highly predictable. From these observations, we conclude that, in a broad focus structure in Chinese Mandarin, the focus-related accent is normally assigned to the innermost constituent of the sentence if this constituent has enough semantic weight; otherwise, the accent is placed in the constituent that has the closest syntactic relationship to the innermost one.

 
Poster Session 4: Syntax, Semantics, Pragmatics 14 of 21

Predicting Prosodic Phrasing Using Linguistic Features

AUTHOR(S):
Yoon, Tae-Jin; University of Illinois at Urbana-Champaign

Abstract:
The prosodic structure of speech is based on complex interaction within and between several different levels of linguistic and paralinguistic organization. Though leading theories of prosody maintain that prosody is shaped through the interaction of grammatical factors from phonology, syntax, semantics, and pragmatics, there is no consensus on how to model their interaction. I provide a new probabilistic model of the mapping between prosody and phonology, syntax, and argument structure. The model encodes phonological features, shallow syntactic constituent structure, and basic argument structure. A machine learning experiment using these features achieves more than 92% accuracy in predicting prosodic boundary location. An experiment predicting the strength of prosodic boundaries achieves 88.06% accuracy. This study sheds light on the relationship between prosodic phrase structure and other grammatical structures.

 
Poster Session 4: Syntax, Semantics, Pragmatics 15 of 21

Utterance Final Forms in Dialogues by Young Japanese: A Syntactic and Prosodic Analysis

AUTHOR(S):
Nishinuma, Yukihiro; CNRS Laboratoire Parole et Langage
Hayashi, Akiko; Chuo University
Yabe, Hiroko; Tokyo Gakugei University

Abstract:
This work reports findings on the relationship between speaker sex and linguistic behavior among young Japanese in explanation-giving dialogues. The relationship between speaker sex and (1) the choice of utterance final forms and (2) the prosodic characteristics of these forms was examined. Data obtained from 110 students of the Tokyo area revealed no statistically significant effect of the sex factor on the syntactic forms used. However, utterance final syllables had a statistically significant effect both on rhythm and intonation.

 
Poster Session 4: Syntax, Semantics, Pragmatics 16 of 21

Cross-dialectal Turn Exchange Rhythm in English Interviews

AUTHOR(S):
Fon, Janice; Graduate Institute of Linguistics, National Taiwan University

Abstract:
This study looked at the relationship between rhythm and exchange type in a stress-timed language, British English, and a syllable-timed language, Singaporean English, using a spontaneous speech corpus. Exchange intervals (EIs) were measured and different exchange types were labeled. Results showed that in a dialog, EIs were generally limited to a narrow range. However, within the range, EIs had four functions. First, EIs indicated linguistic rhythm. Singaporean English tended to have shorter EIs. Second, EIs reflected the cognitive load and tightness of coupling in differentiating various exchange types. EIs in pairs requiring more cognitive resources and tighter coupling were longer than in those not as cognitively-loaded but were shorter than in pairs not as tightly coupled. Moreover, EIs reflected discourse organization. Topic initiation EIs were longer than topic ending ones. Finally, the degree of politeness correlated positively with EI. Asian females' EIs were lengthened.

 
Poster Session 4: Syntax, Semantics, Pragmatics 17 of 21

The Effect of Paralinguistic Emphasis on F0 Contours of Cantonese Speech

AUTHOR(S):
Gu, Wentao; The University of Tokyo
Hirose, Keikichi; The University of Tokyo
Fujisaki, Hiroya; The University of Tokyo

Abstract:
Emphasis has a significant effect on F0 contours in various languages, among which tone languages require more careful study because their F0 contours show complex interaction between lexical tones and phrase intonation. Here we employ the command-response model to investigate the effect of paralinguistic emphasis in Cantonese, a typical tone language with nine lexical tones. Following our previous study on target syllables in a fixed carrier frame, the current study continues to investigate the utterances with natural context, in which the effects of emphasis with different scopes and on different parts of utterance are compared. It shows that the major effect of emphasis is not on tone commands but on phrase commands. The narrowness/broadness of emphasis can be distinguished by the number of phrase commands being affected in the phonetic realization. By use of the command-response model, F0 contours for expressive speech conveying emphasis information can be generated efficiently.

 
Poster Session 4: Syntax, Semantics, Pragmatics 18 of 21

Prosodic and informational aspects of polar questions in Neapolitan Italian

AUTHOR(S):
Crocco, Claudia; University of Salerno

Abstract:
In this paper the relation between prosodic form and meaning is investigated in a sample of polar questions in Neapolitan Italian, taken from 4 Map Task dialogues. The sample is analyzed from both the informational and the prosodic point of view. The information analysis found 4 groups of questions, distinguished by their function or by the degree of accessibility of the referents they contain. The groups were then related to the conversational Map Task moves and to the results of the prosodic analysis. The results of this analysis show that yes-no questions (YNQs) in Neapolitan Italian have a common prosodic pattern. Their different functions, i.e. confirmation-seeking and information-seeking, are expressed with a variety of means that, together with the information provided by the context, concur to orient the interpretation.

 
Poster Session 4: Syntax, Semantics, Pragmatics 19 of 21

Argument Structure and Focus Projection in Korean

AUTHOR(S):
Kim, Hee-Sun; Stanford University
Jun, Sun-Ah; UCLA
Lee, Hyuck-Joon; UCLA
Kim, Jong-Bok; Kyunghee University

Abstract:
It has been claimed that syntactic structures and the argument types can determine the domain of focus: focus on a particular type of internal argument may project its focus domain to a larger syntactic constituent than the focused item. It is also known that focus often has prosodic reflections through the manipulations of prosodic phrasing, prominence relation of words, and duration. This paper examines the relationship between the focus projection and the argument structure in Korean by investigating the prosodic correlates of focus. Results show that there is no sensitivity of argument type in projecting the domain of focus to Verb Phrase. Regardless of argument types or word order, VP focus was prosodically marked at the VP-initial word by initiating a large intonational phrase boundary, raising its pitch peak, and lengthening of the VP-initial syllable and word. The results do not support the claim that the argument structure is an important factor in focus projection.

 
Poster Session 4: Syntax, Semantics, Pragmatics 20 of 21

Interface between information structure and intonation in Dutch WH-questions

AUTHOR(S):
Chen, Aoju; Max Planck Institute for Psycholinguistics

Abstract:
This study concerns how accent placement is pragmatically governed in Dutch WH-questions. Central to this issue are questions such as how the intonation of the WH-word is related to the information structure of the non-WH word part, whether topical constituents can be accented, and what the nature of the accents in the non-WH word part is. Different treatments of these questions in earlier approaches result in conflicting predictions on the intonation of WH-questions. We addressed these questions by analysing a corpus of 90 naturally occurring WH-questions. Results show that the intonation of the WH-word is related to the information structure of the non-WH word part. Moreover, topical constituents can be accented, and accents in the non-WH word part are not necessarily phonetically reduced. In addition, we observed that the speaker may have a communicative motivation, beyond the pragmatic one, to accent the WH-word or adverbs that are not part of the presupposition.

 
Poster Session 4: Syntax, Semantics, Pragmatics 21 of 21

Syntactic and prosodic parenthesis

AUTHOR(S):
Peters, Jörg; Radboud University Nijmegen

Abstract:
This paper examines the view that parentheticals obligatorily form an intonational phrase and break up the intonational phrase of the matrix sentence into two intonational phrases. The analysis of spontaneous speech data of Hamburg German shows that neither do all parentheticals form a distinct intonational phrase nor do all parentheticals break up the intonational phrase of the matrix sentence. The most frequent type of prosodic integration is prosodic parenthesis, which is the insertion of one intonational phrase into another and parallels parenthesis on the syntactic level. Additional analyses reveal that the size of the parenthetical and the syntactic integration of the parenthetical into the matrix sentences affect its prosodic integration. Finally, it is argued that the distinction between syntactic and prosodic parenthesis can solve common problems in defining parentheticals.

 
Special Session 3 (SPS 3): Understanding Emotions in Speech: Neural and Cross-cultural Evidence
Organizers: Sonja A. Kotz and Marc D. Pell
Wednesday, May 3, 16:00 - 18:00
Special Session 3: Understanding Emotions in Speech: Neural and Cross-cultural Evidence 1 of 5

Non-verbal expressions of emotion - acoustics, valence and cross cultural factors

AUTHOR(S):
Scott, Sophie; Institute of Cognitive Neuroscience, University College London
Sauter, Disa; Institute of Cognitive Neuroscience, University College London

Abstract:
This presentation will address aspects of the expression of emotion in non-verbal vocal behaviour, specifically attempting to determine the roles of both positive and negative emotions, their acoustic bases, and the extent to which these are recognized in non-Western cultures.

 
Special Session 3: Understanding Emotions in Speech: Neural and Cross-cultural Evidence 2 of 5

Implicit recognition of vocal emotions in native and non-native speech

AUTHOR(S):
Pell, Marc D.; School of Communication Sciences and Disorders, McGill University, Montréal

Abstract:
There is evidence for both cultural specificity and 'universality' in how listeners recognize vocal expressions of emotion from speech. This paper summarizes some of the early findings using the Facial Affect Decision Task which speak to the implicit processing of vocal emotions as inferred from "emotion priming" effects on a conjoined facial expression. We provide evidence that English listeners register the emotional meanings of prosody when processing sentences spoken by native (English) as well as non-native (Arabic) speakers who encoded vocal emotions in a culturally appropriate manner. We also discuss the time course for activating emotion-related knowledge in a native and a non-native language, which may differ due to cultural influences on vocal emotion expression.

 
Special Session 3: Understanding Emotions in Speech: Neural and Cross-cultural Evidence 3 of 5

Examining the neural mechanisms involved in the affective and pragmatic coding of prosody

AUTHOR(S):
Grandjean, Didier; Swiss Centre of Affective Sciences, University of Geneva
Scherer, Klaus R.; Swiss Centre of Affective Sciences, University of Geneva

Abstract:
The vocal expression of humans includes expressions of emotions, such as anger or happiness, and pragmatic intonations, such as interrogative or affirmative, embedded within the language. These two types of prosody are differently affected by the so-called push and pull effects. Push effects, influenced by psychophysiological activities, strongly affect emotional prosody, whereas pull effects, influenced by cultural rules of expression, predominantly affect intonation or pragmatic prosody, even though both processes influence all prosodic production. Two empirical studies are described that exemplify the possibilities of dissociating emotional and linguistic prosody decoding at the neurological level. The first study was conducted to investigate the impairments in prosody recognition related to left or right temporoparietal brain-damaged patients. The second study used electroencephalography in healthy participants to investigate the timing of information processing during emotional and linguistic prosody recognition tasks. The results highlight the importance of considering not only the distinction of different types of prosody, but also the relevance of the task realized by the participants to better understand information processes related to human vocal expression at the suprasegmental level.

 
Special Session 3: Understanding Emotions in Speech: Neural and Cross-cultural Evidence 4 of 5

Development of the Brain Mechanism for Understanding Speakers' Intent from Speech

AUTHOR(S):
Imaizumi, Satoshi; Prefectural University of Hiroshima
Noguchi, Yuki; Prefectural University of Hiroshima
Homma, Midori; Prefectural University of Hiroshima
Yamasaki, Kazuko; Prefectural University of Hiroshima
Maruishi, Masaharu; Hiroshima Prefectural Rehabilitation Center
Muranaka, Hiroyuki; Hiroshima Prefectural Rehabilitation Center

Abstract:
To clarify how the brain understands the speaker's mind for verbal acts, fMRI images obtained from 24 subjects and behavioral data obtained from 339 subjects were analyzed when they judged the linguistic meanings or emotional manners of spoken phrases. The target phrases had linguistically positive or negative meanings and were uttered warmheartedly or coldheartedly by a female speaker. The results of the fMRI analyses suggest that neural resources responsible for reading the speaker's mind are distributed over the superior temporal sulci, inferior frontal regions, medial frontal regions and posterior cerebellum. The correct judgment of the speaker's intentions significantly increased with age for the phrases with inconsistent linguistic and emotional valences. Female children showed faster development than male children. The neural mechanism for interpreting the speaker's real intentions from spoken phrases develops slowly during school age.

 
Special Session 3: Understanding Emotions in Speech: Neural and Cross-cultural Evidence 5 of 5

efMRI Evidence for Implicit Emotional Prosodic Processing

AUTHOR(S):
Kotz, Sonja A.; Max Planck Institute of Human Cognitive and Brain Sciences, Leipzig
Paulmann, Silke; Max Planck Institute of Human Cognitive and Brain Sciences, Leipzig
Raettig, Tim; Max Planck Institute of Human Cognitive and Brain Sciences, Leipzig

Abstract:
The current efMRI experiment investigated the potential right hemisphere dominance of emotional prosodic processing under implicit task demands. Participants evaluated the relative tonal height (high, medium, low) of intelligible and unintelligible sentences spoken by a trained female speaker of German with three prosodic contours: happy, angry, and neutral. The results confirm the activation of a bilateral fronto-striato-temporal network with no clear right hemispheric preference for emotional prosodic processing. The data suggest that (1) task demands do not significantly alter lateralization of function in the current context, and (2) frontostriatal brain areas engage during implicit processing of emotional prosody and thus do not seem to be task-specific.

 
Oral Session 3 (OS 3): Prosody and Speech Perception
Thursday, May 4, 09:45 - 11:25
Chair: Ailbhe Ní Chasaide
Oral Session 3: Prosody and Speech Perception 1 of 5

Phrase-Final Pitch Discrimination in English

AUTHOR(S):
Cummins, Fred; University College Dublin
Doherty, Colin; Royal College of Surgeons in Ireland
Dilley, Laura; The Ohio State University

Abstract:
We investigate the discrimination of phrase final pitch contours within a continuum from statement to question in English. Previous work in German and Dutch has raised questions about the relationship between discrimination sensitivity and category structure within this continuum. To clarify the relationship between linguistic category and simple auditory discrimination, we employ both speech and non-speech stimuli. For all stimuli, we find a discrimination peak at the point in the continuum where a pitch fall changes to a pitch rise. This peak does not appear to be related to the category boundary for speech stimuli, as revealed in a labeling task. Discrimination was somewhat better for non-speech stimuli than speech.

 
Oral Session 3: Prosody and Speech Perception 2 of 5

The role of articulation rate in distinguishing fast and slow speakers

AUTHOR(S):
Koreman, Jacques; Saarland University

Abstract:
This article discusses differences in articulation rate between fast and slow speakers in a production experiment. It is shown that fast and slow speakers differ in their articulation rates, both in terms of the number of phones in the canonical form (intended rate) as well as the number of phones present in the actual realization (realized rate). The articulatory precision index, which indicates the relative deletion rate, also differs for these speakers. The same differences are observed for fast and slow inter-pause stretches in a large German database of spontaneous speech. Both in the database and for the production experiment, however, there is considerable overlap between the measurements for fast and slow speakers. This shows that other factors also play a role in distinguishing fast and slow speakers or inter-pause stretches. The relationship between these factors and the articulation rates is discussed.

 
Oral Session 3: Prosody and Speech Perception 3 of 5

Toddlers are sensitive to prosodic correlates of disfluency in spontaneous speech

AUTHOR(S):
Soderstrom, Melanie; Brown University
Morgan, James L.; Brown University

Abstract:
The ability to distinguish fluent from disfluent speech could play an important role in infants' acquisition of their first language. Across two experiments using a Headturn Preference Procedure, we show that infants are able to distinguish fluent from disfluent speech based on its prosodic characteristics, and show a preference for listening to fluent English. In the first experiment, 22-month-old, but not 10-month-old, infants preferred to listen to fluent adult-directed speech samples over disfluent matched speech samples. In the second experiment, lexical and grammatical information were removed. Older infants still discriminated fluent from disfluent speech, but showed the reverse preference, for disfluent speech.

 
Oral Session 3: Prosody and Speech Perception 4 of 5

Modelling Hesitation for Synthesis of Spontaneous Speech

AUTHOR(S):
Carlson, Rolf; TMH, CSC, KTH, Stockholm, Sweden
Gustafson, Kjell; Acapela Group, Stockholm, Sweden
Strangert, Eva; Phonetics, Umeå University, Sweden

Abstract:
The current work deals with the modelling of one type of disfluency, hesitations. A perceptual experiment using speech synthesis was designed to evaluate two duration features found to be correlates of hesitation: pause duration and final lengthening. A variation of F0 slope before the hesitation was also included. The most important finding is that the total duration increase, rather than the contribution of either factor alone, is the valid cue. In addition, our findings lead us to assume an interaction with syntax. The absence of strong effects of the induced F0 variation was unexpected, and we consider several possible explanations for this result.

 
Oral Session 3: Prosody and Speech Perception 5 of 5

Neural correlates of rhythm processing in speech perception

AUTHOR(S):
Geiser, Eveline; University of Zurich
Schmidt, Conny; University of Zurich
Jancke, Lutz; University of Zurich
Meyer, Martin; University of Zurich

Abstract:
The present study investigates the neural correlates of speech perception. Metric and non-metric German pseudo-sentences were compared in an fMRI investigation. One group of subjects was to decide which type of sentence they had heard. A second group performed a prosody task on the same stimuli. Group analysis revealed activation in the supplementary motor area (SMA), for the explicit processing group. This activation was not present in the implicit processing group. A direct contrast between the metric and the non-metric sentences for the implicit processing group revealed significant activation in the left planum temporale (PT) for the metric condition. Our results suggest that rhythm processing relies on neural correlates different from those related to speech melody processing. The implicit perception of unexpected speech rhythm relies on brain areas which have earlier been associated with temporal auditory processing in the left hemisphere.

 
Oral Session 4 (OS 4): Affective Speech
Thursday, May 4, 11:50 - 13:10
Chair: Véronique Aubergé
Oral Session 4: Affective Speech 1 of 4

Expressing anger and joy with the size code

AUTHOR(S):
Chuenwattanapranithi, Suthathip; Department of Computer Engineering, King Mongkut's University of Technology Thonburi
Xu, Yi; University College London and Haskins Laboratories
Thipakorn, Bundit; Department of Computer Engineering, King Mongkut's University of Technology Thonburi
Maneewongvatana, Songrit; Department of Computer Engineering, King Mongkut's University of Technology Thonburi

Abstract:
This paper reports our finding that a proposed biological code, the size code, is used in angry and joyful speech. In searching for explanations for an F0 peak delay phenomenon related to angry speech that cannot be accounted for by known articulatory constraints, we hypothesized that the delay was due to the lowering of the larynx to exaggerate body size, a biological code known to be used by animals. Our analysis of the formant frequencies in existing emotional speech databases revealed that angry speech had lowered formants and joyful speech had raised formants. The results confirm our hypothesis and suggest that the size code is being actively used by humans to express emotions.

 
Oral Session 4: Affective Speech 2 of 4

Emotion Elicitation in a Computerized Gambling Game

AUTHOR(S):
Aharonson, Vered; Tel Aviv Academic College of Engineering
Amir, Noam; Tel Aviv University

Abstract:
We have designed a novel computer controlled environment that elicits emotions in subjects while they are uttering short identical phrases. The paradigm is based on Damasio's experiment for eliciting apprehension and is implemented in a voice activated computer game. For six subjects we have obtained recordings of dozens of identical sentences, which are coupled to events in the game - gain or loss of points. Prosodic features of the recorded utterances were extracted and classified. The resultant classifier gave 78-85 % recognition of presence/absence of apprehension.

 
Oral Session 4: Affective Speech 3 of 4

Pauses in Deceptive Speech

AUTHOR(S):
Benus, Stefan; Columbia University
Enos, Frank; Columbia University
Hirschberg, Julia; Columbia University
Shriberg, Elizabeth; SRI & ICSI

Abstract:
We use a corpus of spontaneous interview speech to investigate the relationship between the distributional and prosodic characteristics of silent and filled pauses and the intent of an interviewee to deceive an interviewer. Our data suggest that the use of pauses correlates more with truthful than with deceptive speech, and that prosodic features extracted from filled pauses themselves as well as features describing contextual prosodic information in the vicinity of filled pauses may facilitate the detection of deceit in speech.

 
Oral Session 4: Affective Speech 4 of 4

Mapping Voice to Affect: Japanese Listeners

AUTHOR(S):
Yanushevskaya, Irena; Phonetics and Speech Laboratory, Trinity College Dublin
Gobl, Christer; Phonetics and Speech Laboratory, Trinity College Dublin
Ní Chasaide, Ailbhe; Phonetics and Speech Laboratory, Trinity College Dublin

Abstract:
This paper reports the results of perception tests administered to speakers of Japanese as part of a cross-language investigation of how voice quality and f0 combine in the signalling of affect. Three types of synthesised stimuli were presented: (1) 'VQ only', involving variations in voice quality and a neutral f0; (2) 'f0 only', with different f0 contours and modal voice; and (3) combined 'VQ + f0' stimuli, where combinations of (1) and (2) were employed. Overall, stimuli involving voice quality variation (1 and 3) proved to be most consistently associated with affect. In series (2) only stimuli with very high f0 yielded high affective ratings. Some striking differences emerge in the ratings obtained for Japanese subjects compared to those obtained for speakers of Hiberno-English, suggesting that the generation of expressive speech synthesis will need to be sensitive to language-specific uses of the voice.

 
Poster Session 5 (PS 5): Speech Technology - Part I: Speech Synthesis
Thursday, May 4, 14:30 - 16:00
Chairs: Keikichi Hirose / Plinio Barbosa
Poster Session 5: Speech Technology - Part I: Speech Synthesis 1 of 34

Rule-based Prosody Prediction for German Text-to-Speech Synthesis

AUTHOR(S):
Becker, Stephanie; Saarland University, Saarbrücken
Schröder, Marc; DFKI GmbH, Saarbrücken
Barry, William J.; Saarland University, Saarbrücken

Abstract:
This paper presents two empirical studies that examine the influence of different linguistic aspects on prosody in German. First, we analysed a German corpus with respect to the effect of syntax and information status on prosody. Second, we conducted a listening test which investigated the prosodic realisation of constituents in the German 'Vorfeld' depending on their information status. The results were used to improve the prosody prediction in the German text-to-speech synthesis system MARY.

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 2 of 34

Duration Prediction in Mandarin TTS System

AUTHOR(S):
Guo, Qing; Fujitsu Research and Develop Center China, Beijing
Katae, Nobuyuki; Fujitsu Laboratories Ltd.

Abstract:
This paper reports the methodology and results of decision-tree-based duration prediction for the Mandarin text-to-speech system developed by Fujitsu Laboratories. Syllable initials and finals are the basic units in our duration study. In this paper, factors influencing the finals, such as phrase boundary and phone context, are discussed in detail. Experiments indicate that the prosodic factor of whether the right phrase boundary is at the prosodic word level or at a higher level is the most important determinant of duration. Furthermore, the degree of phrase boundary vowel lengthening may vary depending on the type of final. This paper also explains the methods for objective evaluation of the performance of the duration prediction model. Finally, prosody evaluation results indicate that the prosody generated by our prosody generation module is much better than that of two well-known Mandarin TTS systems.

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 3 of 34

Adaptation of Prosodic Phrasing Models

AUTHOR(S):
Bell, Peter; Centre for Speech Technology Research, University of Edinburgh
Burrows, Tina; Speech Technology Group, Toshiba Research Europe Ltd
Taylor, Paul; Department of Engineering, University of Cambridge

Abstract:
There is considerable variation in the prosodic phrasing of speech between different speakers and speech styles. Due to the time and cost of obtaining large quantities of data to train a model for every variation, it is desirable to develop models that can be adapted to new conditions with a limited amount of training data. We describe a technique for adapting HMM-based phrase boundary prediction models which alters a statistical distribution of prosodic phrase lengths. The adapted models show improved prediction performance across different speakers and types of spoken material.

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 4 of 34

F0 and Segment Duration in Formant Synthesis of Speaker Age

AUTHOR(S):
Schötz, Susanne; Linguistics and Phonetics, Centre for Languages and Literature, Lund University

Abstract:
This paper describes the work with F0 and segment duration when developing a prototype system for analysis of speaker age using data-driven formant synthesis. The system was developed to extract 23 parameters from the test words - spoken by four differently aged female speakers of the same dialect and family - and to generate synthetic copies. Audio-visual feedback enabled the user to compare the natural and synthetic versions and facilitated parameter adjustment. Next, weighted linear interpolation was used in a first crude attempt to synthesize speaker age. Evaluation of the system revealed its strengths and weaknesses, and suggested further improvements. F0 and duration performed better than most other parameters.

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 5 of 34

High Resolution Speech F0 Modification

AUTHOR(S):
Bardi, Tamas; Faculty of Information Technology, Peter Pazmany Catholic University, Budapest

Abstract:
The present paper proposes a new algorithm for pitch modification that can change the fundamental frequency of speech with a resolution fine enough to be at least comparable with human pitch perception. Using the proposed method, measurements of just noticeable changes in speech prosody become possible. High resolution F0 manipulation is achieved without explicit over-sampling of the signal; our FFT-based fast interpolation technique is used instead. Our algorithm is based on the LP-PSOLA method. Though its frequency resolution was enhanced especially for research purposes, the need for it may also arise in real applications of expressive speech synthesis in the future.

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 6 of 34

Effects of Prosodic Factors on Spectral Balance: Analysis and Synthesis

AUTHOR(S):
Miao, Qi; Oregon Health & Science University
Niu, Xiaochuan; Oregon Health & Science University
Klabbers, Esther; Oregon Health & Science University
van Santen, Jan; Oregon Health & Science University

Abstract:
In natural speech, prosodic factors such as accent, stress, phrasal position and speaking style play important roles in controlling several acoustic features, including segmental duration, pitch, and spectral balance. To synthesize speech that sounds natural, these effects need to be accurately modeled. In this study we describe and evaluate a synthesis method that mimics the effects of prosodic factors on spectral balance. We measure spectral balance by using the energy in four broad frequency bands that correspond to formant frequency ranges. An additive model is used to capture the effects of prosodic factors on spectral balance. A new sinusoidal synthesis module is implemented under Festival to predict the target spectral balance from analysis results and apply it to the amplitude parameters of the sinusoidal model during synthesis. We evaluate an important strength of this system, which is its ability to reduce spectral discontinuities in unit concatenation.

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 7 of 34

Decomposition of Pitch Curves in the General Superpositional Intonation Model

AUTHOR(S):
Mishra, Taniya; Oregon Health & Science University
van Santen, Jan; Oregon Health & Science University
Klabbers, Esther; Oregon Health & Science University

Abstract:
This paper describes and applies a new algorithm for decomposing pitch curves into component curves, in accordance with the General Superpositional Model of Intonation. According to this model, which is a generalization of the Fujisaki model, a pitch contour can be described as the sum of component curves that are each associated with different phonological levels, including the phrase, foot, and phoneme. The algorithm assumes that the phrase curve is locally linear during intervals spanned by a foot. The algorithm was evaluated using synthetically generated curves, and was found to accurately recover the synthetic component curves. The algorithm was also evaluated in a perceptual experiment, where speech generated by concatenation of accent curves was shown to produce better speech quality than speech based on direct concatenation of "raw" pitch curve fragments.

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 8 of 34

An innovative F0 modeling approach for emphatic affirmative speech, applied to the Greek language

AUTHOR(S):
Giannopoulos, Gergios; Institute for Language and Speech Processing
Chalamandaris, Aimilios; Institute for Language and Speech Processing

Abstract:
In this paper we present an innovative algorithm for modelling the fundamental frequency (F0) of Greek sentences containing emphatic segments. The main idea of our approach is the definition of a specific set of intonation word models, derived from a spoken corpus, which is sufficient for modelling the pitch contour of arbitrarily long, similarly structured sentences. Our method is based on a prosodic unit selection approach. The system was designed and trained on a spoken corpus of 120 naturally uttered weather forecast sentences containing emphatic segments, and has proved to be very efficient in coping with similarly structured sentences. In the first section of the paper we present a brief review of the existing literature in this field, together with analogous approaches for other languages. In the second section we present our method and the design procedure. The last two sections contain the preliminary results of our experiments, as well as conclusions and future work that needs to be carried out.

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 9 of 34

Prosody generation in the Speech-to-Speech Translation Framework

AUTHOR(S):
Agüero, Pablo Daniel; UPC
Adell, Jordi; UPC
Bonafonte, Antonio; UPC

Abstract:
This paper deals with speech synthesis in the framework of speech-to-speech translation. Our current focus is to translate speeches or conversations between humans so that a third person can listen to them in his or her own language. In this framework the style is not written but spoken, and the original speech includes a lot of nonlinguistic information (such as speaker emotion). In this work we propose the use of prosodic features in the original speech to produce prosody in the target language. Relevant features are found using an unsupervised clustering algorithm that finds, in a bilingual speech corpus, intonation clusters in the source speech which are relevant in the target speech. Preliminary results already show a significant improvement in the synthetic quality.

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 10 of 34

Facing data scarcity using variable feature vector dimension

AUTHOR(S):
Agüero, Pablo Daniel; UPC
Bonafonte, Antonio; UPC

Abstract:
This paper focuses on three key points of intonation modelling: interpolation of the fundamental frequency contour, sentence-by-sentence parameter extraction, and data scarcity. In some cases, they introduce noise and inconsistency in the training data, reducing the performance of machine learning techniques. We consider the F0 contour to be segmented into prosodic units (such as accent groups, minor phrases, etc.). Each segment of the F0 contour has a corresponding feature vector with linguistic and non-linguistic components. We propose to address the limitations mentioned above using a technique based on clustering with different feature vector dimensions. The clustering of feature vectors also produces a partition of the F0 contour space. The proposal consists of a procedure to select the dimension that yields the best predicted fundamental frequency contour, in an RMSE sense, compared to a reference contour. Experimental results show an improvement compared to other approaches.
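The RMSE-based selection step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not describe the clustering or the data layout, so the `candidates` mapping (feature-vector dimension to predicted contour), the function names, and the toy F0 values below are all assumptions made for illustration only.

```python
import math

def rmse(predicted, reference):
    """Root-mean-square error between two equal-length F0 contours (Hz)."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("contours must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(reference))

def select_dimension(candidates, reference):
    """Return (dimension, error) for the candidate contour with the lowest RMSE.

    `candidates` is a hypothetical mapping from a feature-vector dimension
    to the F0 contour predicted when clustering uses that dimension.
    """
    best = min(candidates, key=lambda d: rmse(candidates[d], reference))
    return best, rmse(candidates[best], reference)

# Toy values, invented for the example.
reference = [120.0, 130.0, 125.0, 110.0]
candidates = {
    2: [118.0, 126.0, 127.0, 114.0],
    4: [121.0, 129.0, 124.0, 111.0],
    8: [125.0, 135.0, 120.0, 105.0],
}
best_dim, best_err = select_dimension(candidates, reference)
```

With these toy values the four-dimensional candidate is selected, since it has the smallest error against the reference contour.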

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 11 of 34

Disfluent Speech Analysis and Synthesis: a preliminary approach

AUTHOR(S):
Adell, Jordi; Universitat Politècnica de Catalunya
Bonafonte, Antonio; Universitat Politècnica de Catalunya
Escudero, David; Universidad de Valladolid

Abstract:
High-quality speech synthesisers based on unit selection exist, but they are based on a reading-style approach. However, new applications such as Speech-to-Speech Translation or Speech User Interfaces call for a talking style, which is more natural in these contexts. Disfluencies are a major characteristic of talking style. It is thus desirable to be able to generate disfluent speech. In the present paper a preliminary analysis of the pitch and segmental duration of repetitions and filled pauses is presented. Simple rules to predict these prosodic features are derived from this analysis and used for synthesis. Evaluation shows an increase in naturalness, while overall quality decreases.

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 12 of 34

Structural Data-Driven Prosody Model for TTS Synthesis

AUTHOR(S):
Romportl, Jan; Department of Cybernetics, University of West Bohemia in Pilsen

Abstract:
This paper introduces a new data-driven prosody model for the text-to-speech system ARTIC. The model is intended to be almost language-independent and to generate natural-sounding intonation with a link to semantics. It is based on text parametrisation using a new prosodic grammar and on automatic speech corpora analysis methods. Its performance is evaluated using the results of the listening tests presented.

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 13 of 34

Language- and Speaker-Specific Implementation of Intonation Contours in Multilingual TTS Synthesis

AUTHOR(S):
Lobanov, Boris; United Institute of Informatics Problems, Minsk
Tsirulnik, Liliya; United Institute of Informatics Problems, Minsk
Zhadinets, Dmitry; United Institute of Informatics Problems, Minsk
Karnevskaya, Helena; Minsk Linguistic State University

Abstract:
The paper is concerned with the study of complete/incomplete phrase intonation and its language- and speaker-specific peculiarities. A phrase, according to the model used, is represented by a sequence of accentual units consisting of prenucleus, nucleus and post-nucleus. The procedure of speech test material preparation and techniques for language- and speaker-specific intonation analysis are described. The results of intonation analysis have been obtained on materials of Russian and Polish native speakers reading aloud a text. The implementation of intonation 'portraits' in the unified text-to-speech synthesis system for Slavonic languages with the ability of personal speaking manner cloning is discussed.

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 14 of 34

Statistical Study of Speaker's Peculiarities of Utterances into Phrases Segmentation

AUTHOR(S):
Lobanov, Boris; United Institute of Informatics Problems, Minsk
Tsirulnik, Liliya; United Institute of Informatics Problems, Minsk

Abstract:
The report is concerned with an experimental study of idiosyncrasies in the segmentation of utterances into phrases, as observed in the speech of a popular Russian TV anchorman and two TV news speakers. The audio recordings were initially transcribed, and the primary and secondary stresses, as well as phrase boundaries and phrase intonation types, were identified. Comparative statistical estimates were computed for the relative frequencies of occurrence of pauses of various durations, of phrases with different numbers of accent units (AUs), and of pairs of phrases with various numbers of AUs. The results of the study have been applied to a system for individual voice cloning using text-to-speech synthesis.

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 15 of 34

Rule-based Generation of Phrase Components in Two-step Synthesis of Fundamental Frequency Contours of Mandarin

AUTHOR(S):
Sun, Qinghua; Graduate School of Engineering, University of Tokyo
Hirose, Keikichi; Graduate School of Information Science and Technology, University of Tokyo
Gu, Wentao; Graduate School of Information Science and Technology, University of Tokyo
Minematsu, Nobuaki; Graduate School of Frontier Sciences, University of Tokyo

Abstract:
A rule-based method was developed for realizing phrase components in our two-step generation of fundamental frequency (F0) contours of Mandarin. Motivated by the F0 contour generation process model, the two-step scheme treats (logarithmic) F0 contours as the superposition of tone components on phrase components, which are assumed to be responses to phrase commands. Phrase components that are too long cause a flat F0 contour close to the baseline, which is not the case in human speech. In tone languages such as Mandarin, tone components can be negative. Hence, to leave a margin for downward F0 movement, phrase components need to stay above a certain level, which requires more frequent phrase commands than in non-tonal languages. Based on these facts, simple rules were constructed for phrase component generation. Speech synthesis was conducted using F0 contours generated by the method. The results of a listening test showed that the method achieves good control of F0 contours.
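For readers unfamiliar with the underlying model, the commonly cited form of the F0 contour generation process model (often called the Fujisaki model) is sketched below. The abstract gives no formulas, so the symbols ($F_b$, $A_{p,i}$, $A_{t,j}$, $\alpha$, $\beta$, $\gamma$) follow the usual textbook notation and are assumptions here. For Mandarin, the accent commands of the original model are taken to be tone commands whose amplitudes $A_{t,j}$ may be negative.

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i=1}^{I} A_{p,i}\, G_p(t - T_{0,i})
  + \sum_{j=1}^{J} A_{t,j} \left[ G_t(t - T_{1,j}) - G_t(t - T_{2,j}) \right]

G_p(t) =
  \begin{cases}
    \alpha^2 t\, e^{-\alpha t}, & t \ge 0 \\
    0, & t < 0
  \end{cases}
\qquad
G_t(t) =
  \begin{cases}
    \min\left[ 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \right], & t \ge 0 \\
    0, & t < 0
  \end{cases}
```

Because $G_p(t)$ decays toward zero for large $t$, a phrase component that runs for too long lets the contour sink toward the baseline $F_b$. This is the flattening effect that more frequent phrase commands are meant to avoid.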

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 16 of 34

Efficient Speech Synthesis System using the Deterministic plus Stochastic Model

AUTHOR(S):
Erro, Daniel; Technical University of Catalonia
Moreno, Asunción; Technical University of Catalonia

Abstract:
In this paper, a high-quality concatenative synthesis system using the deterministic plus stochastic model of speech is described, in which the prosodic modifications are performed by means of very simple and efficient operations, as we reported in a previous work. In particular, pitch-synchrony is not necessary, and linear interpolation replaces other types of estimation. The method for the concatenation of units has been improved in order to avoid waveform and spectral mismatches.

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 17 of 34

Towards an Automatic Foreign Accent Reduction Tool

AUTHOR(S):
Cho, Kwansun; University of Florida
Harris, John G.; University of Florida

Abstract:
An automatic tool to reduce foreign accent is described and evaluated. An unaccented speech utterance was used to improve three prosodic features of a corresponding foreign-accented utterance. The duration, pitch and intensity of the foreign-accented speech utterance were modified using DTW (Dynamic Time Warping), WSOLA (Waveform Similarity Overlap Add), and other automatic speech processing algorithms. The modified speech utterance was then evaluated to determine the perceived foreign accent compared to the original. Fifteen native speakers of American English took part in the perceptual test to rate the degree of foreign accent in Korean-accented American English. The results show that the modified Korean-accented utterances were perceived to have a lower degree of foreign accent than the original Korean-accented utterances.
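The alignment step can be illustrated with a minimal sketch of classic dynamic time warping. The abstract names DTW but does not say which features or local cost were used, so the one-dimensional inputs, the absolute-difference cost, and the function name `dtw_path` are assumptions made for illustration.

```python
def dtw_path(a, b):
    """Align two 1-D feature sequences with classic dynamic time warping.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    that map frames of `a` onto frames of `b`.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path

# Toy sequences: the second one lingers on the middle value.
total, path = dtw_path([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

In a duration-modification setting, a path of this kind indicates which frames of one utterance correspond to which frames of the other, so that a time-scaling method such as WSOLA can stretch or compress the matching regions.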

 
Poster Session 5: Speech Technology - Part I: Speech Synthesis 18 of 34

Efficient Technique for Quantization of Pitch Contours

AUTHOR(S):
Nurminen, Jani; Nokia Research Center
Himanen, Sakari; Nokia Research Center
Rämö, Anssi; Nokia Research Center

Abstract:
This paper introduces an efficient technique for pitch contour quantization designed mainly for applications that require storage of speech or prosodic information at a high compression ratio. Instead of quantizing the estimated pitch values directly, the proposed technique forms and quantizes a simplified model of the pitch contour. The simplified contour is constructed in such a manner that the amount of information needed for describing it is minimized. At the same time, the deviation from the original contour is maintained below a predetermined limit. In addition to the high compression ratio, the contour representation offers benefits in pitch-synchronous decoding. The proposed technique is implemented and evaluated in a practical storage speech coder. According to the evaluation, the performance of the quantization technique is very promising as it achieves perceptually satisfactory quality at an average bit rate of about 100 bits per second.
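The idea of replacing the pitch track by a simplified contour whose deviation stays below a fixed limit can be sketched as follows. The abstract does not say what form the simplified model takes, so the piecewise-linear representation, the greedy search, and the names `simplify_contour` and `max_dev` are assumptions for illustration, not the authors' method.

```python
def _max_deviation(pitch, start, end):
    """Largest gap between `pitch` and the straight line from start to end."""
    slope = (pitch[end] - pitch[start]) / (end - start)
    return max(abs(pitch[start] + slope * (k - start) - pitch[k])
               for k in range(start, end + 1))

def simplify_contour(pitch, max_dev):
    """Greedy piecewise-linear simplification of a pitch contour.

    Returns breakpoint indices such that linear interpolation between
    consecutive breakpoints deviates from `pitch` by at most `max_dev`.
    The search stops extending a segment at the first failure, so it is
    a heuristic and not guaranteed to use the fewest breakpoints.
    """
    if not pitch:
        return []
    breakpoints = [0]
    start = 0
    last = len(pitch) - 1
    while start < last:
        end = start + 1  # two adjacent samples are always matched exactly
        for candidate in range(start + 2, last + 1):
            if _max_deviation(pitch, start, candidate) <= max_dev:
                end = candidate
            else:
                break
        breakpoints.append(end)
        start = end
    return breakpoints

# Toy contour (Hz): a linear rise followed by a linear fall.
points = simplify_contour([100, 102, 104, 106, 103, 100, 97], max_dev=0.5)
```

Here the seven pitch values are represented by three breakpoints (the start, the peak, and the end). Only the breakpoints would then need to be quantized and stored.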

Poster Session 5 (PS 5): Speech Technology - Part II: Speech Recognition and Understanding
Thursday, May 4, 14:30 - 16:00
Chairs: Keikichi Hirose / Plinio Barbosa
 
Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding 19 of 34

F0 Characteristics of Yes-No Question Intonation in Arabic and English: Disambiguation Techniques for Use in ASR

AUTHOR(S):
Barrett, Leslie; EDGAR Online
Hata, Kazue; Santa Barbara

Abstract:
This paper presents preliminary research into the possibility of using F0 information to enhance the performance of speech-to-speech translation engines and speech recognition software for Arabic and English. Specifically, we aim to find factors that differentiate yes-no questions in both languages from other sentence types. Although previous research using cross-linguistic question data has shown F0 rise to be the main indicator of yes-no questions, the particular F0 characteristics used by listeners as perceptual cues varied. Using comparative language data, the aim of this study was to find reliable question indicators that could be detected by automated means. In an experiment with short sentences read by a native speaker of each language, we examined aspects of F0 contours in the two languages to find reliable recognition thresholds. Results indicate that reliable indicators of yes-no questions do exist for both languages and occur within the sentence-final 50 centiseconds.

 
Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding 20 of 34

Dependency Analysis of Spontaneous Monologue Speech Using Pause and F0 Information: A Preliminary Study

AUTHOR(S):
Takagi, Kazuyuki; The University of Electro-Communications
Ozeki, Kazuhiko; The University of Electro-Communications

Abstract:
This paper deals with the problem of exploiting prosodic information in syntactic analysis of spontaneous monologue utterances of non-professional speakers. Duration of pauses at phrase boundaries and relative F0 contour features, which improve parsing accuracy of read sentences, were also found to be effective for parsing spontaneous speech. Dependency analysis was performed by the minimum penalty parser on academic presentation speech recorded in the Corpus of Spontaneous Japanese, a large-scale database of spontaneous Japanese with rich linguistic annotations. Preliminary experiments on relatively clean parts of the monologue data showed that the pause and F0 features are effective in improving the accuracy of dependency analysis of spontaneous utterances, and that combined use of both features gives further improvement. Although this is a preliminary study, the results are promising.

 
Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding 21 of 34

Prosodic effects in parsing early vs. late closure sentences by second language learners and native speakers

AUTHOR(S):
Hwang, Hyekyung; University of Hawaii
Schafer, Amy J.; University of Hawaii

Abstract:
The Informative Boundary Hypothesis (IBH: [4]) claims that a prosodic boundary is interpreted relative to preceding boundaries. This study tests predictions of the IBH with Korean learners of English and English native speakers in a prosody experiment on the resolution of an Early vs. Late Closure ambiguity in spoken English sentences. A control experiment assessed and controlled for English morpho-syntactic knowledge in the main experiment. The main experiment presented the syntactically ambiguous portion of sentences in a forced-choice continuation-selection task. The results showed that 1) Korean L2ers at all levels used relative boundary size to disambiguate sentences, like L1ers; 2) intonation phrase boundaries provided stronger evidence for syntactic boundaries than intermediate phrase boundaries, especially for the L2ers; and 3) the IBH's 3-way categorization of relative boundary size - larger/same-size/smaller - appears insufficient for this syntactic structure.

 
Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding 22 of 34

Speech Recognition Only with Supra-segmental Features - Hearing Speech as Music

AUTHOR(S):
Minematsu, Nobuaki; Graduate School of Frontier Sciences, The University of Tokyo
Nishimura, Tazuko; Graduate School of Medicine, The University of Tokyo
Murakami, Takao; Graduate School of Information Science and Technology, The University of Tokyo
Hirose, Keikichi; Graduate School of Information Science and Technology, The University of Tokyo

Abstract:
This paper proposes a novel paradigm of speech recognition where only the suprasegmental features are used. Absolute properties of speech events such as formants and spectra are completely discarded, and only the relative and differential properties of the events are extracted as phonic contrasts. They are considered as suprasegmental features and are mathematically shown not to carry non-linguistic features such as speaker, age, gender, etc. This suggests that speaker-independent speech recognition should be possible with reference models built from a single speaker's speech. Experiments on vowel sequence recognition show that this expectation is correct and that the performance of the new paradigm is better than that of the conventional paradigm using more than four thousand speakers. Hearing sounds by capturing only their contrasts is common when listening to music, indicating that the proposed paradigm hears speech as music.

 
Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding 23 of 34

Employing Intonational Events Parameterization for Emotion Recognition

AUTHOR(S):
Zervas, Panagiotis; Patras University
Mporas, Iosif; Patras University
Fakotakis, Nikolaos; Patras University

Abstract:
Fujisaki's modeling of the pitch contour is considered in this article for the task of emotion recognition from speech signals. To evaluate the resulting attributes, we utilized a decision tree inducer as well as the instance-based learning algorithm. The datasets used for training the classification models were extracted from two emotional speech databases. Results showed that knowledge extracted from Fujisaki's parameters benefited all prediction models: an average increase of 9.52 % in the total accuracy of all models was attained.

 
Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding 24 of 34

Unsupervised Learning of Tone and Pitch Accent

AUTHOR(S):
Levow, Gina-Anne; University of Chicago

Abstract:
Recognition of tone and intonation is essential for speech recognition and language understanding. However, most approaches to this recognition task have relied upon extensive collections of manually tagged data obtained at substantial time and financial cost. In this paper, we explore unsupervised clustering approaches to recognize pitch accent in English and tones in Mandarin Chinese. In unsupervised Mandarin tone clustering experiments, we achieve 57-87 % accuracy on materials ranging from broadcast news to clean lab speech. For English pitch accent in broadcast news materials, results reach 78 %. These results indicate that the intrinsic structure of tone and pitch accent acoustics can be exploited to reduce the need for costly labeled training data for tone learning and recognition.

 
Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding 25 of 34

Classification of Statement and Question Intonations in Mandarin

AUTHOR(S):
Liu, Fang; The University of Chicago
Surendran, Dinoj; The University of Chicago
Xu, Yi; University College London & Haskins Laboratories

Abstract:
Conflicting reports abound in the literature regarding the critical characteristics of statement and question intonations in Mandarin. In this paper, decision trees with three different sets of feature vectors are implemented to determine the most significant elements in an utterance that signify its sentence type (statement vs. question). For 10-syllable utterances, the highest correct classification rate (85 %) is achieved when normalized (to remove the effects of speaker, tone, and focus) final F0's of the 7th and the last syllables are included in the tree construction. This performance is close to previously reported human performance (89 %) for the same testing set. The results confirm the previous finding that the difference between statement and question intonations in Mandarin is manifested by an increasing departure from a common starting point toward the end of the sentence.

 
Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding 26 of 34

Perceptual Optimization of the Chinese Accent-Index Detector

AUTHOR(S):
Zhu, Weibin; Institute of Information Science, Beijing Jiaotong University

Abstract:
For a TTS system, building an AI-supported prosody module with a data-driven method is practicable only if a large corpus annotated with AI (Accent Index) is available. An approach has been proposed to label Chinese AI automatically. Although preliminary experiments showed the effectiveness and efficiency of the approach, certain issues remain unsolved: the evaluation and the optimization of the AI detector. A small sub-corpus has been labeled with AI manually to serve as a reference for evaluating performance. A measure CC (Correlative-Coefficient), computed between the auto-detected and the manually annotated AI sets, is proposed as the criterion for optimizing the detector. Thanks to the use of CC, not only has the detector been refined and optimized, but the auto-detected AI has also been given a subjective prosodic meaning.
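A correlation-based criterion of this kind can be sketched as below. The abstract does not define CC precisely, so the sketch assumes the ordinary Pearson correlation coefficient. The function name `pearson_cc` and the toy accent-index values are hypothetical.

```python
import math

def pearson_cc(auto, manual):
    """Pearson correlation between auto-detected and manual accent indices."""
    n = len(auto)
    if n != len(manual) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_a = sum(auto) / n
    mean_m = sum(manual) / n
    cov = sum((a - mean_a) * (m - mean_m) for a, m in zip(auto, manual))
    var_a = sum((a - mean_a) ** 2 for a in auto)
    var_m = sum((m - mean_m) ** 2 for m in manual)
    if var_a == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_m)

# Toy values: the automatic scores are a scaled copy of the manual ones.
cc_same = pearson_cc([1, 2, 3, 4], [2, 4, 6, 8])
cc_opposite = pearson_cc([1, 2, 3, 4], [8, 6, 4, 2])
```

Under this assumption, detector settings would be tuned to push the coefficient between the automatic and manual labels on the reference sub-corpus as close to 1 as possible.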

Poster Session 5 (PS 5): Speech Technology - Part III: Annotation and Speech Corpus Creation
Thursday, May 4, 14:30 - 16:00
Chairs: Keikichi Hirose / Plinio Barbosa
 
Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation 27 of 34

The Prosodizer - Automatic Prosodic Annotation of Speech Synthesis Databases

AUTHOR(S):
Braunschweiler, Norbert; Speech Technology Group, CRL Toshiba Europe Ltd.

Abstract:
Prosodic annotations are used for locating and characterizing prominent parts in utterances as well as identifying and describing boundaries of coherent stretches of speech. In speech synthesis prosodic annotations can be used to improve the unit selection process and subsequently yield more natural sounding synthesis. A method for automatic prosodic annotations of speech is described in this paper. This method is implemented in a computer program called Prosodizer that integrates acoustic features of F0 and RMS as well as syntactic and segmental information like POS tags and syllable boundaries. Design and preliminary performance results are described.

 
Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation 28 of 34

Automatic Accent Annotation with Limited Manually Labeled Data

AUTHOR(S):
Chen, YiNing; Microsoft Research Asia
Lai, Min; Department of Electronic Engineering & Information Science, University of Science & Technology of China
Chu, Min; Microsoft Research Asia
Soong, Frank K.; Microsoft Research Asia
Zhao, Yong; Microsoft Research Asia
Hu, Fangyu; Department of Electronic Engineering & Information Science, University of Science & Technology of China

Abstract:
In this paper we investigate an automatic accent labeling procedure that uses classifiers trained on limited manually labeled data. Different methods are proposed and compared in a multi-classifier framework, including a linguistic classifier, an acoustic classifier and a combined one. The linguistic classifier is first used to label POS-determined content words as accented and function words as unaccented. The corresponding labels are then used to train accented and unaccented vowel HMMs separately. The combined classifier is then used to combine the outputs of the linguistic and acoustic classifiers to minimize labeling errors. The performance can be further improved when the acoustic classifier is re-trained with the whole corpus re-labeled by the combined classifier. The final accent labeling accuracy is improved to 94.0 %. Compared with 97.2 %, the self-agreement ratio of a well-trained human annotator, this accuracy is fairly satisfactory.

 
Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation 29 of 34

Prosodic boundaries in spontaneous Russian: perceptual annotation and automatic classification

AUTHOR(S):
Nesterenko, Irina; Laboratoire Parole et Langage

Abstract:
Perceptual experiments with French- and Russian-speaking subjects were used to locate intonation phrase boundaries under different experimental conditions. Once inter-listener agreement had been evaluated, we built an automatic predictor based on human boundary/no-boundary judgments and then evaluated how well the predictor behaves. This predictor operates on acoustic features, and we looked for an optimal combination of features to mimic the perceptual experiment results.

 
Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation 30 of 34

Semi-Automatic Prosodic Transcription of Spoken Spanish in XML

AUTHOR(S):
Velázquez, Eduardo; Freie Universität Berlin

Abstract:
XML (Extensible Mark-up Language) is designed to represent hierarchical structures; in this case, it shows the structure of the prosodic components of spoken language. The XML-based transcription system proposed here allows the input of 1) the phonetic parameters of F0, intensity and duration of each syllable, their relative variation and standard values to facilitate discrimination and comparison; 2) the distribution of feet; 3) the boundaries and characterization of intonation units and utterances, and 4) other conversational phenomena such as pauses, overlaps, interruptions, etc. This mark-up language is currently being used as an analysis tool for a corpus of digitally-recorded conversations in the Mexican and Iberian vernaculars of spoken Spanish.

 
Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation 31 of 34

MalToBI - Building an Annotated Corpus of Spoken Maltese

AUTHOR(S):
Vella, Alexandra; University of Malta
Farrugia, Paulseph-John; University of Malta

Abstract:
Research on the phonetics and phonology, particularly the prosody, of Maltese is limited. This is partly due to the lack of structured resources, such as a corpus of spoken Maltese, for use in research. Such a corpus, especially one including some element of prosodic annotation, could be a useful tool for further research on the prosodic structure, amongst other aspects, of Maltese. It could also be important for continuing development of Text-to-Speech resources in the local context. Recognition of the necessity for such a corpus gave rise to MalToBI, a project involving the collection of a relatively small body of spoken Maltese, together with the development of a Tone and Break Indices (ToBI) framework adapted for use with Maltese. This paper outlines some aspects of Maltese prosody, describes the development and design considerations involved in building this corpus, and reports on the progress made so far as well as intentions for future work.

 
Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation 32 of 34

Shape Display: Task Design and Corpus Collection

AUTHOR(S):
Fon, Janice; Graduate Institute of Linguistics, National Taiwan University

Abstract:
This study introduces a new paradigm for spontaneous dialog elicitation and a small multilingual corpus collected using this paradigm. Pairs of subjects were seated in separate booths and were each given a felt-covered board and a bag of assorted felt pieces of various shapes and colors. The goal was to make the layout of the felt pieces the same on the two boards with the fewest moves. In order to test how accommodating the paradigm is to cross-linguistic/cross-cultural experimental designs, 32 subjects speaking three different languages, English, Mandarin (Guoyu and Putonghua), and Japanese, participated in the study. Subjects found the paradigm entertaining and engaged in the game without paying much conscious attention to their linguistic performance. The elicited dialogs were spontaneous enough to allow further phonetic and discourse research.

 
Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation 33 of 34

Optimization of MFNs for Signal-based Phrase Break Prediction

AUTHOR(S):
Hofmann, Michael; Dresden University of Technology
Jokisch, Oliver; Dresden University of Technology

Abstract:
The automatic prosodic annotation of large speech corpora is receiving increasing attention, since appropriate databases are needed for training prosodic models in speech synthesis and recognition. On the linguistic level, correct phrase and accent marking are essential processing steps. The authors developed a neural-network-based method for signal-based phrase break prediction and tested this method across two different speech databases. The structure of the multilayer feed-forward neural network (MFN) was optimized and adapted to the target database and to the specific annotation task. The method is rather data sensitive: it depends on the human labelers and on small differences across training databases, such as the frequency of occurrence or strength of phrase breaks. The MFN method can be easily adapted to the characteristics of different databases (long or short phrases, special formats like dates or web addresses, etc.). When applied to different databases that contain phrase markers set by human experts, phrase break recognition rates vary from 79% to 97%.

 
Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation 34 of 34

Automatic Construction of a Prosodically Rich Text Corpus for Speech Synthesis Systems

AUTHOR(S):
Lambert, Tanya

Abstract:
This paper presents a method for the automatic compilation of a phonologically rich text database, which is used in a concatenative text-to-speech (TTS) synthesis system. In the method described here, linguistic features are predicted from text using Festival's linguistic engine. A set of phonological units for a specific text is compiled from AVLs. The set of phonological units is used in set cover algorithms in conjunction with the corresponding rich transcription of text in order to generate a compact and phonologically rich text corpus. This is an efficient way of generating database prompts with a specific prosodic content; the prompts can then be recorded and converted into a voice. The method described here can be used for languages other than English.
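
The set cover step mentioned above can be illustrated with a greedy heuristic. This is only a minimal sketch, not the author's implementation: the function name `greedy_cover`, the sentence identifiers, and the unit labels are all hypothetical, and the abstract does not say which set cover variant was used.

```python
def greedy_cover(sentences, target_units):
    """Pick sentences until every coverable target unit is covered.

    sentences: dict mapping a sentence id to the set of units it contains.
    Returns the chosen sentence ids and the units that no sentence covers.
    """
    uncovered = set(target_units)
    chosen = []
    while uncovered:
        # Choose the sentence that covers the most still-uncovered units.
        best = max(sentences, key=lambda s: len(uncovered & sentences[s]))
        gain = uncovered & sentences[best]
        if not gain:
            break  # the remaining units occur in no sentence
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Hypothetical toy data: sentence id -> phonological units found in it.
toy = {
    "s1": {"a-b", "b-c", "c-d"},
    "s2": {"a-b", "d-e"},
    "s3": {"d-e", "e-f"},
    "s4": {"b-c"},
}
prompts, missing = greedy_cover(toy, {"a-b", "b-c", "c-d", "d-e", "e-f", "x-y"})
```

Greedy selection does not guarantee the smallest possible prompt set, but it is a common choice because exact set cover is computationally hard.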

 
Poster Session 6 (PS 6): Prosody and Affect
Thursday, May 4, 14:30 - 16:00
Chair: Jürgen Trouvain
 
Poster Session 6: Prosody and Affect 1 of 13

Optical Cues to the Visual Perception of Lexical and Phrasal Stress in English

AUTHOR(S):
Scarborough, Rebecca; Stanford University, USA
Keating, Patricia; University of California, Los Angeles, USA
Baroni, Marco; University of Bologna, Italy
Cho, Taehong; Hanyang University, Korea
Mattys, Sven; University of Bristol, England
Alwan, Abeer; University of California, Los Angeles, USA
Auer, Edward; University of Kansas, USA
Bernstein, Lynne; House Ear Institute

Abstract:
In a study of optical cues to the visual perception of stress, three American English talkers spoke words that differed in lexical stress and sentences that differed in phrasal stress, while video and movements of the face were recorded. In a production analysis, stressed vs. unstressed syllables from these utterances were compared along many measures of facial movement, which were generally larger and faster under stress. In a visual perception experiment, 16 perceivers identified the location of stress in forced-choice judgments of video clips of these utterances (without audio). Relative to chance, phrasal stress (54 % correct vs. 25 % chance) was better perceived than lexical stress (62 % correct vs. 50 % chance). The relation of the visual intelligibility of the prosody of these utterances to the optical characteristics of their production is discussed, with analysis of which cues are associated with successful visual perception.
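
Because the two tasks have different chance levels, the raw percentages are not directly comparable. One common way to put them on the same scale is to rescale each accuracy so that chance maps to 0 and perfect performance to 1. The sketch below is an illustration only; the abstract does not state that the authors used this normalisation, and the function name `above_chance` is hypothetical.

```python
def above_chance(accuracy, chance):
    """Rescale accuracy so that chance maps to 0 and perfect performance to 1."""
    return (accuracy - chance) / (1.0 - chance)


# Accuracies and chance levels as reported in the abstract.
phrasal = above_chance(0.54, 0.25)
lexical = above_chance(0.62, 0.50)
```

On this scale the phrasal-stress score comes out higher than the lexical-stress score, in line with the abstract's statement.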

 
Poster Session 6: Prosody and Affect 2 of 13

Some gender and cultural differences in perception of affective expressions

AUTHOR(S):
Erickson, Donna; Gifu City Women's College

Abstract:
This study investigates whether people can understand vocal affective expression in a language that is not their native language, as well as whether there is a difference in the way males and females understand vocal affective expressions. We investigated the affectively-neutral Japanese word /banana/ as uttered with five different affective expressions: angry, sad, surprised, suspicious, and happy. The listeners were 20 American listeners, 9 Korean listeners, and 20 Japanese listeners, who were asked to indicate which affect they heard. The results showed that the perception of affect differed according to the native language as well as to the gender of the listener.

 
Poster Session 6: Prosody and Affect 3 of 13

Signalling affect in Mandarin Chinese - the role of non-lexical utterance-final edge tones

AUTHOR(S):
Mueller-Liu, Patricia; Institute of Phonetics, Saarland University

Abstract:
Of the five pitch-phenomena contained in Y.R. Chao's framework of Mandarin Chinese intonation, the phenomenon termed 'successive tonal addition' has proved highly elusive. Using communicatively-based spontaneous speech samples, the first instrumental evidence of successive tonal addition is presented here, found to consist of non-lexical pitch-movements added to the lexical tones of utterance-final syllables. Investigation into the functions of these phenomena, referred to as 'edge tones', showed these to be affective in nature, signalling emotio-attitudinal messages.

 
Poster Session 6: Prosody and Affect 4 of 13

Paralinguistic Effects on Voice Quality: A Study in Japanese

AUTHOR(S):
Menezes, Caroline; National Institute for Japanese Language
Maekawa, Kikuo; National Institute for Japanese Language

Abstract:
This study analyzes two spectral properties in vowel segments, H1-H2 (related to glottal opening) and H1-A3 (related to the speed of the vocal fold closing gesture), in an attempt to infer the voice quality variation associated with different types of paralinguistic information (PI). Results suggest that both glottal opening and closing speed of the glottis differ significantly depending on PI. However, for some PI types there were also significant syllable effects. The correlation between pitch (F0) and these two voice parameters was very low, leading to the conclusion that pitch differences alone cannot account for the observed voice quality variation. Significant differences were also noted for the power of the speech waveform (RMS) according to PI. Inter-speaker variation was noted especially for 'suspicion'.

 
Poster Session 6: Prosody and Affect 5 of 13

Neutral Speech Corpora - a test for neutrality

AUTHOR(S):
Matte, Ana; UFMG

Abstract:
What is neutral speech? This paper reports the results of research in the phonostylistics of Brazilian Portuguese, with the objective of determining the experimental conditions necessary for recording so-called neutral speech. The experiment was designed to test two hypotheses: 1) The phrase, the minimal prosodic unit, is also the minimal unit of meaning in studies of the expression of emotion in speech, even when our focus is on the production of complete texts that should be taken as a single unit of meaning. 2) The speaker's reported self-impressions can indicate certain sentences that have been affected by the reactions of the speaker, which conflict with the objective of recording neutral speech, and therefore should be rejected from a corpus of referential speech. The results validated both hypotheses and enabled us to formulate a single test for neutral speech, recommended for purging referential corpora in experimental phonology.

 
Poster Session 6: Prosody and Affect 6 of 13

Emotion Recognition Using IG-based Feature Compensation and Continuous Support Vector Machines

AUTHOR(S):
Wu, Chung-Hsien; Department of Computer Science and Information Engineering
Chuang, Ze-Jing; Department of Computer Science and Information Engineering

Abstract:
This paper presents an approach to feature compensation for emotion recognition from speech signals. In this approach, the intonation groups (IGs) of the input speech signals are first extracted. The speech features in each selected intonation group are then extracted. Assuming a linear mapping between feature spaces in different emotional states, a feature compensation approach is proposed to characterize the feature space with better discriminability among emotional states. The compensation vector with respect to each emotional state is estimated using the Minimum Classification Error (MCE) algorithm. For the final emotional state decision, the compensated IG-based feature vectors are used to train a Continuous Support Vector Machine (CSVM) for each emotional state. The CSVM kernel function was chosen experimentally to be the radial basis function, and the experimental results show that the proposed approach can obtain encouraging performance for emotion recognition.

 
Poster Session 6: Prosody and Affect 7 of 13

Emotion Recognition in the Noise Applying Large Acoustic Feature Sets

AUTHOR(S):
Schuller, Bjoern; Technische Universitaet Muenchen
Arsic, Dejan; Technische Universitaet Muenchen
Wallhoff, Frank; Technische Universitaet Muenchen
Rigoll, Gerhard; Technische Universitaet Muenchen

Abstract:
Speech emotion recognition is mostly considered under ideal acoustic conditions: acted and elicited samples in studio quality are used, apart from a few works on spontaneous field data. However, although noise is an important factor in speech processing, its influence has hardly been analyzed in this context so far. We therefore discuss affect estimation under noise conditions. On 3 well-known public databases - DES, EMO-DB, and SUSAS - the effects of post-recording noise addition at diverse dB levels, and performance under noise conditions during signal capturing, are shown. To cope with this new challenge we extend the generation of functionals by extracting a large 4k high-level feature set from more than 60 partially novel base contours. These comprise, among others, intonation, intensity, formants, HNR, MFCC, and VOC19. Fast Information-Gain-Ratio filter-selection picks attributes according to noise conditions. Results are presented using Support Vector Machines as classifier.
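
As an illustration of filter-style selection by information gain ratio, the sketch below scores discrete attributes against class labels and ranks them. It assumes the attributes have already been discretised; the function names, the toy attributes, and the labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of a discrete feature divided by its split information."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: two binned attributes and emotion labels.
labels = ["anger", "anger", "neutral", "neutral"]
attributes = {
    "pitch_range_bin": ["hi", "hi", "lo", "lo"],  # separates the classes
    "noise_level_bin": ["hi", "lo", "hi", "lo"],  # unrelated to the classes
}
ranking = sorted(attributes, key=lambda a: gain_ratio(attributes[a], labels), reverse=True)
```

A filter of this kind scores each attribute independently of the classifier, which keeps it fast for large feature sets.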

 
Poster Session 6: Prosody and Affect 8 of 13

Speech Rates in French Expressive Speech

AUTHOR(S):
Beller, Grégory; IRCAM
Hueber, Thomas; IRCAM
Schwarz, Diemo; IRCAM
Rodet, Xavier; IRCAM

Abstract:
Expressive speech is a useful tool in cinema, theater and contemporary music. In this paper we present a study on the influence of expressivity on the speech rates of a French actor. It involves a relational database containing expressive and neutral spoken French. We first describe the analysis, partly based on a unit-selection Text-to-Speech system. The range of data permits a statistical approach to the speech rate. A dynamic description of the French speech rate is offered which demonstrates its evolution in speech. Finally, several results are given concerning pauses and breathing that help to distinguish between anger and happiness.

 
Poster Session 6: Prosody and Affect 9 of 13

Temporal Interaction of Emotional Prosody and Emotional Semantics: Evidence from ERPs

AUTHOR(S):
Paulmann, Silke; Max Planck Institute for Human Cognitive and Brain Sciences
Kotz, Sonja; Max Planck Institute for Human Cognitive and Brain Sciences

Abstract:
Emotional prosody helps us to understand how other people feel. Also, emotions are transferred verbally. In order to further substantiate the underlying mechanisms of emotional prosodic processing we investigated the interaction of both emotional prosody and semantics with event-related brain potentials (ERPs) utilizing a prosodic and interactive (prosodic/semantic) violation paradigm. Results suggest that the time-course of the two channels differs. While a pure emotional prosodic violation elicited a positivity between 450 ms and 600 ms, a violation of both emotional prosody and semantics elicited a negativity between 500 ms and 650 ms. This suggests that emotional prosody and emotional semantics follow a different time-course. This holds true for all six emotional prosodies investigated. Also, the obtained results suggest that emotional prosody and semantics contribute differentially during the interaction of both information types.

 
Poster Session 6: Prosody and Affect 10 of 13

Voiced and Unvoiced Content of fear-type emotions in the SAFE Corpus

AUTHOR(S):
Clavel, Chloé; THALES Research and Technology
Vasilescu, Ioana; ENST-TSI
Richard, Gaël; ENST-TSI
Devillers, Laurence; LIMSI-CNRS

Abstract:
The present research focuses on the development of a fear detection system for surveillance applications based on acoustic cues. The emotional speech material used for this study comes from the previously collected SAFE Database (Situation Analysis in a Fictional and Emotional Database), which consists of audiovisual sequences extracted from movie fictions. We address here the question of a specific detection model based on unvoiced speech. For this purpose a set of features is considered for voiced and unvoiced speech. The salience of each feature is evaluated by computing the Fisher Discriminant Ratio for fear versus neutral discrimination. This study confirms that the voiced content, and the prosodic features in particular, are the most relevant. Finally, the detection system merges information conveyed by both voiced and unvoiced acoustic content to enhance its performance. Fear is recognized with a success rate of 69.5 %.
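
The feature ranking step can be sketched with the usual two-class form of the Fisher Discriminant Ratio for a scalar feature, the squared difference of the class means divided by the sum of the class variances. The feature names and values below are invented for illustration and do not come from the SAFE corpus; whether the authors used exactly this form is an assumption.

```python
from statistics import mean, pvariance


def fisher_ratio(class_a, class_b):
    """Two-class Fisher Discriminant Ratio for one scalar feature."""
    return (mean(class_a) - mean(class_b)) ** 2 / (pvariance(class_a) + pvariance(class_b))


# Hypothetical feature values per class (not real measurements).
fear_f0, neutral_f0 = [260.0, 280.0, 300.0], [180.0, 200.0, 220.0]
fear_zcr, neutral_zcr = [0.10, 0.12, 0.14], [0.09, 0.12, 0.15]

fdr_f0 = fisher_ratio(fear_f0, neutral_f0)
fdr_zcr = fisher_ratio(fear_zcr, neutral_zcr)
```

A larger ratio means the two classes are better separated along that feature, so features can be ranked by this value.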

 
Poster Session 6: Prosody and Affect 11 of 13

Attitudinal Patterns in Brazilian Portuguese Intonation: Analysis and Synthesis

AUTHOR(S):
de Morães, Joao Antônio; Universidade Federal do Rio de Janeiro
Stein, Cirineu Cecote; Universidade Federal do Rio de Janeiro

Abstract:
The main goal of this paper is to investigate the prosodic manifestation of the following attitudinal states: consideration, despair, disappointment, irony, justification, obviousness, and uncertainty. The sentence O Carlos Alberto já sabe. [Carlos Alberto already knows it.] was pronounced by a subject, who tried to convey each of these attitudes. Afterwards, it was presented to 20 panelists, who were asked to identify the original intention of each utterance. The attitudes were, in general, correctly identified. The acoustic analysis revealed that the attitudinal patterns make use of distinct prosodic parameters in their manifestation: some are linked to segmental duration, be it global or localized; in other cases, the decisive prosodic component is the fundamental frequency. Auditory tests using speech resynthesis made it possible to evaluate the relative weight of the prosodic characteristics identified in the analysis.

 
Poster Session 6: Prosody and Affect 12 of 13

Comparing vocal parameters in spontaneous and posed child-directed speech

AUTHOR(S):
Schaeffler, Felix; Department of Philosophy and Linguistics, Umeå
Kempe, Vera; Department of Psychology, Stirling University
Biersack, Sonja; Department of Psychology, Stirling University

Abstract:
Research on the facial expression of emotion distinguishes between correlates of posed vs. spontaneous emotion expression. Similar research in the vocal domain is lacking. In this study, we compare changes in a range of vocal parameters between posed vs. spontaneous adult-directed (AD) and child-directed (CD) speech. CDS is a highly affectively charged speech register which lends itself well to the study of posed vs. spontaneous emotion expression. A group of mothers addressed an adult and their child, and a group of non-mothers addressed an imaginary adult and an imaginary child. The results confirm adjustments in pitch, formants and speech rate typically reported for CDS in both groups. At the same time, they show that source parameters not in service of linguistic function, such as shimmer (perturbations in fundamental period amplitude) and harmonics-to-noise ratio, show clear group effects, suggesting that they may constitute veridical indicators of spontaneous emotion expression.

 
Poster Session 6: Prosody and Affect 13 of 13

How prosodic attitudes can be false friends: Japanese vs. French social affects

AUTHOR(S):
Shochi, Takaaki; ICP
Aubergé, Véronique; ICP
Rilliard, Albert; ICP

Abstract:
The attitudes of the speaker during a verbal interaction are affects linked to the speaker's intentions, and are built by the language and the culture. They are a very large part of the affects expressed during an interaction, and are voluntarily controlled. This paper describes several experiments which show that some attitudes belong to both Japanese and French and are implemented in perceptually similar prosody, but that some Japanese attitudes don't exist in French and/or are wrongly decoded by French listeners. Results are presented for 12 attitudes and three levels of language (naive, beginner, intermediate). It must particularly be noted that French listeners, naive in Japanese, can very well recognize admiration, authority and irritation; that they don't discriminate Japanese question and declaration before the intermediate level; and that the extreme Japanese politeness is interpreted as impoliteness by French listeners, even when they speak Japanese at a good level.

 
Oral Session 5 (OS 5): Prosody in Pathology and Ageing
Thursday, May 4, 16:00 - 17:40
Chair: Joan Ma
Oral Session 5: Prosody in Pathology and Ageing 1 of 5

Ageing and Speech Prosody

AUTHOR(S):
Zellner Keller, Brigitte; Institut de Psychologie, UNIL

Abstract:
Ageing is part of the normal evolution of human beings. Demographic projections to 2030 indicate that more than 60 countries will have at least 2 million people aged 65 or older. Yet knowledge about speech in the elderly is still dispersed and incomplete, in particular in the area of normal ageing. Prosody within a linguistic community is triggered by a number of parameters which are investigated (see this conference). Yet, little is currently known about the longitudinal evolution of this speech component. This paper is a first state-of-the-art review of speech prosody and ageing, with the hope that more researchers in speech sciences will investigate this domain.

 
Oral Session 5: Prosody in Pathology and Ageing 2 of 5

Evaluation of Tracheoesophageal Substitute Voices Using Prosodic Features

AUTHOR(S):
Haderlein, Tino; Universität Erlangen-Nürnberg
Nöth, Elmar; Universität Erlangen-Nürnberg
Schuster, Maria; Universität Erlangen-Nürnberg
Eysholdt, Ulrich; Universität Erlangen-Nürnberg
Rosanowski, Frank; Universität Erlangen-Nürnberg

Abstract:
Tracheoesophageal (TE) speech is one way to restore the ability to speak after laryngectomy, i.e. after the removal of the larynx. TE speech often shows low audibility and intelligibility, which makes it a challenge for the patients to communicate. In speech rehabilitation the patient's voice quality has to be evaluated. As no objective means of classification exists so far and automation of this procedure is desirable, we performed initial experiments on automatic evaluation using prosodic features. Our reference was the set of scores given by five experienced raters on several evaluation criteria for TE speech. Correlation coefficients of up to 0.84 between human and automatic ratings are promising for future work.
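
Agreement between human and automatic ratings of this kind is typically summarised with a correlation coefficient. The sketch below computes Pearson's r, which is an assumption, since the abstract does not say which coefficient was used; the rating values are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical scores: mean human rating vs. automatic score per speaker.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
automatic = [1.5, 1.8, 3.2, 3.9, 5.1]
r = pearson(human, automatic)
```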

 
Oral Session 5: Prosody in Pathology and Ageing 3 of 5

Functionality and perceived atypicality of expressive prosody in children with autism spectrum disorders

AUTHOR(S):
Peppe, Sue; Queen Margaret University College, Edinburgh
Martinez Castilla, Pastora; Universidad Autonoma Madrid
Lickley, Robin; Queen Margaret University College, Edinburgh
Mennen, Ineke; Queen Margaret University College, Edinburgh
McCann, Joanne; Queen Margaret University College, Edinburgh
O'Hare, Anne; University of Edinburgh
Rutherford, Marion; Royal Hospital for Sick Children, Edinburgh

Abstract:
People with autism are perceived to have 'odd' prosody, but is it malfunctioning? A new prosody test assesses the functionality of prosody in four aspects of speech (phrasing, affect, turn-end and focus) by tasks that elicit utterances in which prosody alone conveys the meaning. The test was used with 100 typically-developing children (TD), 39 with Asperger's syndrome (AspS) and 31 with high-functioning autism (HFA). In results, HFA < TD on all six tasks, HFA < AspS on four, and AspS < TD on one. In perception experiments, judges rated the atypicality of the prosody in samples of conversation from participants in each of the three groups. Correlation between the judges' ratings was high, and ANOVAs showed differences between groups similar to those found in the test results. The ratings correlated significantly (mainly at the 0.01 level) with the test's output scores. The findings support the ecological validity of the test for use as a clinical assessment tool.

 
Oral Session 5: Prosody in Pathology and Ageing 4 of 5

Dysprosody in Parkinson's disease: Musical scale production and intonation patterns analysis

AUTHOR(S):
Rigaldie, Karine; Laboratoire Jacques Lordat
Nespoulous, Jean-Luc; Laboratoire Jacques Lordat
Vigouroux, Nadine; IRIT

Abstract:
This article aims to acquire a better knowledge of prosody disturbances in Parkinson's disease via an acoustic analysis. Our aim is twofold: firstly, to identify phonetic and prosodic parameters that are specific to this pathology; secondly, to study the effect of a pharmacological treatment (based on dopamine) on these patients' speech production. In order to determine the effect of dopamine, oral productions of 8 parkinsonian patients were collected, in the OFF and ON states, and were then compared to those of control subjects. The specific aim of this study is (a) to examine the ability of patients to handle the variations in fundamental frequency of their voice as well as to master the rise in frequency required by the task (i.e. production of the musical scale and intonation patterns) and (b) to measure the palliative effects that can be induced, at least partly, in the management of frequency by a treatment based on L-Dopa.

 
Oral Session 5: Prosody in Pathology and Ageing 5 of 5

Consonant and Vowel Duration in Parkinsonian Speech

AUTHOR(S):
Duez, Danielle; CNRS

Abstract:
The current study compared consonant and vowel duration in speech read by 10 French Parkinsonian speakers and 10 control speakers. The results show a different impact of Parkinson's disease (PD) on speech segments. Consonants were shortened in PD speech while vowels were significantly longer. This results from the concomitance of articulatory movements of reduced amplitude and orofacial bradykinesia. As a consequence, syllabic productions are of the same duration in PD speech as in normal speech. The durational contrast of consonants was maintained; for vowels there was less agreement with the normal pattern of intrinsic duration, especially for high vowels.

 
Special Session 4 (SPS 4): Prosody in Automatic Speech Recognition
Organizer: Sin-Horng Chen
Friday, May 5, 09:00 - 11:00
Special Session 4: Prosody in Automatic Speech Recognition 1 of 6

Recognizing Mandarin Chinese Fluent Speech Using Prosody Information - An Initial Investigation

AUTHOR(S):
Tseng, Chiu-yu; Institute of Linguistics, Academia Sinica

Abstract:
By applying our hierarchical prosody framework for fluent speech that specifies boundary breaks and boundary information, we were able to recognize speech paragraphs and various levels of prosodic units within each such paragraph. These recognized prosodic units are not unrelated speech units but rather sister constituents that entail higher-up syntactic as well as semantic relationships that cumulatively make up fluent continuous speech. Note how this top-down approach differs from most bottom-up approaches. The former offers information from higher-up linguistic association whereas the latter treats identified Chinese syllables as discrete unrelated units or lexical words at most, leaving structural information unaddressed. We believe using top-down prosody information may very well break new ground in fluent speech recognition.

 
Special Session 4: Prosody in Automatic Speech Recognition 2 of 6

Detection of Fillers Using Prosodic Features in Spontaneous Speech Recognition of Japanese

AUTHOR(S):
Hirose, Keikichi; The University of Tokyo
Abe, Yu; The University of Tokyo
Minematsu, Nobuaki; The University of Tokyo

Abstract:
A new scheme for detecting fillers in spontaneous speech recognition was developed. When a filler hypothesis appears during the decoding process, a prosodic module checks the morpheme hypothesized as a filler and outputs the filler likelihood score. When the likelihood score exceeds a threshold, a prosodic score is added to the language score of the hypothesis as a bonus. The prosodic module is constructed using a five-layered perceptron. A comparative recognition experiment with and without the prosodic module was conducted for 100 utterances of spontaneous Japanese speech. Seven fillers originally misrecognized as non-fillers are correctly recognized as fillers when the prosodic module is used. No fillers originally recognized as fillers are wrongly recognized as non-fillers. Although a few non-filler morphemes are misrecognized as other non-filler morphemes after the introduction of the prosodic module, they can be corrected by properly setting the parameters of the recognizer.
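
The thresholded bonus described above can be sketched as follows. The function name, the default threshold, and the weighting of the bonus are hypothetical; the abstract does not specify how the prosodic score is scaled before it is added to the language score.

```python
def rescore(language_score, is_filler_hypothesis, filler_likelihood,
            threshold=0.5, bonus_weight=1.0):
    """Return the language score, plus a prosodic bonus for confident filler hypotheses.

    threshold and bonus_weight are assumed tuning parameters, not values from the paper.
    """
    if is_filler_hypothesis and filler_likelihood > threshold:
        return language_score + bonus_weight * filler_likelihood
    return language_score


# A filler hypothesis with high prosodic likelihood gets a better score.
boosted = rescore(-10.0, True, 0.9)
# Below the threshold, or for non-filler hypotheses, the score is unchanged.
unchanged_low = rescore(-10.0, True, 0.3)
unchanged_word = rescore(-10.0, False, 0.9)
```

Because the bonus is applied only to filler hypotheses, non-filler hypotheses keep their original language scores.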

 
Special Session 4: Prosody in Automatic Speech Recognition 3 of 6

A New Approach of Using Temporal Information in Mandarin Speech Recognition

AUTHOR(S):
Yang, Jyh-Her; National Chiao Tung University
Liao, Yuan-Fu; National Taipei University of Technology
Wang, Yih-Ru; National Chiao Tung University
Chen, Sin-Horng; National Chiao Tung University

Abstract:
In this paper, a new approach to using temporal information to assist in Mandarin speech recognition is discussed. It incorporates two types of temporal information into the recognition search. One is a statistical syllable duration model which considers the influences of 411 base-syllables, 5 tones, 4 position-in-word factors, and 3 position-in-sentence factors on syllable duration. The other is timing information that models three types of inter-syllable boundary: intra-word, inter-word without punctuation mark (PM), and inter-word with PM. The use of these two types of temporal information is expected to improve the segmentation accuracy in both acoustic decoding and linguistic decoding. Experimental results showed that the base-syllable/character/word recognition rates were slightly improved for both the MATBN and Treebank databases.

 
Special Session 4: Prosody in Automatic Speech Recognition 4 of 6

Exploiting Glottal and Prosodic Information for Robust Speaker Verification

AUTHOR(S):
Liao, Yuan-Fu; National Taipei University of Technology, Taiwan
Zeng, Zhi-Ren; National Taipei University of Technology, Taiwan
Chen, Zi-He; National Central University, Taiwan
Juang, Yau-Tarng; National Central University, Taiwan

Abstract:
In this paper, three different levels of speaker cues, namely glottal, prosodic and spectral information, are integrated to build a robust speaker verification system. The major purpose is to resist the distortion of channels and handsets. In particular, the dynamic behavior of the normalized amplitude quotient (NAQ) and of prosodic feature contours is modeled using Gaussian mixture models (GMMs) and two latent prosody analysis (LPA)-based approaches, respectively. The proposed methods are evaluated on the standard one-speaker detection task of the 2001 NIST Speaker Recognition Evaluation Corpus, where only one 2-minute training and 30-second trial speech (on average) are available. Experimental results have shown that the proposed approach could improve the equal error rates (EERs) of maximum a posteriori (MAP)-adapted GMMs and GMMs+T-norm approaches from 12.4 % and 9.5 % to 10.3 % and 8.3 % and finally to 7.8 %, respectively.
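
The equal error rate reported above is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be approximated from verification scores is given below; the threshold sweep is one simple approach, not necessarily the evaluation procedure used in the paper, and the scores are invented.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by trying every observed score as the accept threshold."""
    best_gap, eer = None, None
    for t in sorted(set(target_scores) | set(impostor_scores)):
        # False rejection: a true-speaker trial scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial scoring at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical verification scores (higher means more likely the true speaker).
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, impostors)
```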

 
Special Session 4: Prosody in Automatic Speech Recognition 5 of 6

Affect-Robust Speech Recognition by Dynamic Emotional Adaptation

AUTHOR(S):
Schuller, Bjoern; Technische Universitaet Muenchen
Stadermann, Jan; Technische Universitaet Muenchen
Rigoll, Gerhard; Technische Universitaet Muenchen

Abstract:
Automatic Speech Recognition fails to a certain extent when confronted with highly affective speech. In order to cope with this problem we suggest dynamic adaptation to the actual user emotion. The ASR framework is built on a hybrid ANN/HMM mono-phone 5k bi-gram LM recognizer. On this basis we show adaptation to the affective speaking style. Speech emotion recognition takes place prior to the actual recognition task to choose appropriate models. We therefore focus on fast emotion recognition with little extra feature extraction effort. As databases for proof-of-concept we use a single digit task and sentences from the well-known WSJ corpus. These have been re-recorded in acted neutral and angry speaking styles under ideal acoustic conditions to exclude other influences. The effectiveness of acoustic emotion recognition is also demonstrated on the SUSAS corpus. We finally evaluate the need for adaptation and demonstrate significant superiority of our dynamic approach to static adaptation.

 
Special Session 4: Prosody in Automatic Speech Recognition 6 of 6

Improved Large Vocabulary Mandarin Speech Recognition Using Prosodic Features

AUTHOR(S):
Huang, Jui-Ting; National Taiwan University
Lee, Lin-shan; National Taiwan University

Abstract:
This paper presents a new framework for improved large vocabulary Mandarin speech recognition using prosodic features. The prosodic information is formulated in a probabilistic model compatible with the conventional maximum a posteriori (MAP) framework for large vocabulary speech recognition. A set of prosodic features considering the special characteristics of Mandarin Chinese is developed, and both syllable-level and prosodic-word-level prosodic models are trained with the decision tree algorithm. A two-pass recognition process is used, in which each word arc in the word graph output by the first pass is rescored in the second pass using the two prosodic models. The experiments show reasonable improvements in recognition accuracy. This approach does NOT require a prosodically labeled training corpus and works for the large-scale speaker-independent task.

 
Special Session 5 (SPS 5): Articulatory-Functional Approaches to Speech Prosody
Organizer: Yi Xu
Friday, May 5, 11:20 - 13:20
Special Session 5: Articulatory-Functional Approaches to Speech Prosody 1 of 4

The Roles of Physiology, Physics and Mathematics in Modeling Prosodic Features of Speech

AUTHOR(S):
Fujisaki, Hiroya; Professor Emeritus, The University of Tokyo

Abstract:
This paper presents the author's view on prosody, information, and models, as well as on the roles of physiology, physics and mathematics in modeling, and describes the theoretical and experimental bases of the command-response model for the mechanisms of F0 contour generation, which has been extensively used in the analysis and synthesis of F0 contours of utterances of various languages. Although the model represents only those factors that are inherent to the control mechanism of F0, it allows one to identify those factors that carry communicative functions of speech as input commands and as parameters of the mechanism.
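
For orientation, the command-response model is usually written as a sum, in the log F0 domain, of a baseline, phrase components, and accent components. The sketch below follows that commonly published form; the abstract itself gives no equations, and the parameter values (alpha = 3/s, beta = 20/s, gamma = 0.9) are typical textbook figures treated here as assumptions. Function names are hypothetical.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism: alpha^2 * t * exp(-alpha*t)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, with ceiling gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, fb, phrase_cmds, accent_cmds,
               alpha=3.0, beta=20.0, gamma=0.9):
    """Generate F0 values (Hz) at the given times (s).

    fb          -- baseline frequency in Hz
    phrase_cmds -- list of (onset_time, magnitude) impulses
    accent_cmds -- list of (onset_time, offset_time, amplitude) pedestals
    """
    contour = []
    for t in times:
        log_f0 = math.log(fb)
        for t0, ap in phrase_cmds:
            log_f0 += ap * phrase_response(t - t0, alpha)
        for t1, t2, aa in accent_cmds:
            log_f0 += aa * (accent_response(t - t1, beta, gamma)
                            - accent_response(t - t2, beta, gamma))
        contour.append(math.exp(log_f0))
    return contour


# One phrase command at 0 s and one accent command from 0.2 s to 0.5 s,
# sampled every 10 ms over one second (illustrative values only).
times = [i / 100 for i in range(100)]
f0 = f0_contour(times, fb=100.0,
                phrase_cmds=[(0.0, 0.5)],
                accent_cmds=[(0.2, 0.5, 0.4)])
```

The input commands (timing and magnitude) are where communicative functions would enter, while alpha, beta, and gamma characterise the mechanism itself.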

 
Special Session 5: Articulatory-Functional Approaches to Speech Prosody 2 of 4

Planning Compensates for the Mechanical Limitations of Articulation

AUTHOR(S):
Kochanski, Greg P.; The University of Oxford
Shih, Chilin; University of Illinois at Urbana-Champaign

Abstract:
We explore a simple model of speech articulation. The model consists of an articulator combined with the ability to remember and improve the neural drive signal for the articulator. Over many productions, the system learns a neural drive signal that provides an accurate match for acoustically-defined targets. In fact, the match can be better than expected, yielding narrower regions of coarticulation than the intrinsic muscle response time. Further, despite the time delay introduced by the muscle, the articulatory response has no time delay, because the learned neural drive signal occurs in advance of changes in the acoustic targets. Finally, we test the model against tonal production data from Mandarin conversation, and show that it can represent non-trivial surface intonation patterns with simple and linguistically reasonable targets.

 
Special Session 5: Articulatory-Functional Approaches to Speech Prosody 3 of 4

What is Emphasis and How is it Coded?

AUTHOR(S):
Kohler, Klaus J.; Institute of Phonetics and Digital Speech Processing (IPDS), Christian-Albrechts-University at Kiel

Abstract:
The meaning category emphasis is examined with regard to its semantic, pragmatic, and affective components and their prosodic coding in German, English, and Dutch. In particular, a distinction is made between emphasis for focus, which singles out elements of discourse by making them more salient than others, and emphasis for intensity, which intensifies the meaning contained in the elements. To evaluate intensity negatively a force accent comes into play, which is signalled by non-pitch features. The question of universals is also addressed.

 
Special Session 5: Articulatory-Functional Approaches to Speech Prosody 4 of 4

Speech prosody as articulated communicative functions

AUTHOR(S):
Xu, Yi; University College London

Abstract:
Speech prosody, just like the segmental aspect of speech, conveys communicative meanings by encoding functional contrasts. The contrasts are realized through articulation, a biomechanical process with specific constraints. Prosodic phonology or any other theory of prosody therefore cannot be autonomous from either communicative functions or biophysical mechanisms. Successful modeling of speech prosody can be achieved only if communicative functions and biophysical mechanisms are treated as the core rather than the margins of prosody.

 
Poster Session 7 (PS 7): Cross-linguistic Studies and Prosodic Variability
Friday, May 5, 14:40 - 16:10
Chair: Kjell Gustafson
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 1 of 26

Stress Patterns of Complex German Cardinal Numbers

AUTHOR(S):
Wagner, Petra; Universität Bonn
Paulson, Meike; Universität Bonn

Abstract:
German cardinal numbers show variable stress patterns on the phonetic surface. Former studies showed that these cannot be explained by stress shift. In a combined production and perception study, the hypothesis is tested that German cardinal numbers are of a hybrid phonological nature: sentence medially, they behave like compounds following the CSR, while they behave like phonological phrases following the NSR when occurring phrase finally. The hypotheses were tested and confirmed for the majority of cases.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 2 of 26

The Temporal Structure of Penta- and Hexasyllabic Words in Estonian

AUTHOR(S):
Lippus, Pärtel; University of Tartu
Pajusalu, Karl; University of Tartu
Teras, Pire; University of Tartu

Abstract:
This article concentrates on five- and six-syllable Estonian words consisting of two or more metric feet of the first quantity degree (Q1), comparing the temporal structures of the feet. After an introductory discussion of the problems related to secondary stressed feet, the article first deals with half-length of unstressed syllables in Q1 feet. This is followed by an analysis of durations and duration ratios of primary and secondary stressed Q1 feet of five- and six-syllable words. It appears that in these long words the temporal structure of Q1 feet is not similar. It differs from the structure of Q1 feet of shorter (di- to tetrasyllabic) words, where there is a significant lengthening of the unstressed vowel (V2). The results show that in Estonian the whole structure of the prosodic word determines the temporal structure of feet.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 3 of 26

Intonational Differences in Lombard Speech: Looking Beyond F0 Range

AUTHOR(S):
Welby, Pauline; Institut de la Communication Parlée

Abstract:
Previous studies on speech in noise have generally reported an increase in fundamental frequency (F0). I examine three other potential intonational differences: choice of intonation pattern, tonal scaling, and tonal alignment. Seven French speakers read short paragraphs in quiet and in 80 dB white noise. Four speakers increased F0 range across the target accentual phrases in noise. Six speakers upscaled individual tones; there was great inter-speaker variability in tonal scaling, in contrast with an earlier study on Dutch. No influence of noise on intonation pattern type was found; there was no tendency to produce more "early rises" in noise, even though these rises are cues to word segmentation. Producing an early rise (thus an LHLH or LHH pattern) may not add to the salience of the commonly produced LH pattern. In addition, no difference in tonal alignment was found, in contrast to the findings of an earlier study. This may be due to paradigm differences between the studies.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 4 of 26

A Perceptual Study on Variability in Break Allocation within Chinese Sentences

AUTHOR(S):
Chu, Min; Microsoft Research Asia
Dong, Honghui; Institute of Automation, CAS
Tao, Jianhua; Institute of Automation, CAS

Abstract:
This paper investigates the variability of break allocations within Chinese sentences by perceptual experimentation. The results confirm the existence of prosodic chunks. We have found that (1) prosodic chunks are the basic units in the rhythmic organization of Chinese utterances (breaks can generally be allocated at chunk boundaries, and breaks placed within a chunk significantly decrease the naturalness of synthesized speech); (2) given prosodic chunks, multiple break solutions are acceptable. Furthermore, breaks can be allocated at chunk boundaries using simple rules that impose a length-balance constraint without considering the syntactic or semantic structure of a sentence.
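
One way to picture a length-balance constraint over chunk boundaries is the sketch below. It is not the authors' rule set, which the abstract does not spell out: the cost function (difference between the longest and shortest phrase, in syllables), the exhaustive search, and the name balanced_breaks are all assumptions made for illustration.

```python
from itertools import combinations


def balanced_breaks(chunk_lengths, n_breaks):
    """Pick n_breaks chunk boundaries so that phrase lengths are as even as possible.

    chunk_lengths -- syllable counts of consecutive prosodic chunks
    Returns a tuple of boundary indices; index i means "break before chunk i".
    Breaks are only ever placed between chunks, never inside one.
    """
    n = len(chunk_lengths)
    best, best_cost = (), float("inf")
    for cut in combinations(range(1, n), n_breaks):
        bounds = (0,) + cut + (n,)
        phrase_lengths = [sum(chunk_lengths[a:b])
                          for a, b in zip(bounds, bounds[1:])]
        cost = max(phrase_lengths) - min(phrase_lengths)
        if cost < best_cost:  # keep the first of several equally good solutions
            best, best_cost = cut, cost
    return best


# Four chunks of 2, 3, 2 and 3 syllables, one break to place.
breaks = balanced_breaks([2, 3, 2, 3], 1)
```

Because ties are broken arbitrarily here, several outputs can be equally balanced, which is in keeping with finding (2) that multiple break solutions are acceptable. No syntactic or semantic information is consulted.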

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 5 of 26

Contextual Variability of Third-Tone Sandhi in Taiwan Mandarin

AUTHOR(S):
Chen, Chun-Mei; University of Texas at Austin

Abstract:
This study investigates the phonetic property of Third-Tone Sandhi in Taiwan Mandarin and the effects of contextual variability. The goal of this study is to provide empirical evidence for the description of Tone 2 (T2) and Tone 3 (T3) in Taiwan Mandarin and further to account for the phonetic features of T2 and T3 in Third-Tone Sandhi Contexts. The results show that isolated T2 is different from isolated T3 in Taiwan Mandarin. The phonetic T2 (< /T3/) derived from Third-Tone Sandhi Rule in Sandhi Context has more raising effect than the underlying T2 in the same Sandhi Context. The greater raising effect of the T3 in Sandhi Context was supported by its longer vowel duration. Third-Tone Sandhi Rule turns T3T3 into T2T3, and anticipatory dissimilation enhances the raising effect on the Sandhi.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 6 of 26

Voice quality variations throughout the study of the accent of Liverpool

AUTHOR(S):
Coadou, Marion; Laboratoire Parole et Langage

Abstract:
Voice quality is a term frequently used by phoneticians; however, defining it precisely is quite difficult. The voice quality of a speaker is the result of the interaction between organic and phonetic factors (Abercrombie, D., 1967 and Laver, J., 1980). The organic factors may refer, for example, to the size or the shape of the vocal tract. The phonetic factors, which are studied here, can be due to muscular adjustments learnt by the speakers in their social environment. First, this study proposes a definition of some key concepts needed to understand voice quality. Then, the corpus is analysed using the Vocal Profile Analysis Scheme. This pilot study on four subjects from Liverpool shows that it is possible to observe variations of voice quality between various accents of the British Isles.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 7 of 26

Prosodic Structure Affects the Production and Perception of Voice-Assimilated German Fricatives

AUTHOR(S):
Kuzla, Claudia; Max-Planck-Institut für Psycholinguistik
Ernestus, Mirjam; Max-Planck-Institut für Psycholinguistik
Mitterer, Holger; Max-Planck-Institut für Psycholinguistik

Abstract:
Prosodic structure has long been known to constrain phonological processes. More recently, it has also been recognized as a source of fine-grained phonetic variation of speech sounds. In particular, segments in domain-initial position undergo prosodic strengthening, which also implies more resistance to coarticulation in higher prosodic domains. The present study investigates the combined effects of prosodic strengthening and assimilatory devoicing on word-initial fricatives in German, the functional implication of both processes for cues to the fortis-lenis contrast, and the influence of prosodic structure on listeners' compensation for assimilation. Results indicate that 1. Prosodic structure modulates duration and the degree of assimilatory devoicing, 2. Phonological contrasts are maintained by speakers, but differ in phonetic detail across prosodic domains, and 3. Compensation for assimilation in perception is moderated by prosodic structure and lexical constraints.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 8 of 26

Is there a distinction between H+!H* and H+L* in standard German? Evidence from an acoustic and auditory analysis

AUTHOR(S):
Rathcke, Tamara; Institute of Phonetics and Digital Speech Processing Kiel
Harrington, Jonathan; Institute of Phonetics and Digital Speech Processing Kiel

Abstract:
This paper is concerned with intonation in German and whether there is a phonological distinction between two types of early peaks H+L* and H+!H*. Speech perception and production data are presented to shed light on this issue. The results show little evidence for a phonological distinction between these categories. The results are interpreted in terms of the relationship between downstep and early peak placement in German.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 9 of 26

Acoustic Differentiation of L- and L-L% in Switchboard and Radio News Speech

AUTHOR(S):
Kim, Heejin; University of Illinois at Urbana-Champaign
Yoon, Tae-Jin; University of Illinois at Urbana-Champaign
Cole, Jennifer; University of Illinois at Urbana-Champaign
Hasegawa-Johnson, Mark; University of Illinois at Urbana-Champaign

Abstract:
Acoustic evidence for a distinction between low-toned intermediate (ip) and intonational phrase (IP) boundaries is presented from two speech corpora representing spontaneous, conversational speech and scripted broadcast speech. Robust effects of the two boundary levels are found in the phrase-final syllable rime in both corpora. Nucleus duration is longer and the F0 value at rime end is lower at IP boundaries compared to ip boundaries. Glottalization is also more frequent before an IP boundary. Other effects of boundary level on the F0 and intensity contours over the phrase-final rime are evident but variable across the two corpora. These findings support the Beckman-Pierrehumbert theory of intonation (Beckman and Pierrehumbert 1986) in its recognition of two levels of prosodic phrasing.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 10 of 26

Additive Effects of Phrase Boundary on English Accented Vowels

AUTHOR(S):
Lee, Eun-Kyung; University of Illinois at Urbana-Champaign
Cole, Jennifer; University of Illinois at Urbana-Champaign
Kim, Heejin; University of Illinois at Urbana-Champaign

Abstract:
This paper investigates cumulative effects of strengthening and lengthening on English vowels across two prominence-bearing prosodic factors, phrasal accent and prosodic phrase boundary. F1, F2 and duration measures are compared across vowels in three prosodic contexts: ip-medial unaccented, ip-medial accented, and ip-final accented. The results show that for most vowels there is only one degree of vowel strengthening, conditioned by phrasal accent, without any additive strengthening effect of prosodic phrase boundary. Lengthening is observed in both accent and added phrase boundary conditions, and the effect is consistently cumulative for at least some vowels, suggesting a gradient increase of duration as a function of the strength of prosodic structure. This finding also provides compelling evidence that strengthening and lengthening effects are two independent mechanisms that serve to mark prosodically strong positions.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 11 of 26

Is irregular phonation a reliable cue towards the segmentation of continuous speech in American English?

AUTHOR(S):
Surana, Kushan; MIT
Slifka, Janet; MIT

Abstract:
This paper analyzes the potential use of irregular phonation as a cue for the segmentation of continuous speech. The analysis is conducted on two dialect regions of the TIMIT database, which consists of read, isolated utterances. The data set encompasses 114 speakers, resulting in 1331 hand-labeled irregular tokens. The study shows that 78% of the irregular tokens occur at word boundaries and 5% occur at syllable boundaries. Of the irregular tokens at syllable boundaries, 72% are either at the junction of a compound word (e.g., "outcast") or at the junction of a base word and a suffix. Of the irregular tokens which do not occur at word or syllable boundaries, 70% occur adjacent to voiceless consonants, mostly in utterance-final location. These observations support irregular phonation as an acoustic cue for syntactic boundaries in connected speech. Detection of regions of irregular phonation could improve speech recognition and lexical access models. [Work supported by NIH # DC02978.]

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 12 of 26

A preliminary study of prosodic patterns in two varieties of suburban youth speech in France

AUTHOR(S):
Le Gac, David; University of Rouen
Jamin, Mikaël; Nottingham University
Lehka, Iryna; University of Rouen

Abstract:
This paper presents the first results of research on the prosodic specificities of French speakers living in two poor multi-ethnic suburbs located in the north of Paris and in Rouen. The emphasis is on the acoustic analysis and comparison of some particular prosodic patterns which are frequently used in suburban youth speech. We show that there is no noteworthy difference between speakers from the two suburbs. In particular, we found that both groups of speakers use rise-fall patterns associated with short syllables at the end of IP. This pattern is atypical in standard French, and its presence in both groups suggests that it constitutes a prosodic marker that is essential to the identification of the suburban accent.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 13 of 26

Evidence for 'soft' preplanning in tonal production: Initial scaling in Romance

AUTHOR(S):
Prieto, Pilar; ICREA-Universitat Autónoma de Barcelona
D'Imperio, Mariapaola; CNRS-Université de Provence
Elordieta, Gorka; Euskal Herriko Unibertsitatea
Frota, Sónia; Universidade de Lisboa
Vigário, Marina; Universidade do Minho

Abstract:
In this study, the scaling of utterance-initial f0 values and H initial peaks is examined in several Romance languages as a function of phrasal length. The motivation for this study stems from contradictory claims in the literature regarding whether the height of the initial f0 values and peaks is governed by a look-ahead or preplanning mechanism. A total of ten speakers of five Romance language varieties (Catalan, Italian, Standard and Northern European Portuguese, and Spanish) read a total of 3720 declarative utterances (744 utterances per language) of varying length in number of pitch accents and syllables. The data reveal that the majority of speakers tend to begin higher in longer utterances. The failure to find a correlation between phrase length and initial scaling for all speakers within languages shows that we are dealing with soft preplanning (in [3]'s terms), that is, an optional production mechanism that may be overridden by other tonal features.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 14 of 26

A scaling contrast in Majorcan Catalan interrogatives

AUTHOR(S):
Vanrell, Maria del Mar; Universitat Autónoma de Barcelona

Abstract:
This paper reports the application of the Categorical Perception Paradigm (CP) to a pitch height contrast in Majorcan Catalan. The first hypothesis is that pitch height is the primary perceptual cue in distinguishing yes-no questions from wh-questions in Majorcan Catalan. The second hypothesis predicts that, as in previous studies, the application of the CP involves the presence of order of presentation effects in the results of the discrimination task. The results show that the primary perceptual cue is the presence of upstep in yes-no questions and confirm the existence of an order of presentation effect that deserves further investigation.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 15 of 26

Morphotonology for TTS in Niger-Congo languages

AUTHOR(S):
Gibbon, Dafydd; Universität Bielefeld
Urua, Eno-Abasi; University of Uyo

Abstract:
Many East Asian languages have lexical (i.e. phonemic) prosody; African languages are also frequently mentioned as tone languages. However, tone functionality in African tone languages is fundamentally morphosyntactic rather than phonemic: (a) tonal pattern types are restricted to particular parts of speech, (b) tones may be inflectional and play a role in (c) derivational and (d) compounding word formation patterns, and (e) in syntactic phrasal templates. The aim of this paper is to document the morphosyntactic functionality of tones in African languages within a typological context as compared to East Asian tone languages such as Mandarin, and to develop finite state architectures for tone handling in practical Text-To-Speech synthesis in health and agriculture information projects in Ivory Coast and Nigeria. Morphosyntactic tone is illustrated for Ibibio (Lower Cross, South-Eastern Nigeria).

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 16 of 26

Non- and Quasi-lexical Realizations of 'Positive Response' in Korean, Polish and Thai

AUTHOR(S):
Karpiński, Maciej; Adam Mickiewicz University
Kleśta, Janusz; Adam Mickiewicz University
Szalkowska, Emilia; Adam Mickiewicz University

Abstract:
This paper presents a basic comparative study of Korean, Polish and Thai short words, quasi-words and vocalizations used to perform the dialogue moves collectively referred to as "positive responses" in map task dialogues. Some of these units are produced as non-linguistic vocalizations, while others are "fully legitimate" linguistic entities. The frequencies of occurrence for the analyzed units were quite high and similar for the three languages. The numbers of expression categories were almost identical. However, the tendencies found in the Korean and Thai intonational contours were more distinct than for Polish. The inventories of units for all the three languages included borrowings. The nasal vocalization mhm not only ranked among the most popular expression categories for each of the languages, but was also consistently produced with a rising contour. The normalized pitch change was remarkably higher in the Polish expressions than in the Korean and Thai units.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 17 of 26

Replicating in Naxi (Tibeto-Burman) an Experiment Designed for Yorùbá: An Approach To 'Prominence-Sensitive Prosody' vs. 'Calculated Prosody'

AUTHOR(S):
Michaud, Alexis; Laboratoire de Phonétique et Phonologie (UMR 7018) CNRS/ Paris 3 Sorbonne Nouvelle

Abstract:
An experiment originally designed to investigate the tones of Yorùbá (H, M and L) is here replicated for Naxi, a Tibeto-Burman language which likewise has H, M and L tones. The data consist in sentences in which all syllables bear the same tone. For Naxi, the stylisation of the F0 curves raises difficulties that were apparently not present in Yorùbá: in Naxi, intonational junctures are manifested by lengthening and a downward tilt in F0 which may not be adequately captured by the two-point stylisation used for Yorùbá. The typological discussion suggests that there may be a continuum between (i) the 'calculated prosody' of languages such as Ngamambo, whose prosodic structure hinges on the calculation of a tone sequence, and (ii) the 'prominence-sensitive prosody' of languages such as English, Chinese or Vietnamese (and to a lesser extent Naxi), in which intonation appears to reflect phrasing and informational structure in a flexible, typically noncategorical way.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 18 of 26

Pitch and Voice Quality Characteristics of the Lexical Word-Tones of Tamang, as Compared with Level Tones (Naxi data) and Pitch-plus-Voice-Quality Tones (Vietnamese data)

AUTHOR(S):
Michaud, Alexis; Laboratoire de Phonétique et Phonologie (UMR 7018) CNRS/ Paris 3 Sorbonne Nouvelle
Mazaudon, Martine; LACITO, UMR 7107 CNRS/ Paris 3 & 4

Abstract:
The tones of Tamang (Sino-Tibetan family) involve both F0 and voice quality characteristics: two of the four tones (tones 3 and 4) were reported to be breathy in studies from the 1970s. For the present research, audio and electroglottographic data were collected from 5 speakers in their 30s or 40s. Voice quality is estimated by computing the glottal open quotient. The present results (bearing on 788 syllables) show that in the speech of three speakers, tones 3 and 4 have a higher open quotient (providing an indirect cue to breathiness) than tones 1 and 2. The difference in open quotient between the four tones for the other two speakers is negligible or inconsistent. The Tamang data are compared with similar data from Naxi, which possesses level tones, and from Vietnamese, which possesses pitch-plus-voice-quality tones. The results appear to confirm that Tamang tones possess several correlates; they offer an insight on ongoing change in the prosodic system of Tamang.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 19 of 26

The intonation of Banyumas Javanese

AUTHOR(S):
Stoel, Ruben; Universiteit Leiden

Abstract:
I will present an analysis of the intonation of the Banyumas dialect of Javanese (an Austronesian language spoken in Indonesia), based on the autosegmental-metrical framework. As Javanese is a language without word stress, I assume that there are no pitch accents. Accentual Phrases (AP) are marked by boundary tones. An H% tone marks the end of a pre-nuclear AP, while the nuclear AP ends in a HL%, LH%, or HL0% tone. This tone marks the end of the focus. Any postfocal material appears in an encliticized AP. This material must correspond to a syntactic XP. Contrastive focus at the word level is possible in only a few special constructions.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 20 of 26

Syllable cut and energy contour: a contrastive study of German and Hungarian

AUTHOR(S):
Mády, Katalin; Institute of German Studies, Pázmány Péter Catholic University
Tronka, Krisztián Z.; Institute of German Studies, Pázmány Péter Catholic University
Reichel, Uwe D.; Department of Phonetics and Speech Communication, University of Munich

Abstract:
Syllable cut is said to be a phonologically distinctive feature in some languages where the difference in vowel quantity is accompanied by a difference in vowel quality like in German. There have been several attempts to find the corresponding phonetic correlates for syllable cut, from which the energy measurements of vowels by Spiekermann proved appropriate for explaining the difference between long and short vowels. On this basis, we intended to compare German as a syllable cut language and Hungarian where the feature was not expected to be relevant. However, the phonetic correlates of syllable cut found in this study do not entirely confirm Spiekermann's results. It seems that the energy features of vowels are more strongly connected to their duration than to their quality.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 21 of 26

Lexical Stress Realisation: Native vs. ESL Speech

AUTHOR(S):
Jian, Hua-Li; National Cheng Kung University

Abstract:
English stress placement in phrase-medial and phrase-final positions is investigated. Current results indicate that Taiwanese ESL learners realise polysyllabic words that carry various degrees of stress in the two prosodic positions with considerable differences relative to native American English speakers, and the differences are demonstrated from acoustic and phonetic perspectives.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 22 of 26

Acoustic and perceptual cues for compound-phrasal contrasts in Vietnamese

AUTHOR(S):
Nguyen, Thu; University of Queensland
Ingram, John; University of Queensland

Abstract:
This paper reports two experiments that examined the acoustic and perceptual cues that Vietnamese speakers use to distinguish between compounds and noun phrases. Fifteen minimal sets of the two patterns, classified into three different word/phrase types (noun-adjective (hoa [flower] hôong [pink]: pink flower), noun-verb (bò [ox] cày [plough]: ox ploughing), and noun-noun (bàn [table] giây [paper]: paper table)), were recorded in two experimental conditions, one with a picture-naming task and one with a minimal pair sentence task, by 45 Vietnamese native speakers of 3 dialects (Hanoi, Hue, and Saigon). In a perception task, the meaning of the patterns was identified in a forced choice test by 15 listeners. The results showed that while there is evidence that Vietnamese speakers use juncture and pre-pausal lengthening to distinguish between compounds and phrases, no significant acoustic and perceptual evidence was found to support a claim for contrastive stress patterns between compounds and noun phrases in Vietnamese.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 23 of 26

Pitch Range is not Pitch Range

AUTHOR(S):
Ulbrich, Christiane; University of Ulster

Abstract:
This paper presents a phonetic analysis of pitch range as perceived and measured on utterance and syllable level. A previous analysis of read speech showed that German speakers produced a larger pitch range on utterance level, whereas Swiss German speakers produced a larger pitch range on syllable level. This analysis was based on the production of broadcasters reading news messages and a fairytale, both stylistically very restricted and largely standardized. Therefore, in the present study semi-spontaneous and spontaneous utterances are analyzed to provide evidence that these findings are cross-linguistic rather than discourse-specific. The evidence was provided by auditory annotation and acoustic measurements.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 24 of 26

Pitch range variation in child affective speech

AUTHOR(S):
Grichkovtsova, Ioulia; QMUC
Mennen, Ineke; QMUC

Abstract:
This study investigates pitch range variation in the affective speech of bilingual and monolingual children. Cross-linguistic differences in affective speech may lead bilingual children to express emotions differently in their two different languages. A cross-linguistically comparable corpus of 6 bilingual Scottish-French children and 12 monolingual peers was recorded according to the developed methodology. The results show that the majority of children use pitch range measurements (overall level and span) to realize differences between some emotions. Monolingual children use analyzed acoustic parameters in a much more homogeneous way than bilinguals. Some results of bilingual children do not strictly correspond to those of monolinguals, and show bidirectional interference.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 25 of 26

The Effect of Glottalization on Voice Preference

AUTHOR(S):
Ding, Hongwei; Dresden University of Technology
Jokisch, Oliver; Dresden University of Technology
Hoffmann, Rüdiger; Dresden University of Technology

Abstract:
The impact of phrasal prosody on glottalization is documented in many publications. Besides prosodic boundary and stress, other influencing factors such as the speaking style have been studied. The work reported here examines the relationship between the objective preference of listeners and the occurrence of speakers' glottalization. The speech data in six languages were used for the multilingual speaker selection in speech synthesis and have been compiled into listening test phrases. Additional experiments, concerning the influence of reading style on glottalization, were conducted with prosodically constant words or phrases in two languages. Evaluating the statistics from this investigation, we come to the following conclusions: (a) The occurrence and degree of glottalization can differ across speakers. (b) As a prosodic effect, glottalization is NOT undesired for speakers. (c) A well-defined reading style can increase the occurrence.

 
Poster Session 7: Cross-linguistic Studies and Prosodic Variability 26 of 26

Transcribing intonational variation at different levels of analysis

AUTHOR(S):
Post, Brechtje; University of Cambridge
Delais-Roussarie, Elisabeth; CNRS / Université de Paris 7

Abstract:
In the transcription system for Intonational Variation (IVTS, derived from IViE), prosodic features are transcribed on (1) the rhythmic tier, (2) the local phonetic tier, (3) the global phonetic tier, and (4) the phonological tier. Each tier offers a range of labels which share a general architecture, but language-specific parameters determine which subset of labels a transcriber can choose from for the transcription of a particular language variety, and how the different tiers are associated with one another. In this paper, we will argue that the multi-linear architecture of IV-based systems offers transparency, flexibility and standardization, three key advantages in qualitative and quantitative studies of intonational variation across languages and language varieties.

 
Poster Session 8 (PS 8): Language Acquisition and Learning, Conversational Speech, and Neural Processing
Friday, May 5, 14:40 - 16:10
Chair: Nobuaki Minematsu
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 1 of 17

Acquisition of Prosody in a Spanish-English Bilingual Child

AUTHOR(S):
Kim, Sahyang; Wayne State University
Andruski, Jean; Wayne State University
Casielles, Eugenia; Wayne State University
Nathan, Geoff; Wayne State University
Work, Richard; Wayne State University

Abstract:
This study examined the pattern of prosodic phrasing and the distribution of post-lexical pitch accent types in a Spanish-English bilingual child. We collected utterances from natural interactions between the parents and the child, and analyzed them using MAE ToBI and SP ToBI. We compared prosodic development across ages, and compared the child's speech production with his parents' productions. Results showed that both the child and the parents divide their short utterances into smaller prosodic phrases and that most content words bear a post-lexical pitch accent, which can make the word segmentation task easy for children. The majority of the child's English words were produced with H*. This was similar to his father's pitch accent pattern, but he produced a higher number of H* accents than his father. He could produce the L+H* Spanish nuclear pitch accent with a similar frequency to that found in his input, but could not produce as many L*+H accents as his mother in the prenuclear pitch accent context.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 2 of 17

Intonation Phrasing in Chinese EFL Learners' Read Speech

AUTHOR(S):
Chen, Hua; Nantong University

Abstract:
Intonation phrasing refers to the system of intonational choices that a speaker has when associating complete intonation patterns with a text. The number of patterns and the boundaries may vary and convey different meanings. This study investigates the intonation phrasing patterns in Chinese EFL learners' read speech. Recordings of 45 Chinese students were compared with those of 8 British native speakers. The recorded speech was annotated and analyzed with PRAAT, and the learners' prosodic features were compared with those of native speakers in order to find the non-native-like aspects of the learners' oral performance. Findings show that learners differ from native speakers in 1) the frequency of boundary markers, and 2) the realization of some tonality constraints. The study has important implications for China's EFL pedagogy as well as for the improvement of rating rubrics for China's oral English tests.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 3 of 17

Prosodic characteristics in the Speech of Chinese EFL learners

AUTHOR(S):
Makarova, Veronika; University of Saskatchewan
Zhou, Xia; University of Saskatchewan

Abstract:
This study reports some prosodic characteristics of the quasi-spontaneous classroom speech of Chinese EFL learners. Recordings of ten dialogues produced by twenty second-year non-English majors were analyzed to extract the following features: durations of inter- and intra-turn pauses, duration of filled pauses, numbers of words per tone unit, tone unit durations, speech rates and pitch accent type (tone) statistics. The deviations from standard native speech in the areas of tonality and tonicity are also considered. The paper offers some practical suggestions aimed at improving the prosodic characteristics of the English speech of Chinese EFL learners.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 4 of 17

A Rhythmic Analysis on Chinese EFL Speech

AUTHOR(S):
Li, Aijun; Institute of Linguistics, Chinese Academy of Social Sciences
Yin, Zhigang; Institute of Linguistics, Chinese Academy of Social Sciences
Zu, Yiqing; MOTOROLA Research Center China

Abstract:
This paper, based on a phonetic experiment, presents a contrastive study of the rhythmic patterns of Chinese learners of English as a foreign language (CL2) compared with those of native speakers of both standard British and American English (EL1), in terms of their respective pitch accent distribution patterns, prosodic structures and duration patterns.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 5 of 17

Unstressed vowels in non-native German

AUTHOR(S):
Gut, Ulrike; English Department, University of Freiburg

Abstract:
Vowel reduction and deletion are prominent correlates of stress in German, and some preliminary investigations have suggested that this constitutes an area of difficulty for non-native speakers. This paper explores the production of vowels in unstressed syllables by learners of German, focusing especially on the acoustic properties of duration and formant structure. It is shown that the realization of unstressed vowels in non-native German is influenced by the speakers' native language (L1), but not by speaking style.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 6 of 17

Native Intuitions of Speakers of a Lexical Accent System in L2 Acquisition of Stress. The Case of Russian Learners of Polish

AUTHOR(S):
Kijak, Anna; Utrecht Institute of Linguistics OTS

Abstract:
Native speakers of a lexical accent system (Russians) were tested on their L2 acquisition of a phonological stress system (Polish). In Russian, a sizeable part of the lexicon is marked underlyingly for accents and claims on the position of default stress vary. This makes it interesting to investigate which L1 characteristics (distribution of lexical accents vs. phonological default) are transferred to L2 (if any). 35 Russian subjects were tested on their L2 production of Polish stress. The data shows a very consistent and almost uniform source of mistakes: the stem-final position. These results mirror one of the claims on the default stress in Russian suggesting that L2 errors originated from L1 transfer of that default. L1 transfer generally did not reflect the distribution of lexical accents (though the latter were not completely excluded, they were restricted in their type). Results on the individual level show various subjects possibly followed two different L2 learning paths.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 7 of 17

Using Prosodic and Voice Quality Features for Paralinguistic Information Extraction

AUTHOR(S):
Ishi, Carlos Toshinori; ATR/IRC
Ishiguro, Hiroshi; ATR/IRC
Hagita, Norihiro; ATR/IRC

Abstract:
The use of voice quality features in addition to prosodic features is proposed for automatic extraction of paralinguistic information (such as speech acts, attitudes and emotions) in dialog speech. Perceptual experiments and acoustic analysis are conducted for monosyllabic utterances spoken in several speaking styles, carrying a variety of paralinguistic information. Acoustic parameters related to prosodic and voice quality features potentially representing the variations in speaking styles are evaluated. Experimental results indicate that prosodic features are effective for identifying some groups of speech acts with specific functions, while voice quality features are useful for identifying utterances with an emotional or attitudinal expressivity.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 8 of 17

A trial of communicative prosody generation based on control characteristic of one word utterance observed in real conversational speech

AUTHOR(S):
Greenberg, Yoko; GITS, Waseda University
Shibuya, Nagisa; GITS, Waseda University
Tsuzaki, Minoru; Kyoto City University of Arts
Kato, Hiroaki; ATR Human Information Science Labs
Sagisaka, Yoshinori; GITS, Waseda University

Abstract:
Aiming at prosody control for conversational speech synthesis, communicative prosodies were generated based on the prosodic characteristics derived from the one-word utterance "n". First, a large number of "n" tokens recorded in an actual environment were analyzed using an F0 generation model to see what kinds of prosodic variation could exist and how they were generated. Based on the results of this analysis, simple rules were established for conversion to other speaking styles expressing three dimensions of perceptual impression: confident-doubtful, allowable-unacceptable and positive-negative. Finally, a naturalness evaluation test was conducted to see how effectively the prosody conversion rules derived from "n" could be applied to authentic phrases. The results showed the validity of applying the conversion rules to actual phrases. This indicates the possibility of systematic prosody control for conversational speech synthesis using a corpus-based approach.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 9 of 17

Intonational cues to discourse structure in Bari and Pisa Italian: perceptual evidence

AUTHOR(S):
Savino, Michelina; University of Bari
Grice, Martine; University of Cologne
Gili Fivela, Barbara; University of Lecce
Marotta, Giovanna; University of Pisa

Abstract:
Perception experiments for Bari and Pisa Italian showed that listeners can reliably distinguish final and non-final utterances in discourse by means of intonation. Bari listeners were also able to distinguish a third category, signalling that the end of the discourse unit is approaching (penultimate position). This was not the case for Pisa listeners.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 10 of 17

The intonation of polar questions in two central varieties of Italian

AUTHOR(S):
Giordano, Rosa; University of Naples Federico II - University of Salerno

Abstract:
Growing attention is now being given to the contrastive analysis of prosodic structures and melodies (see, among others, considerations by Ladd 1996 or studies by Grabe and other scholars and Peters et al. 2004). This paper presents a contrastive analysis of question tunes: it deals with two regional varieties of Italian (Lazio and Umbria) represented by a sample of map-task dialogues collected in Rome and Perugia. A consistent similarity emerges, as these varieties share not only the same intonative forms but also the same positional constraints and the same distribution of the marked prosodic devices. Furthermore, different accent types seem to be related to different kinds of questions. Differences between the two varieties are found in the presence and the use of some accentual and edge tones.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 11 of 17

Interaction of verb accentuation and utterance finality in Bangla

AUTHOR(S):
Dutta, Indranil; University of Illinois at Urbana-Champaign
Hock, Hans Henrich; University of Illinois at Urbana-Champaign

Abstract:
In this study we present data from three experiments that provide robust, unambiguous evidence that Bangla conforms to the cross-linguistic avoidance of prominence on utterance-final verbs in SOV languages.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 12 of 17

Two contours, two meanings: the intonation of jaja in German phone conversations

AUTHOR(S):
Golato, Andrea; University of Illinois at Urbana-Champaign
Fagyal, Zsuzsanna; University of Illinois at Urbana-Champaign

Abstract:
This paper shows that jaja 'yes yes' sequences in German conversations carry two distinct interactional meanings cued by their intonation and sequential placement. Combined Conversation Analytic (CA) and Intonation Phonological analyses indicate that jaja tokens uttered with H* L-% intonation (following GToBI) convey that the previous speaker has persisted too long in a specific course of (verbal) action which should therefore be stopped. By contrast, jaja tokens with L+H* L-% intonation are used in situations of fractured intersubjectivity, i.e., immediately after speakers misalign: with the jaja turn, its speaker treats the action/content of the previous speaker's utterance as either unwarranted or self-evident. Speaking rate and regional dialectal differences notwithstanding, the two types of contour show significantly different peak alignment, and correspond to two distinct 'peak accent' nuclear contours.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 13 of 17

The Prosody of Suspects' Responses during Police Interviews

AUTHOR(S):
Fadden, Lorna; Simon Fraser University

Abstract:
This paper reports on the results of a pilot study on the prosody of Western Canadian suspects' speech as it occurs during the course of investigative interviews with police. Suspects' responses are categorized according to the type of information they contain, and the prosodic characteristics of each response type are described. It will be shown in this exploratory study that the various response types pattern consistently across a group of suspects and that it is possible to construct a set of prosodic profiles consisting of pitch range, average pitch, speech rate and hesitation values associated with each response type.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 14 of 17

Prosodic signalling of (un)expected information in South Swedish - An interactive manipulation experiment

AUTHOR(S):
Ambrazaitis, Gilbert; Center for Languages and Literature, Lund University

Abstract:
Starting from the German pitch peak timing categories and their communicative functions, it is asked how these functions would be expressed in South Swedish. The aim is to get a first impression as regards potentially relevant prosodic parameters associated with the expression of expected vs. unexpected information in South Swedish. For that, an interactive manipulation experiment is conducted, where subjects manipulate the pitch contour and duration of monosyllabic test utterances until the sound output adequately represents a given communicative function. Swedish has a tonal word accent distinction, and all test words have accent 1, normally produced with an early pitch fall. It is thus hypothesized that in South Swedish, expected vs. unexpected information will not be expressed through a different pitch peak timing, as in German. The results indeed clearly hint at unexpected information being signalled by means of a higher, rather than a later pitch peak.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 15 of 17

An fMRI study of multimodal deixis: preliminary results on prosodic, syntactic, manual and ocular pointing

AUTHOR(S):
Carota, Francesca; Institut de la Communication Parlée, UMR CNRS 5009, INPG, Univ. Stendhal, Grenoble
Lœvenbruck, Hélène; Institut de la Communication Parlée, UMR CNRS 5009, INPG, Univ. Stendhal, Grenoble
Vilain, Coriandre; Institut de la Communication Parlée, UMR CNRS 5009, INPG, Univ. Stendhal, Grenoble
Baciu, Monica; Laboratoire de Psychologie et NeuroCognition, UMR CNRS 5105, UPMF, Grenoble
Abry, Christian; Institut de la Communication Parlée, UMR CNRS 5009, INPG, Univ. Stendhal, Grenoble
Lamalle, Laurent; INSERM IFR nº 1, RMN biomédicale, Unité IRM 3T, CHU de Grenoble
Pichat, Cédric; Laboratoire de Psychologie et NeuroCognition, UMR CNRS 5105, UPMF, Grenoble
Segebarth, Christoph; Unité Mixte INSERM / Univ. J. Fourier, U594, Grenoble

Abstract:
Deixis or pointing plays a crucial role in language acquisition and speech communication. In this paper we present an innovative fMRI approach in order to examine deixis, conceived as a unitary communicative strategy which employs different verbal and non-verbal speech devices to achieve the pragmatic goal of bringing relevant information to the interlocutors' attention. We designed a unified fMRI paradigm for multimodal deixis, integrating four conditions of verbal and non-verbal pointing: 1) prosodic focus, 2) syntactic extraction, 3) index finger pointing, 4) eye pointing. Sixteen subjects were examined while they gave oral, manual and ocular responses inside the 3T magnet imager. Preliminary results based on a random effect analysis with a group of 8 subjects show that all pointing conditions recruit a left parieto-frontal network, with respect to the control condition. The findings suggest that different modalities of deixis depend on a common cerebral network.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 16 of 17

The Use of Multi-pitch Patterns for Evaluating the Positive and Negative Valence of Emotional Speech

AUTHOR(S):
Cook, Norman D.; Kansai University
Fujisawa, Takashi X.; Kansai University

Abstract:
We report the application of a psychophysical model of harmony perception to the analysis of speech intonation. The model was designed to reproduce the empirical findings on the perception of musical chords, but it does not depend on specific musical scales or tuning systems. Application to speech intonation produces values corresponding to the total dissonance, tension and affective valence among the dominant pitches used in the speech utterance.

 
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing 17 of 17

The neural mechanisms for understanding self and speaker's mind from emotional speech: an event-related fMRI study

AUTHOR(S):
Homma, Midori; Graduate School of Comprehensive Scientific Research
Imaizumi, Satoshi; Graduate School of Comprehensive Scientific Research
Maruishi, Masaharu; Hiroshima Prefectural Rehabilitation Center, Hiroshima
Muranaka, Hiroyuki; Hiroshima Prefectural Rehabilitation Center, Hiroshima

Abstract:
Using linguistically positive and negative words uttered either pleasantly or unpleasantly by four speakers, we examined the brain regions that mediate speech communication through event-related functional magnetic resonance imaging (fMRI) analyses. Subjects were adult listeners who evaluated either the speaker's mind, their own mind, or (as a control condition) the number of letters for spoken stimuli, which were randomly presented through earphones. In both the self and speaker-mind judgment tasks, the dorsal medial prefrontal cortex (dMPFC), which has been implicated in theory of mind and self-referential processing, was significantly activated, in addition to the classical cortical regions involved in processing linguistic semantics and emotional prosody of speech. These results suggest that the mental state attribution accomplished by the dorsal medial prefrontal cortex plays an important role in understanding our own and the speaker's mind in speech communication.

 
Exhibition From the Historic Acoustic-phonetic Collection of the TU Dresden
Tuesday to Friday
1 of 1

Measuring Pitch with Historic Phonetic Devices

AUTHOR(S):
Mehnert, Dieter; Technische Universität Dresden
Hoffmann, Rüdiger; Technische Universität Dresden

Abstract:
Measuring pitch is one of the most important but also most difficult tasks in experimental phonetics. It is interesting to study how these difficulties were overcome in the times before the computer was introduced into phonetic laboratories. In this paper, this is discussed using a number of exhibits from the acoustic-phonetic collection of the TU Dresden. There will be a small exhibition of historic devices at the conference Speech Prosody 2006. This paper is intended to accompany the exhibition.