Question 5/XII - Speech synthesis/recognition systems
(continuation of Question 5/XII studied in 1985-88)

Considering that speech synthesis/recognition systems will be exploited to control the access to the telephone network, to data bases or other functions through the telephone network,

agrees to study the subjective acceptability of such systems, from the viewpoint of the performance of single devices as well as of the whole interactive system/service;

decides to put the following Question, the study of which should be organized as follows:

Synthesis and recognition systems should be studied in the telecommunication environment taking into account particularly the characteristics normally found there, for example:
   1) bandwidth and/or bit rate,
   2) loss and level of signals,
   3) distortion,
   4) noise.

The Question can be segregated into the following categories:

1) Voice synthesis in telephony
      Definition of synthesis
      Vocabulary size
      Intelligibility
      Naturalness
      Listener effort

2) Voice recognition in telephony
      Vocabulary size
      Input format (isolated words or continuous speech)
      Correct recognition and rejection ratio
      Robustness to background and circuit noise and distortions
      Speaker dependency
      Language/Dialect dependency
      Training time and procedure
      Recognition time

3) Specific items which should be studied are:
   a) Which characteristics can be quantitatively measured and which assessment procedures are suggested?
   b) Can acceptability ranges be recommended?
   c) Can standardized speech data bases be established to enable the testing of recognizers and synthesizers in the telephony environment?
   d) How will administrations deal with the multi-language problem?

Study Group XII asks if Study Group II might wish to study:

4) Synthesis/recognition interactive services
      Input format (syntactic requirements)
      Error correction, ease and time required
      Response time
      Feedback response mode (audio/visual)
      Overall friendliness
      Applications

ANNEX 1
(to Question 5/XII)

List of documents for Question 5/XII, study period 1985-1988

- COM XII-15, June 1985 (British Telecom): Early contribution to propose new question on speech recognition and synthesis: method for assessing isolated word speaker dependent recognition systems.
- Annex B to the reply to Question 18/XII, in Report COM XII-R 12, September 1986 (Liaison Officer between Study Group XII and Study Group XVIII): Status report on Study Group XVIII/8 (Speech processing)
- Annex to the reply to Question 5/XII in Report COM XII-R 12, September 1986 (CSELT, Italy): Subjective assessment of automatic voice answering devices
- COM XII-148, February 1987 (France): An "objective" evaluation of difficulty in understanding voice synthesis devices
- COM XII-176, June 1987 (Sweden): Subjective quality assessment of synthetic speech

ANNEX 2
(to Question 5/XII)

Preliminary reply to Question 5/XII, in COM XII-R 29, February 1988, 2.4

ANNEX 3
(to Question 5/XII)

Contribution COM XII-176 with the following amendments:

- Page 2, between second and third paragraphs, add the following paragraph:
  "An average listening level was expected to be within the preferred range."
- Page 3, replace last but one paragraph by:
  "20 subjects participated in the test. The speech was presented to them monaurally over headphones at a comfortable listening level (approximately 80 dB sound level as measured on an artificial ear)."