Question 5/XII - Speech synthesis/recognition systems
(continuation of Question 5/XII studied in 1985-88)

Considering

that speech synthesis/recognition systems will be used to control
access to the telephone network, and to data bases or other functions
reached through the telephone network; and that it has been agreed to
study the subjective acceptability of such systems, from the viewpoint
of the performance of single devices as well as of the whole
interactive system/service;

decides to put the following Question, the study of which should be
organized as follows:

synthesis and recognition systems should be studied in the
telecommunication environment taking into account particularly the
characteristics normally found there, for example:

1) bandwidth and/or bit rate,

2) loss and level of signals,

3) distortion,

4) noise.

The Question can be divided into the following categories:

1) Voice synthesis in telephony

Definition of synthesis
Vocabulary size
Intelligibility
Naturalness
Listener effort

2) Voice recognition in telephony

Vocabulary size
Input format (isolated words or continuous speech)
Correct recognition and rejection ratio
Robustness to background and circuit noise and distortions
Speaker dependency
Language/Dialect dependency
Training time and procedure
Recognition time

3) Specific items which should be studied are:

a) Which characteristics can be quantitatively measured and which
assessment procedures are suggested?

b) Can acceptability ranges be recommended?

c) Can standardized speech data bases be established to enable the
testing of recognizers and synthesizers in the telephony environment?

d) How will administrations deal with the multi-language problem?
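
As an illustration of item a), the correct recognition and rejection
ratios listed under category 2 could be computed from a labelled test
run roughly as follows. This is a minimal sketch only, not an assessment
procedure proposed in this Question: the trial format, the function name
recognition_ratios and the exact definitions of the ratios are
assumptions made for the example.

```python
from typing import Dict, Iterable, Optional, Tuple

# Assumed trial format: (word actually spoken, recognizer output).
# An output of None is taken to mean the recognizer rejected the utterance.
Trial = Tuple[str, Optional[str]]


def recognition_ratios(trials: Iterable[Trial]) -> Dict[str, float]:
    """Return the fractions of correct, rejected and substituted trials."""
    total = correct = rejected = 0
    for spoken, recognized in trials:
        total += 1
        if recognized is None:
            rejected += 1          # recognizer declined to answer
        elif recognized == spoken:
            correct += 1           # recognizer output matches the spoken word
    if total == 0:
        raise ValueError("at least one trial is required")
    substituted = total - correct - rejected  # wrong word returned
    return {
        "correct": correct / total,
        "rejected": rejected / total,
        "substituted": substituted / total,
    }


# Hypothetical four-utterance test run.
example = [("one", "one"), ("two", "two"), ("three", None), ("four", "five")]
ratios = recognition_ratios(example)
```

The same run could be repeated under different values of the
characteristics listed earlier (bandwidth, level, distortion, noise) to
compare the resulting ratios.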

Study Group XII asks whether Study Group II might wish to study:

4) Synthesis/recognition interactive services

Input format (syntactic requirements)
Error correction, ease and time required
Response time
Feedback response mode (audio/visual)
Overall friendliness
Applications

ANNEX 1

(to Question 5/XII)

List of documents for Question 5/XII, study period 1985-1988

COM XII-15, June 1985 (British Telecom): Early contribution to propose
new question on speech recognition and synthesis: method for assessing
isolated word speaker dependent recognition systems.

Annex B to the reply to Question 18/XII, in Report COM XII-R 12,
September 1986 (Liaison Officer between Study Group XII and Study Group
XVIII): Status report on Study Group XVIII/8 (Speech processing)

Annex to the reply to Question 5/XII in Report COM XII-R 12, September
1986 (CSELT, Italy): Subjective assessment of automatic voice answering
devices

COM XII-148, February 1987 (France): An "objective" evaluation of
difficulty in understanding voice synthesis devices

COM XII-176, June 1987 (Sweden): Subjective quality assessment of
synthetic speech

ANNEX 2

(to Question 5/XII)

Preliminary reply to Question 5/XII, in COM XII-R 29,
February 1988, 2.4

ANNEX 3

(to Question 5/XII)

Contribution COM XII-176 with the following amendments:

- Page 2, between second and third paragraphs, add the following
paragraph: "An average listening level was expected to be within the
preferred range."

- Page 3, replace last but one paragraph by: "20 subjects participated in
the test. The speech was presented to them monaurally over headphones at
a comfortable listening level (approximately 80 dB sound level as
measured on an artificial ear)."