Supplement No. 10

1 Introduction

2 Receiving frequency response

3 Receiving sensitivity

4 Sending frequency response

5 Sending sensitivity

6 Regulation

7 Impedance presented to the line

8 Sidetone balance impedance

9 Interworking with the existing network

Supplement No. 11

1 Introduction

2 Talker sidetone

3 Listener sidetone

Supplement No. 13

1 Introduction

2 Room noise

3 Internal vehicle noise

Supplement No. 14

1 Introduction

2 Impairment reference scale for digital processes

3 Survey of methods

4 Background for the test methods

5 Analysis of test results

Supplement No. 15
1 Introduction
2 Arrangement of the wideband MNRU
3 Spectrum shaping filter
5 Signal-to-noise ratio
6 Other specifications
Supplement No. 16
1 General
2 Conference room acoustics -- General requirements
3 Ambient noise level considerations
4 Reverberation considerations
5 Microphone type and placement
6 Loudspeaker placement
Supplement No. 17
1 Introduction
2 Method
1 Introduction
2 Composition of the tapes
3 Frequency responses
4 Information about the tapes
5 Results

CONSIDERATIONS RELATING TO TRANSMISSION CHARACTERISTICS

FOR ANALOGUE HANDSET TELEPHONES

(Malaga-Torremolinos, 1984; amended Melbourne, 1988)

This Supplement, based on reference [9], summarizes available information on how some characteristics of handset telephones can be optimized.

It contains information about sending and receiving sensitivities, frequency responses, sidetone characteristics, and the influence of impedance and handset dimensions. It must be remembered that there are different ways to carry out an optimization. For instance, the number of degrees of freedom is essential. As opinions differ between countries (for instance, in the assumptions made), the results of the optimization will differ. This Supplement touches upon some of these aspects.

Most Administrations seem to prefer a fairly flat frequency response between 300 Hz and 3400 Hz. This probably derives from the early days of telephone networks, when it was determined that any pre-emphasis at higher frequencies should be located at the sending end to obtain the best possible overall signal-to-noise performance. If we take free-field, two-ear listening (face-to-face conversation) as a reference and assume a frequency-independent (flat) response, we should in principle simulate these conditions in one-ear telephone listening as well.

For earphone listening, the earphone should then have a frequency response as in Figure 1 to simulate the diffraction effect present in free-field, two-ear listening [1]. However, most Administrations seem to prefer a flat response and to put the corresponding correction at the sending end. It may also be easier to construct a receiver with high efficiency if the goal is a flat response. Reference [2] has suggested a response as in Figure 2, optimized for a mean local line. Where mains noise may cause problems, a response with greater loss at lower frequencies, e.g. at 200 Hz and below, may be appropriate.

Volume V -- Suppl. No. 10 1

Figure 1 Sup.10, p.1

Figure 2 Sup.10, p.2

3 Receiving sensitivity

Receiving sensitivity today is often represented by RLR values between -4 dB and -12 dB.

A further increase of the sensitivity by the use of amplifiers might be technically possible. However, the probability of audible crosstalk increases with increased sensitivity. Therefore, the information gathered in Recommendation P.16 must be considered, and it is doubtful whether increasing the sensitivity beyond an RLR of -12 dB can be recommended.

Increasing the receiving sensitivity also decreases the margins against the effects of speech-off noise on the connection, e.g. unwanted modulation products from PCM systems. The stability against singing will also be affected.

Having chosen the receiving response to be flat, the sending frequency response can be optimized to give the proper overall characteristic. Reference [3] suggests an optimization achieved by asking the listeners for the "preferred" response. The result is shown in Figure 3. Reference [4] suggests a 2 to 3 dB increase per octave with increasing frequency. This result was obtained in tests regarding "naturalness". Reference [2] suggests a steeper curve (Figure 4) as a result of an optimization where maximum loudness, minimum listening effort and lowest output level are combined. The number of degrees of freedom used by [2] is of course smaller than in [3] and [4]. Here opinions may differ concerning which assumptions must be included in the optimization. If the signal-to-noise ratio is a problem, some decibels could be gained (without overloading) in the way shown by [2]. If there are no signal-to-noise ratio problems, an optimization for best naturalness as in [3] and [4] can be used. Thus, the result will depend on the assumptions.
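As a rough illustration of the 2 to 3 dB per octave rise suggested in [4], the following Python sketch computes the gain of a sending response with a constant slope, relative to a reference frequency. The function name and the choice of 300 Hz as the reference point are assumptions made for illustration only; they are not taken from the references.

```python
import math

def slope_gain_db(freq_hz, ref_hz=300.0, db_per_octave=2.5):
    """Gain in dB at freq_hz, relative to ref_hz, for a constant dB/octave slope."""
    octaves = math.log2(freq_hz / ref_hz)  # number of octaves above the reference
    return db_per_octave * octaves

# Rise across the 300 Hz to 3400 Hz band for slopes of 2 and 3 dB per octave.
for slope in (2.0, 3.0):
    rise = slope_gain_db(3400.0, db_per_octave=slope)
    print(f"{slope:.0f} dB/octave: {rise:.1f} dB at 3400 Hz relative to 300 Hz")
```

Since 3400 Hz lies about 3.5 octaves above 300 Hz, such a slope corresponds to a rise of roughly 7 to 10.5 dB across the band.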

Different opinions may also exist about the local cable length for which the frequency response should be optimized and whether the high-frequency loss on long lines should be compensated. Reference [2] suggests optimization for the mean local line, which will be optimum for the largest number of subscribers (because of the statistical distribution of cable lengths).

Figure 3 Sup.10, p.3

Figure 4 Sup.10, p.4

With a flat receiving frequency response, the curves according to Figure 4 and [4] give an overall characteristic close to what is obtained by the diffraction effect in free-field listening. However, this is probably not the whole explanation for the preferred curves. Even if the receiving responses were flat under sealed measuring conditions, hardly anyone keeps the earphone tight to the ear during conversation. Therefore, the actual responses during conversation probably give some additional low-frequency cut-off that certainly has an influence on the results (see also reference [5]).

When we want to choose the sending sensitivity we have one degree of freedom less than at the receiving end. We must consider both the probability of crosstalk and the probability of overloading other parts of the telephone system. Actual output levels from the telephone must be considered. As shown in [6], different output levels for the same SRE-value have been found in different countries. However, the different results show one important feature in common: output levels during normal conversation are generally lower than during reference equivalent measurements. Hopefully there will be better agreement on this point in the future if the measuring distance defined in Recommendation P.76, Annex A is used for loudness rating measurements.

A possibility to increase the sending sensitivity on long lines exists if we use sending regulation dependent on line length. The probability for overloading and the probability for far end crosstalk will not increase if the mean power is kept to the same value as today. See also [2]. The probability of near end crosstalk in the local cable will of course increase and has to be considered.

If regulation is introduced both at sending and receiving, more subscribers may experience an overall loudness rating close to a preferred optimum, i.e. fewer calls will be rated poor and unsatisfactory. Another reason to introduce regulation is to obtain a better sidetone performance on short and long lines at the same time.

Some considerations concerning this topic are as follows:

-- a conjugate match with the line maximizes the power transferred but creates sidetone problems on short lines and also stability/echo problems on long-distance calls;

-- an image match to the line reduces the range of impedance presented to the exchange and eases the sidetone problem except for short subscriber-lines connected to resistive junction plant (e.g. PCM circuits);

-- an impedance approximating the reference resistance (e.g. 600 ohms) eases standardization problems particularly in respect of alternative uses of the local line for non-speech services, but the optimum in respect of sidetone cannot be attained over the whole range of local line lengths.

References [2], [7] and [11] touch upon this subject.

The degree of sidetone suppression is governed by the following parameters:

-- microphone sensitivity;

-- earphone sensitivity;

-- sidetone balancing arrangement within the telephone instrument circuit;

-- the impedance of the line to which the telephone is connected.

The microphone and earphone sensitivities and the instrument circuit are in part controlled by the required sending and receiving sensitivities. The impedance of the line to which the telephone is connected is not usually within the control of the telephone instrument designer. The only parameter freely available to the telephone designer to control the sidetone level is Z_SO, the sidetone balance impedance [7], [8], i.e. the impedance which, when connected to the telephone, completely suppresses sidetone (see also ref. [12]). If a transformer hybrid is used in the telephone, then the internal balance network impedance is equal to the sidetone balance impedance Z_SO modified by the turns ratio of the transformer. However, the concept of Z_SO is not affected if the circuit uses any other form of balancing arrangement instead of a transformer.

The design of new handset telephones to be introduced into the telephone network must take account of the need to give satisfactory transmission on connections to existing local telephone circuits either directly or via the long-distance network. Reference [7] contains information touching upon this aspect.

Reference [10] is an example of a specification used in North America. Guidance on desirable sending and receiving levels is given, as well as the characteristics that are minimally acceptable for connection to the public switched network. It should be noted that this specification uses IEEE terminology, which is different from that found in CCITT Recommendations.

References

[1] CCITT Recommendation Description of the ARAEN, Green Book, Vol. V, Rec. P.41, Fig. 4, ITU, Geneva, 1972.

[2] CCITT Contribution COM XII-No. 32 (U.K. Post Office), Study Period 1973-1976.

[3] CCITT Contribution COM XII-No. 22 (Australia), Study Period 1973-1976.

[4] GLEISS (N.): Sound transmission quality, Tele, No. 1, 1972, pp. 44-53.

[5] CCITT Contribution COM XII-No. 229 (Sweden), Study Period 1985-1988.

[6] CCITT Recommendation Subjective effects of direct crosstalk; thresholds of audibility and intelligibility, Yellow Book, Vol. V, Rec. P.16, ITU, Geneva, 1981.

[7] CCITT Manual Transmission planning of switched telephone networks, Chapter V, Annex 1, ITU, Geneva, 1976.

[8] RICHARDS (D.): Telecommunications by speech, Chapter 5, Butterworths, London, 1973.

[9] CCITT Contribution COM XII-No. 105 (LME), Study Period 1973-1976.

[10] EIA Specification RS 470.

[11] CCITT Contribution COM XII-No. 144 (British Telecom), Study Period 1981-1984.

[12] CCITT Handbook on Telephonometry, Geneva, 1987.

SOME EFFECTS OF SIDETONE

(Malaga-Torremolinos, 1984; amended Melbourne, 1988)

(referred to in Recommendations P.11 and P.79)

Over a number of years sidetone has been studied in CCITT Study Group XII under Question 9/XII. Some important conclusions have been reached from the point of view of the subscriber in his role as both talker and listener. These conclusions relate to the effect of sidetone on a subscriber as he hears his own voice, the way his talking level changes as a result, and some effects of sidetone when the subscriber is listening in conditions of moderate to high-level room noise. These effects are summarized in Figures 1 and 3.

Figure 1 shows that there is a preferred range for sidetone when the subscriber is talking under quiet conditions, and that the difference between the sidetone being objectionable or too quiet is of the order of 20 dB. (These results were obtained from talking-only tests and need to be confirmed by conversation tests.) The preferred range lies between 7 and 12 dB, STMR (sidetone masking rating -- Recommendation P.76) [1], [5].

Figure 1 Sup.11, p.

The acceptable range is wider and lies between an STMR of 1 dB and 17 dB (although it must be stated that increasing STMR to a value greater than 17 dB is likely to affect only the talking level, and that only marginally). This range corresponds to the difference between the two curves at the 50% appraisals level. It is not proposed that the 17 dB figure should in any way be considered a maximum value. However, for an STMR above 20 dB, the connection sounds "dead".

For telephone connections where the OLR is in the preferred range, the STMR values may similarly be positioned in the preferred STMR range given above. However, on high loss connections the STMR value should be close to, or even exceed 12 dB. On low loss connections the STMR value may be sometimes permitted to become less than 7 dB, but only rarely should it become as low as 1 dB, e.g. telephone sets with receive volume control. Recommendation G.121 interprets those results for transmission planning purposes.

Figure 2 shows the way in which the talking level changes with sidetone level [1], [2], [3], [4]. These results were obtained by means of conversation tests [6], for a connection close to the preferred overall loss. The speech voltage will also be a function of room noise for the same connection conditions.

Figure 2 Sup.11, p.

3 Listener sidetone

High room noise in the subscriber's environment disturbs the received speech in two ways:

i) noise being picked up by the handset microphone and transmitted to the handset receiver via the electric sidetone path,

ii) noise leaking past the earcap at the handset receiver.

Studies have shown that at low frequencies the earcap leakage path dominates over the electric sidetone path in much the same way as the human sidetone signal does in talker sidetone. The weightings applied in the STMR loudness calculation are therefore applicable and the listener sidetone rating (LSTR, Recommendation P.76) has been developed, which makes use of the room noise sidetone sensitivity (see Recommendation P.64, § 9) in the STMR rating method (Recommendation P.79).

Results of subjective tests from two Administrations [7], [8] (using in this case a mean opinion scale of 0-10) are given in Figure 3. In each case the LSTR was derived by making use of D_Sm (see Recommendations P.10, P.64, P.79 and the Handbook on Telephonometry, § 3.3.17c), either to convert the sidetone sensitivities S_meST to S_RNST before calculating LSTR (Australian results) or applied as a weighted correction to STMR (Swedish results) as described in Recommendation G.111, § A.4.3.3. Room noise levels were comparable at 55-59 dBA.

Based upon these results Recommendation G.121 recommends that a value of 13 dB LSTR should be striven for.

The value 13 dB is based on a 10 dB LSTR (which may be considered a minimum value), where no further improvement in mean opinion score was possible by increasing LSTR (Figure 3), plus an allowance of 3 dB reflecting the fact that room noise in some office locations can exceed the values used in these experiments. Other tests (Sweden) have also suggested that a higher figure might be more appropriate.

The value that is satisfactory in a given telephone connection will depend on such factors as the level of room noise, the OLR of the connection, the talking levels used, etc. This is still under study in Question 9/XII.

Figure 3 Sup.11, p.

References

[1] CCITT Contribution COM XII-No. 50, Study Period 1977-1980 (ITT).

[2] CCITT Contribution COM XII-No. 171, Study Period 1977-1980.

[3] CCITT Contribution COM XII-No. 199, Study Period 1977-1980 (Australia).

[4] CCITT Contribution COM XII-No. 116, Study Period 1977-1980 (Hungary).

[5] CCITT Contribution COM XII-No. 152, Study Period 1981-1984 (NTT).

[6] Results of conversation tests sent directly to Special Rapporteur for Question 9/XII, British Telecom, 1978.

[7] CCITT Contribution COM XII-No. 151, Study Period 1981-1984 (Australia).

[8] CCITT Contribution COM XII-No. 70, Study Period 1985-1988 (Sweden).

NOISE SPECTRA

(Malaga-Torremolinos, 1984)

(quoted in Recommendations P.44 and P.45 (Orange Book, Volume V)

and Question 24/XII)

(Contribution from British Telecom)

This Supplement gives the descriptions of noise spectra used in the evaluation of telephony transmission performance that are recommended by the CCITT or have been employed in studying questions assigned to Study Group XII.

Controlled environmental noise is used in subjective evaluations such as:

a) AEN determinations as described in Recommendations P.44 [1] and P.45 [2];

b) conversation and listening experiments as described, for example, in Supplement No. 2 [3].

Spectra for two different environments are described, one for room noise and two for internal vehicle noise.

The room noise should have a power density spectrum corresponding to that published by Hoth [4]. Table 1 gives the spectrum density adjusted in level to produce a reading of 50 dBA on a sound level meter conforming to IEC Recommendation Publication 179 [5]. This is reproduced in Figure 1. This spectrum is independent of level, i.e. for 40 dBA the level in each band will be 10 dB less than that shown in Table 1. Additional information on the power in each 1/3rd octave band is also given in Table 1.
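The relation between the band levels of Table 1 and the 50 dBA reading, and the level independence of the spectrum shape, can be sketched in Python as follows. The band powers are those of Table 1; the A-weighting corrections are the nominal values of the standard A-weighting curve and are an assumption brought in for illustration, since they are not listed in this Supplement. The function names are likewise hypothetical.

```python
import math

# (centre frequency in Hz, 1/3rd octave band power in dB SPL from Table 1,
#  nominal A-weighting correction in dB; the corrections are an assumption
#  taken from the standard A-weighting curve, not from this Supplement)
HOTH_50_DBA = [
    (100, 45.9, -19.1), (125, 45.5, -16.1), (160, 44.9, -13.4),
    (200, 44.1, -10.9), (250, 43.6, -8.6), (315, 43.1, -6.6),
    (400, 42.3, -4.8), (500, 41.7, -3.2), (630, 41.2, -1.9),
    (800, 40.4, -0.8), (1000, 39.7, 0.0), (1250, 39.3, 0.6),
    (1600, 38.7, 1.0), (2000, 37.8, 1.2), (2500, 37.2, 1.3),
    (3150, 36.5, 1.2), (4000, 34.8, 1.0), (5000, 33.2, 0.5),
    (6300, 30.4, -0.1), (8000, 26.0, -1.1),
]

def a_weighted_level(bands):
    """Overall A-weighted level: power sum of the weighted band levels."""
    power = sum(10.0 ** ((level + weight) / 10.0) for _, level, weight in bands)
    return 10.0 * math.log10(power)

def shift_spectrum(bands, target_dba):
    """Shift every band by the same number of dB so the total reads target_dba."""
    offset = target_dba - a_weighted_level(bands)
    return [(freq, level + offset, weight) for freq, level, weight in bands]

print(f"Table 1 spectrum: {a_weighted_level(HOTH_50_DBA):.1f} dBA")
quieter = shift_spectrum(HOTH_50_DBA, 40.0)
print(f"100 Hz band at 40 dBA: {quieter[0][1]:.1f} dB SPL")
```

Under these assumptions the power sum of the Table 1 bands comes out close to 50 dBA, and the 40 dBA spectrum is obtained by lowering every band by about 10 dB.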

Two spectra representing internal vehicle noise [6], [7] have been recommended for use in the study of Question 24/XII [8] for evaluating mobile radio systems. They are adequately represented by simplified curves [9]: one spectrum for moving vehicles and the other for stationary vehicles. Table 2 gives the spectrum densities together with additional information on the power in each 1/3rd octave band. The spectrum density for moving vehicles is shown in Figure 2a) and for stationary vehicles in Figure 2b). These spectra are independent of level.

Table 3 gives the computed values of the unweighted sound pressure levels for various speeds calculated over the ISO 1/3rd octave frequency bands centred on 63 Hz to 8000 Hz.

TABLE 1

Room noise spectrum

Frequency   Spectrum density   Bandwidth          Total power in each        Tolerance
(Hz)        (dB SPL/Hz)        10 log10 Δf (dB)   1/3rd octave band          (dB)
                                                  (dB SPL)

 100        32.4               13.5               45.9
 125        30.9               14.7               45.5
 160        29.1               15.7               44.9
 200        27.6               16.5               44.1
 250        26.0               17.6               43.6
 315        24.4               18.7               43.1
 400        22.7               19.7               42.3
 500        21.1               20.6               41.7
 630        19.5               21.7               41.2
 800        17.8               22.7               40.4
1000        16.2               23.5               39.7
1250        14.6               24.7               39.3
1600        12.9               25.7               38.7
2000        11.3               26.5               37.8
2500         9.6               27.6               37.2
3150         7.8               28.7               36.5
4000         5.4               29.7               34.8
5000         2.6               30.6               33.2
6300        -1.3               31.7               30.4
8000        -6.6               32.7               26.0

Note 1 -- The electrical input signal, e.g. white noise, shall be band-limited to the 1/3rd octave bands centred on the ISO preferred frequencies (ISO 266) between 100 Hz and 8000 Hz with the band edges conforming to the filters described in IEC 225.

Note 2 -- The acoustical room noise is difficult to control at low frequencies, especially in the unspecified region below 100 Hz because of the dimensions of typical test cabinets, poor attenuation of such cabinets and the influence of extraneous noises, e.g. air-conditioning plant. It is therefore desirable to select a test cabinet that keeps these unwanted low frequency sound pressure levels to a minimum.

Figure 1 Sup.13, p.9

Figure 2 Sup.13, p.10

TABLE 2

Internal vehicle noise spectra

            Spectrum density       Bandwidth       Total power in each
Frequency   (dB SPL/Hz)            10 log10 Δf     1/3 octave band (dB SPL)   Tolerance
(Hz)        Moving   Stationary    (dB)            Moving   Stationary        (dB)

  63        72.3     58.3          11.7            84.0     70.0
  80        69.3     55.0          12.7            82.0     66.7
 100        66.5     49.8          13.5            80.0     63.3
 125        63.3     45.1          14.7            78.0     60.0
 160        60.3     42.0          15.7            76.0     56.7
 200        57.5     36.8          16.5            74.0     53.3
 250        54.4     34.7          17.6            72.0     52.3
 315        51.3     32.6          18.7            70.0     51.3
 400        48.3     30.6          19.7            68.0     50.3
 500        45.4     28.7          20.6            66.0     49.3
 630        42.3     26.6          21.7            64.0     48.3
 800        39.3     24.6          22.7            62.0     47.3
1000        36.5     22.8          23.5            60.0     46.3
1250        33.3     20.6          24.7            58.0     45.3
1600        30.3     18.6          25.7            56.0     44.3
2000        27.5     16.8          26.5            54.0     43.3
2500        24.4     14.7          27.6            52.0     42.3
3150        21.3     12.6          28.7            50.0     41.3
4000        18.3     10.6          29.7            48.0     40.3
5000        15.4      8.7          30.6            46.0     39.3
6300        12.3      6.6          31.7            44.0     38.3
8000         9.3      4.6          32.7            42.0     37.3

Table 3 [T3.13], p.12: Stationary, 75 dB SPL

References

[1] CCITT Recommendation Description and adjustment of the reference system for the determination of AEN (SRAEN) , Yellow Book, Vol. V, Rec. P.44, ITU, Geneva, 1981.

[2] CCITT Recommendation Measurement of the AEN value of a commercial telephone system (sending and receiving) by comparison with the SRAEN , Yellow Book, Vol. V, Rec. P.45, ITU, Geneva, 1981.

[3] Methods used for assessing telephony transmission performance, Supplement No. 2, Yellow Book, Vol. V, ITU, Geneva, 1981.

[4] HOTH (D.): Room noise spectra at subscribers' telephone locations, J.A.S.A., Vol. 12, pp. 499-504, April 1941.

[5] IEC Recommendation Publication 179, Precision sound level meters, 1965.

[6] CCITT Question 24/XII, Contribution COM XII-No. 120, (Noise inside light motor vehicles), Study Period 1981-1984.

[7] CCITT Question 24/XII, Contribution COM XII-No. 134, (Internal vehicle noise spectra), Study Period 1981-1984.

[8] CCITT Question 24/XII, Contribution COM XII-No. 1, (Link with mobile stations), Study Period 1981-1984.

[9] CCITT Contribution COM XII-No. 208, (Comparison of the results of vehicle noise submitted by France and BT), Study Period 1981-1984.

SUBJECTIVE PERFORMANCE ASSESSMENT OF DIGITAL

PROCESSES USING THE MODULATED NOISE REFERENCE UNIT (MNRU)

(Malaga-Torremolinos, 1984; amended Melbourne, 1988)

(quoted in Recommendation P.81)

The primary purpose of this Supplement is to define a specific subjective testing method for evaluating digital processes in a manner such that the quantization distortion effects of these processes on transmission performance can be taken into account in the evolving international telephone network. This implies both the ability to uniquely assign a numerical contribution to each digital process and the ability to use this assigned contribution in conjunction with other impairments to estimate telephone connection performance.

Secondary purposes of the Supplement are to suggest ways in which the subjective test results can be treated to arrive at the assigned impairment level of a particular digital process and how this assigned impairment level can be used in transmission performance analysis.

Two reference scales that have been used for performance assessment of digital processes are a) continuous random noise (additive noise) and b) random noise with amplitude proportional to the instantaneous signal amplitude (multiplicative noise). Random noise with amplitude proportional to the instantaneous signal amplitude in terms of the Q ratio, according to the MNRU as specified in Recommendation P.81, should be used.

The reasons for this proposal are:

1) The signal processed through the MNRU is perceptually very similar in character to the signal processed through various digital processes, thus resulting, in principle, in easier assessment by test subjects, and

2) Considerable experience and information have been accumulated with the MNRU.

Note -- It has not been documented that Q represents a more suitable reference scale than continuous random noise.
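The idea of multiplicative noise referred to above can be sketched in Python as follows. The sketch assumes the commonly quoted form y = x (1 + 10^(-Q/20) n), where n is Gaussian noise of unit variance; it omits the band-limiting filters of a real MNRU, and the function names, the 8000 Hz sampling rate and the 1000 Hz test tone are hypothetical choices made only for illustration.

```python
import math
import random

def mnru(samples, q_db, rng):
    """Add noise whose amplitude follows the instantaneous signal amplitude."""
    noise_gain = 10.0 ** (-q_db / 20.0)
    return [x * (1.0 + noise_gain * rng.gauss(0.0, 1.0)) for x in samples]

def measured_q_db(clean, processed):
    """Ratio of signal power to added-noise power, in dB."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, processed))
    return 10.0 * math.log10(signal_power / noise_power)

rng = random.Random(1)  # fixed seed so that the run is repeatable
# Hypothetical test signal: 1000 Hz tone sampled at 8000 Hz (assumed values).
tone = [math.sin(2.0 * math.pi * 1000.0 * n / 8000.0) for n in range(20000)]
degraded = mnru(tone, 20.0, rng)
print(f"requested Q = 20.0 dB, measured Q = {measured_q_db(tone, degraded):.1f} dB")
```

Because the noise is scaled by the signal itself, it vanishes during silent passages, which is what distinguishes this reference from continuous (additive) random noise.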

A number of methods are suitable for characterizing the performance of digital processes in terms of Q values. The methods dealt with in this Supplement comprise listening-only tests. They are summarized in Table 1.

Other possible methods that may be mentioned are:

1) multiple paired comparisons between all systems under test and all reference conditions: X1/X2 ... Xi/Xj ... Xj/Ri,

2) articulation test of MNRU conditions and digital systems in the same experiment.

These methods are not described here.

TABLE 1

Indirect comparison with MNRU
    ACR method:  X1, X2, ... Xn, Ri (SSR)
    DCR method:  X0/X1, X0/X1, ... X0/Xn, X0/Ri (PC)

Direct comparison with MNRU
    Equality, Threshold:  X1/Ri, X2/Ri, ... Xn/Ri (PC)

4.1 Background for the Absolute Category Rating (ACR) test method of Annex A

The method is based on a procedure utilized in an experiment conducted by a working group of the IEEE (Institute of Electrical and Electronics Engineers) in which representatives from seven countries participated (Canada, France, Italy, Japan, Norway, the United Kingdom and the United States) [2]. The aim of this experiment was to determine whether comparable results could be obtained when the same test is performed in several countries. Speech samples in the native languages of the participating countries were processed at a central location through 38 communications circuits. The recordings of the processed speech were returned to each country for evaluation on a five-point category rating scale by native listeners.

The communications circuits included 22 references (continuous random noise, MNRU, μ-255 PCM) and 16 adaptive differential PCM (ADPCM) systems. (The type of ADPCM system used was a first order fixed predictor [3].) An important part of the data analysis was estimation of the quality of the 16 ADPCM conditions at one location given measurements of ADPCM quality elsewhere.

Results (mean opinion scores [MOS]) obtained at the different locations differed [2]. Nonetheless, analysis of the results indicated that a reasonably accurate estimate of ADPCM quality in country B is the quality measured in country A adjusted by an additive constant.
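The additive adjustment described above can be sketched in Python as follows. The text does not state how the constant was estimated; taking it as the mean MOS difference over reference conditions heard in both countries is an assumption made here, and all scores and function names are hypothetical.

```python
from statistics import mean

def additive_offset(ref_scores_a, ref_scores_b):
    """Mean MOS difference (B minus A) over the shared reference conditions."""
    return mean([b - a for a, b in zip(ref_scores_a, ref_scores_b)])

def predict_for_b(scores_a, offset):
    """Estimate country B scores from country A scores plus a constant."""
    return [score + offset for score in scores_a]

# Hypothetical MOS values for three reference conditions heard in both countries.
reference_a = [2.0, 3.0, 4.0]
reference_b = [2.3, 3.4, 4.2]
# Hypothetical MOS values for two ADPCM conditions measured in country A only.
adpcm_a = [3.1, 3.6]

offset = additive_offset(reference_a, reference_b)
estimates = predict_for_b(adpcm_a, offset)
print(f"offset = {offset:.2f}, estimated country B scores = "
      + ", ".join(f"{score:.2f}" for score in estimates))
```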

Changes in the methodology were discussed at an IEEE working group meeting in May 1982 in Paris. The methodology incorporating these changes was recommended by Study Group XII in June 1982 as a basis for evaluating candidate 32 kbit/s algorithms for CCITT standardization as Recommendation G.721 [1]. Subjective tests using the methodology were conducted under the auspices of Study Group XVIII in late 1982, with the result that a codec algorithm was selected and improvements (not related to telephone speech transmission issues) were identified. A second series of subjective tests in late 1983 confirmed that the telephone speech transmission performance of the improved algorithm was suitable. (Differences between test results from the different participating organizations were also found in the 1982 and 1983 CCITT tests.)

The preceding discussion should not be taken to indicate that the subjective testing methodology is completely satisfactory: the reasons for differences found between countries [2] and [4] are thus far not explained. Nevertheless the testing methodology has the important feature of having been used by several countries.

4.2 Background for the Degradation Category Rating (DCR) test method of Annex B

A modification of the ACR test method, the DCR test method, is described in Annex B. Based on results from one Administration, the DCR test method provides a greater discrimination between conditions than does the ACR test method [5].

Results from an experiment conducted by another Administration do not support this conclusion [6].

TABLE 3

Computed sound pressure levels of spectra

Spectra             Sound pressure level, unweighted (dB SPL)

Moving 30 km/h      80
Moving 80 km/h      85
Moving 110 km/h     90
Stationary          75

The purpose of conducting tests of digital processes is to determine their suitability for use in telephone networks. A procedure which has been used is to assign Q values, determined using the reference system of Recommendation P.81, to processes of interest. Various methods of data analysis are possible, but it appears desirable to define a single method to ensure that results are expressed in common terms. The provisional method is based on the use of MOS (mean opinion score) values obtained using the procedures of Annex A.

Hypothetical results obtained from a subjective test conducted according to the methodology of Annex A to this Supplement are shown in Figures 1 to 4. (Straight lines are used simply to connect data points.) Generally such results will display a saturation effect at and near the very good conditions (high MOS) and the very bad conditions (low MOS). (For high MOS, the saturation is caused by the 5-point scale and possibly by the idle circuit noise of the subjective test system without added impairments, e.g. idle circuit noise, and codec quantization distortion. For low MOS, the saturation is caused by the 5-point scale.) Experience [2] has shown that due to this saturation effect, acceptable accuracy for the determination of Q is obtained for the range of about 5 dB to 25 dB.

An objective of this analysis is to determine a function Q = F(L), where Q is the Q value for the codec and L is the line bit rate. One simple method for determining this function uses the MOS values shown in Figures 2 and 3 and can produce a graph of this function as shown in Figure 5. The method is shown in Figure 6, wherein a value of line bit rate is chosen, say L2, and its corresponding MOS value is determined. This MOS value is then used to enter the right-hand graph to find the value of Q, in this case Q2, corresponding to this MOS value. Q values for all the other L values are obtained in a similar way and the resulting set of (Li, Qi) pairs is plotted as in Figure 5.
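The graphical procedure of Figure 6 amounts to two interpolation steps and may be sketched in Python as follows. All numerical values are hypothetical and serve only to illustrate the procedure. Piecewise-linear interpolation between data points is an assumption, chosen to match the straight lines used to connect data points in the figures.

```python
def interpolate(x, xs, ys):
    """Piecewise-linear interpolation of y at x; xs must be strictly increasing."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("value outside the measured range")
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical MOS versus Q for the reference (MNRU) conditions.
q_ref = [5.0, 10.0, 15.0, 20.0, 25.0]      # dB
mos_ref = [1.2, 1.8, 2.6, 3.4, 4.0]

# Hypothetical MOS versus line bit rate for the codec under test.
bit_rates = [16.0, 24.0, 32.0, 40.0]       # kbit/s
mos_codec = [1.6, 2.6, 3.3, 3.8]


def q_for_bit_rate(rate):
    """Return Q = F(L): find the MOS at bit rate L, then the Q giving that MOS."""
    mos = interpolate(rate, bit_rates, mos_codec)
    # MOS rises monotonically with Q here, so the reference curve can be read in reverse.
    return interpolate(mos, mos_ref, q_ref)


pairs = [(rate, round(q_for_bit_rate(rate), 2)) for rate in bit_rates]
print(pairs)
```

The reverse reading of the reference curve is only meaningful where MOS increases strictly with Q, which is one reason for restricting the analysis to the range of about 5 dB to 25 dB where saturation is limited.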

Analysis of test results should include statistical analysis to establish that MOS values obtained are due to the test conditions and not to other factors. Student's t-test may be suitable, but there is some indication that analysis of variance is more appropriate.

Volume V -- Suppl. No. 14 19

The principles of a method of analysis used by one organization are outlined in Annex D of this Supplement. The method uses analytic values, called fit means, calculated from subjective test results; these analytic values are similar to MOS values calculated from test results. One desirable result of the test is estimates of the Q of the processes tested. Annex D contains a method for deriving such estimates.

Values of MOS versus Q (as per Figure 2) obtained from actual experiments are given in References [5], [7] and [18] and in Annex B.

Figures 1 to 4 (Supplement No. 14)

Figures 5 and 6 (Supplement No. 14)

ANNEX A

(to Supplement No. 14)

Absolute category rating (ACR)

method for subjective testing of digital processes

A.1 Introduction

The listening-only test method consists in principle of three parts: preparation of source tapes; processing of the source tapes to obtain stimulus tapes containing the test conditions of interest; conduct of subjective tests using the stimulus tapes. Certain steps may be combined if interchange of source/stimulus tapes between locations is not involved.

The methodology is based on the notion of simulating a connection comprising a sending system, a receiving system and an interconnection system which provides for inserting the impairment of interest (idle channel noise and quantization distortion from the MNRU and from digital processes).

Listener responses in the subjective tests are influenced by a number of sources of variation, e.g. speech material, talker voice characteristics, presentation orders, time effects, etc. Unless controlled in some way, these variables may bias the outcome of the experiment. It is therefore recommended that appropriate experimental design be applied to take this into account. Principles for experimental design may be found in textbooks on statistics.

A.2 Preparation of source tape(s)

The recording system consists of a tape recorder, means for injecting calibration tones and a suitably defined sending system.

A.2.1 Tape recorder

The tape recorder should be a high (studio) quality two-track machine. The type of equalization should be stated, but IEC is preferred. One of the tracks is used for recording the speech samples; the second channel is available for other purposes, e.g. cueing tones to allow computer start/stop control of the tape recorder. The tape recorder should be operated at 19 cm/s.

Low print-through, low-noise tape should be used and the tape should be stored ``tail-out'' so that it is necessary to rewind the tape before it is played.

Note -- The use of an A/D converter and a television cassette recorder should be considered as a means for recording and storing high quality source and test tapes.

A.2.2 Calibration tones

It is recommended that calibration tones be recorded on the source tape(s) to enable checking the sensitivity/frequency characteristics of the connection simulation from input to the source tape recorder to output from the stimulus presentation tape recorder. Tones should be recorded in sequence at 250, 500, 1000, 2000, 3000 and 4000 Hz of 5 seconds duration each, with a level 6 dB below the maximum r.m.s. input level of the tape recorder. These tones should be followed by a 15 second recording of a 1 kHz test tone at maximum r.m.s. input level to enable calibration of the interconnection and listening systems. This should be followed by several metres of leader tape.
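As an illustration, the tone sequence described above can be written down as data, with the levels converted to linear amplitude ratios. This is only a sketch: the frequencies, durations and levels are those of the text, while the absence of gaps between tones and the names used are assumptions.

```python
# Frequencies, durations and levels as stated in A.2.2; levels are in dB
# relative to the maximum r.m.s. input level of the tape recorder.
SWEEP_FREQUENCIES_HZ = [250, 500, 1000, 2000, 3000, 4000]
SWEEP_DURATION_S = 5
SWEEP_LEVEL_DB = -6
REFERENCE_TONE = {"frequency_hz": 1000, "duration_s": 15, "level_db": 0}


def calibration_sequence():
    """Return the tone sequence as a list of dicts, in recording order."""
    tones = [
        {"frequency_hz": f, "duration_s": SWEEP_DURATION_S, "level_db": SWEEP_LEVEL_DB}
        for f in SWEEP_FREQUENCIES_HZ
    ]
    tones.append(dict(REFERENCE_TONE))
    return tones


def relative_amplitude(level_db):
    """Convert a level in dB to a linear amplitude ratio."""
    return 10 ** (level_db / 20)


sequence = calibration_sequence()
# Total duration assumes the tones follow each other without gaps.
total_duration = sum(tone["duration_s"] for tone in sequence)
for tone in sequence:
    print(tone["frequency_hz"], tone["duration_s"],
          round(relative_amplitude(tone["level_db"]), 3))
print("total duration (s):", total_duration)
```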

A.2.3 Sending system

The sensitivity/frequency characteristics of sending systems of different countries are likely to differ and, thus, results of different countries may differ because of attenuation distortion. Furthermore, the performance of complex digital codec algorithms may be dependent on the shape of the sending system sensitivity/frequency characteristics. Therefore, it is desirable that at least for some of the conditions in a test the sending system characteristic be as given in Table A-1 (simulates the IRS send part without filter).


TABLE A-1

IRS characteristics before adding SRAEN filter

Frequency (Hz)    S_MJ (dB V/Pa)    S_Je (dB Pa/V)
 100              -22.00            -21.00
 125              -18.00            -17.00
 160              -14.00            -13.00
 200              -10.00             -9.00
 250               -6.80             -5.70
 315               -4.60             -2.90
 400               -3.30             -1.30
 500               -2.60             -0.60
 630               -2.20             -0.10
 800               -1.20             +0.00
1000               +0.00             +0.00
1250               +1.20             +0.20
1600               +2.80             +0.40
2000               +3.20             +0.40
2500               +4.00             -0.30
3150               +4.30             -0.50
4000               +0.00            -11.00
5000               -6.00            -23.00
6300              -12.00            -35.00
8000              -18.00            -53.00

It may be desirable to include conditions for which the sending system represents a typical (average) local system according to the testing organization's (country's) network and/or needs. This system comprises a handset telephone set, a simulated physical cable pair, a feeding bridge and a resistive termination (e.g. 600 ohms, 900 ohms) to which the source tape recorder is connected. The telephone set can utilize a linear telephone microphone with a real voice sensitivity/frequency characteristic such that the acoustic-to-electric response of the sending system represents the organization's average local system. It may also be desirable to include conditions obtained with a carbon telephone microphone representative of the type(s) used in the organization's (country's) network. (See Recommendation P.64.) The characteristics (and feeding current) should be reported. It may also be desirable to report the characteristic measured using an artificial sound source. (See Recommendations P.51 and P.64.)

A.2.4 Recording environment

The recording environment should be that of a quiet living room or office. The ambient room noise level should be 25-30 dBA. The noise spectrum should, if possible, have the shape of the Hoth spectrum of Supplement No. 13. Special tests may be required using other noise levels and/or spectral characteristics (e.g. typewriter noise, etc.).

The room noise characteristic should be reported in as complete a form as is possible [e.g. dBA, long-term spectrum, amplitude/time distribution, etc.].

A 30 second recording of the room noise through the local system should follow the calibration tones. This should be accomplished with a talker holding the telephone handset in a normal use manner. (Special precautions may be necessary in order to avoid breath sounds if desired.)


A.2.5 Speech samples

A source tape is made of 4 × C samples (4 talkers, C samples each, consisting of training, reference and test conditions). Each sample should comprise 2 or 3 sentences separated by at least 1 second.

All samples should be different to avoid repetition of sentences during a test. When reporting test results, it may be desirable to provide a list of the sentences used (i.e. 8 × C or 12 × C sentences).

Each sample is expected to be 6-10 seconds in length. The samples should be separated by 5 seconds of silence to allow for control (e.g. turning the tape recorder on and off) and for the time needed for subjects to vote.

The r.m.s. level of the speech samples (speech power while active) should be 12 dB below the r.m.s. level of the 1 kHz calibration tone in order to avoid peak clipping of the speech samples by the tape recorder and to measure in an easy way the actual r.m.s. level of the speech.

A.2.6 Talkers

At least 4 different talkers (2 female, 2 male) with different voice characteristics should be used. Selection of the talkers will depend on the judgement of the experimenter.

A.3 Preparation of stimulus tape(s)

The interconnection system will consist of the source tape recorder (resistive, 600 or 900 ohms), an input filter, a means for inserting test conditions, an output filter, and the stimulus tape recorder (resistive, 600 or 900 ohms). The characteristics of the filters should be provided.

A.3.1 Test conditions

The test conditions comprise the digital codec(s) of interest. The codec(s) should be defined as simply and completely as possible (e.g. A-law/μ-law, ADPCM with first order fixed predictor, etc.). This is to enable unique performance specification for codecs of the same type.

Because codecs may have different performances at different speech input levels, they should preferably be tested not only at a nominal fully-loaded condition, but also at levels below and above this level, say ±10 dB. These changes in input level to codecs should be ``off-set'' by corresponding adjustments of their outputs to maintain an approximately constant output level for the test. (Listening level may also affect relative performance of different digital processes. See also § A.4.4.)

The codec(s) should be tested singly (one encoding/decoding pair) and with 2, 4 and (possibly) 8 codecs connected in tandem asynchronously. (It may also be desirable to include conditions in tandem synchronously.) The codecs may be hardware or software implemented; if the latter, injected circuit noise expected for practical codecs should be included.

For the single codec(s) conditions, the line bit rate should be the design value and, if possible, line bit rates both above (to ensure subjective saturation) and below (to ensure degraded performance). These conditions may be useful in assigning a performance level(s) to the codec(s). (For example, a nominal 32 kbit/s ADPCM algorithm might also be tested at 16, 24, 40 and 48 kbit/s.)

The tandem conditions should utilize the codec(s) at the design line bit rate(s). Codec conditions with line errors should be included. Bit error rates covering the range 10⁻³ to 10⁻⁶ should be used.

A.3.2 Reference conditions

Reference conditions which should be included are Q values within the range 5 dB to 25 dB with a minimum of 4 steps. (It may also be desirable to include Q values of 0 dB and 30 dB.)

It is desirable to include injected circuit noise values to provide SNRs within the range 5 dB to 45 dB with a minimum of 4 steps. (SNR is the dB ratio of speech power in milliwatts while active to injected circuit noise in milliwatts; the circuit noise conditions should be band-limited by filters having the same characteristics as the filter of the MNRU.) Note that the 45 dB ratio could be dependent on the inherent system noise, e.g. noise from the source tape preparation process, noise from the source tape recorder, etc.
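The SNR definition above translates directly into the injected noise level required for each reference condition. The following Python sketch assumes a hypothetical active speech level of -20 dBm and a hypothetical choice of five SNR steps; neither value is taken from this Supplement.

```python
import math


def dbm_to_mw(level_dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (level_dbm / 10)


def snr_db(speech_mw, noise_mw):
    """SNR as defined in the text: dB ratio of active speech power to noise power."""
    return 10 * math.log10(speech_mw / noise_mw)


def noise_level_dbm(speech_level_dbm, target_snr_db):
    """Injected circuit noise level (dBm) giving the target SNR."""
    return speech_level_dbm - target_snr_db


ACTIVE_SPEECH_LEVEL_DBM = -20.0      # hypothetical value, for illustration only
target_snrs_db = [5, 15, 25, 35, 45]  # hypothetical steps within 5 dB to 45 dB

for target in target_snrs_db:
    noise = noise_level_dbm(ACTIVE_SPEECH_LEVEL_DBM, target)
    check = snr_db(dbm_to_mw(ACTIVE_SPEECH_LEVEL_DBM), dbm_to_mw(noise))
    print(target, noise, round(check, 2))
```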


Source conditions should also be included. (These are obtained by removing the injected idle circuit noise.)

The purpose of including the injected circuit noise conditions is to enable the relating of test results to results available on the effects of loss and circuit noise (Question 4/XII) and to allow use of the test results in subjective opinion model studies (Question 7/XII).

Other reference conditions can be included at the discretion of the testing organization. For example, particular organizations may have available information from previous tests of A-law/μ-law companded PCM, and it may be desirable to include some PCM conditions to allow comparison with previous results.

A.3.3 Calibration

The insertion loss of the interconnection circuit should be 0 dB at 1 kHz between the resistive source/termination. This should apply for the better conditions, e.g. Q = 25 dB, SNR = 45 dB and the test codec(s) operated at design line bit rate(s). The r.m.s. level of the 1 kHz calibration tone at the input to the interconnection circuit should be 3 dB below the codec(s) overload level (which should be quoted). This will ensure that the r.m.s. level of the speech samples will be 15 dB below the codec(s) r.m.s. sinewave overload level.
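The 15 dB figure follows from combining the 3 dB margin of this section with the 12 dB offset between calibration tone and speech given in A.2.5. A short Python sketch of this level plan is given below; the overload level used in the example is a hypothetical value.

```python
TONE_BELOW_OVERLOAD_DB = 3    # A.3.3: 1 kHz tone relative to codec overload level
SPEECH_BELOW_TONE_DB = 12     # A.2.5: active speech level relative to the 1 kHz tone


def level_plan(overload_level_db):
    """Return tone and speech r.m.s. levels for a given r.m.s. sinewave overload level."""
    tone = overload_level_db - TONE_BELOW_OVERLOAD_DB
    speech = tone - SPEECH_BELOW_TONE_DB
    return {"overload": overload_level_db, "tone": tone, "speech": speech}


plan = level_plan(3.0)   # hypothetical overload level, same unit as the results
print(plan)
print("speech below overload (dB):", plan["overload"] - plan["speech"])
```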

With the above calibration, the injected circuit noise levels in dBm across the output resistive termination should be adjusted to an appropriate level relative to the output 1 kHz calibration tone level in dBm. Note that in particular the circuit noise impairment should be present during the speech sample idle periods but not before and after the speech sample.

The stimulus tape recorder calibration should be the same as that for the source tape recorder.

A.3.4 Stimulus tape(s)

Stimulus tapes should begin with the 1 kHz calibration tone recorded (without introduced impairments), 12 practice conditions and then the test and reference conditions.

The practice conditions should be selected to introduce the test subjects to the test format and range of speech quality. These conditions should consist of each of the four talkers with 3 practice conditions.

The basic test and reference conditions will be 4 (i.e. number of talkers) times the number of nominal conditions. These conditions should appear in random order. There should be at least 2 stimulus tapes with different random orders. (These could be used in different tests with different subject groups.)
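A possible way of generating the random orders for the stimulus tapes is sketched below in Python. The condition labels are hypothetical, and simple shuffling is an assumption: it does not balance talker or order effects in the way a formal experimental design would.

```python
import math
import random


def stimulus_orders(talkers, conditions, n_tapes=2, seed=0):
    """Return n_tapes different random orders of all talker/condition pairs."""
    items = [(talker, condition) for talker in talkers for condition in conditions]
    if n_tapes > math.factorial(len(items)):
        raise ValueError("not enough distinct orders for the requested number of tapes")
    rng = random.Random(seed)  # fixed seed so that the tapes can be regenerated
    orders = []
    while len(orders) < n_tapes:
        order = items[:]
        rng.shuffle(order)
        if order not in orders:  # reject an order already used on another tape
            orders.append(order)
    return orders


talkers = ["F1", "F2", "M1", "M2"]                               # 2 female, 2 male
conditions = ["Q=10", "Q=20", "SNR=25", "codec at 32 kbit/s"]    # hypothetical labels
tape_1, tape_2 = stimulus_orders(talkers, conditions)
print(len(tape_1), tape_1[:3])
```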

It may also be desirable to include replication of at least some of the test/reference conditions. However, this may not be possible for a practical subjective test size.

The timing of conditions in the stimulus tapes is the same as that for the source tapes, e.g. approximately 6-10 seconds (2 or 3 sentences) with each condition separated by 5 seconds of silence.

The calibration tones on the source tape need not appear on the stimulus tape (except for 1 kHz calibration tone as noted above). However, the calibration tone levels should be measured at the interconnection system output resistive termination so that the system sensitivity/frequency characteristics can be measured and reported for all condition types.

A.4 Testing procedure

A.4.1 Listeners

The preferred number of listeners is 32, assigned equally to each tape. At least 12 test subjects should be used. It is desirable that the subjects be selected to represent the typical customer population (e.g. half of the group females and half males, ages approximating the population distribution of ages, normal hearing, etc.).


A.4.2 Listening system

For reasons given in the first paragraph of § A.2.3, the receiving system characteristic should be as given in Table A-1 (simulates the IRS receive part without filter).

It would be desirable if the listening system simulated the organization's typical (e.g. average) local system representing the central office source impedance, feeding bridge, physical cable pair and the handset telephone set. The electric-to-acoustic sensitivity/frequency characteristic of the listening system should be determined (see Recommendation P.64). Sidetone in the listening system should be suppressed.

A.4.3 Listening environment

The listening handset(s) should be located in a room with an ambient room noise level not exceeding 40 dBA, preferably 25-30 dBA (simulating a quiet office or living room). The noise spectrum should, if possible, have the shape of the Hoth spectrum of Supplement No. 13. The actual ambient room noise level and spectrum, if different from the above, should be reported.

A.4.4 Speech level

The 1 kHz calibration tone on the stimulus tape when played through the listening system should be adjusted such that reproduction occurs at a level of -3 dBPa as measured with the artificial ear recommended by the CCITT. (See Recommendation P.51.) This will result in a speech level of about -15 dBPa which is close to the preferred level. It may also be desirable to include conditions with a 10 dB lower level and 10 dB higher level since the listening level may affect the relative performance of different digital processes.
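For readers more used to sound pressure levels referred to 20 μPa, the levels above can be converted as in the following Python sketch. The 12 dB offset between tone and speech is that of A.2.5; the function names are illustrative only.

```python
import math

REFERENCE_PRESSURE_PA = 20e-6   # reference pressure for dB SPL
SPEECH_BELOW_TONE_DB = 12       # A.2.5: speech level relative to the 1 kHz tone


def dbpa_to_db_spl(level_dbpa):
    """Convert a level in dB relative to 1 Pa into dB SPL (relative to 20 uPa)."""
    return level_dbpa + 20 * math.log10(1.0 / REFERENCE_PRESSURE_PA)


tone_dbpa = -3.0
speech_dbpa = tone_dbpa - SPEECH_BELOW_TONE_DB
print("tone:", round(dbpa_to_db_spl(tone_dbpa), 1), "dB SPL")
print("speech:", round(dbpa_to_db_spl(speech_dbpa), 1), "dB SPL")
```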

A.4.5 Test instructions

Test subjects will be provided with a written set of instructions which will also be read to them (either by the test administrator or by means of a tape recording). The instructions should be given before the practice conditions. Subjects should not be instructed that the practice conditions represent the full range of quality to be encountered in the test. After the practice conditions, there should be sufficient time allowed for answering possible questions by the subjects.

The subjects should be instructed to rate the conditions according to the five point quality scale as follows:

Score   Quality rating
5       Excellent
4       Good
3       Fair
2       Poor
1       Bad

In countries for which English is not the native language, the appropriate terms in the native language should be used.

Before the listening test is conducted, it is necessary to carry out practice sessions to ensure full adaptation of listeners to the test conditions and obtain a stable evaluation.

There is some indication that a speech level of -5 dBPa (1 kHz tone level of +7 dBPa) would be more suitable than -15 dBPa for discrimination between coder conditions.


A.4.6 Data collection

Subjects' responses can be recorded by computer, on paper or by such other means as are appropriate. If paper and pencil are used, the response to each condition should be recorded on a separate card so that the subject is not looking at a previous opinion while making a new judgement.

A.5 Results reports

Reporting all of the raw data may be desirable but results in excessive documentation. Therefore, it may be appropriate to combine data across talkers and report the number of ratings in each of the 5 categories for each condition type, e.g. Q = 15 dB, SNR = 25 dB, etc. (Conclusions resulting from an analysis of the study of possible talker effects should be reported.) In addition, mean opinion scores (MOSs), standard deviation, 95 percent confidence intervals and other statistics computed by the organizations in analyzing the data should be reported.
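The statistics named above can be computed directly from the number of ratings in each of the 5 categories, as in the following Python sketch. The vote counts are hypothetical, and the use of the normal-approximation factor 1.96 for the 95 percent confidence interval is an assumption; this Supplement does not prescribe a particular formula.

```python
import math


def summarize(counts):
    """Return (n, MOS, standard deviation, 95% half-width) from category counts.

    counts maps each score (1 to 5) to the number of votes it received.
    """
    n = sum(counts.values())
    mean = sum(score * votes for score, votes in counts.items()) / n
    # Sample variance with n - 1 in the denominator.
    variance = sum(votes * (score - mean) ** 2 for score, votes in counts.items()) / (n - 1)
    sd = math.sqrt(variance)
    half_width = 1.96 * sd / math.sqrt(n)   # normal approximation (assumption)
    return n, mean, sd, half_width


# Hypothetical votes for one condition type, combined across talkers.
votes = {5: 20, 4: 52, 3: 40, 2: 12, 1: 4}
n, mos, sd, half_width = summarize(votes)
print(n, round(mos, 2), round(sd, 2), round(half_width, 2))
```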

Other items which should be reported are as follows:

a) microphone type;

b) sensitivity/frequency characteristic of the sending system (Recommendation P.64);

c) description of recording room and ambient noise levels;

d) measurement and adjustment procedure for speech levels;

e) sensitivity/frequency characteristics of the interconnection system for all test/reference condition types;

f) sensitivity/frequency characteristic of the listening system (Recommendation P.64);

g) description of the listening room and ambient noise level;

h) method of recording test subject opinions;

i) description of subject group including age, sex, population, prior experience and, if possible, audiometric threshold;

j) handset dimensions.

Bibliography for Annex A

KIRK (R. E.): Experimental design procedures for the behavioral sciences, Brooks/Cole Publishing Company, Belmont, California, 1968.

CCITT Recommendation P.64.

CCITT Recommendation P.74.

ANNEX B

(to Supplement No. 14)

Subjective performance assessment of digital encoders

using the degradation category rating procedure (DCR)

(Contribution of the French Administration)

B.1 Introduction

A listening-only test method has been drafted by CCITT SG XII to assess the subjective quality of digital encoders (see Annex A). This procedure, the Absolute Category Rating (ACR) test, leads to a low sensitivity in distinguishing among good telephone quality coders (within the range of quality of 6-8 bit PCM coders). If higher sensitivity is needed we propose to use a modified version of that procedure, which can be defined as a Degradation Category Rating (DCR) test. For image testing CCIR [6] recommends two alternative methods, absolute category ratings and degradation category ratings. The DCR procedure, which in particular uses an annoyance scale and a high quality reference before each judgement, seems to be suitable for evaluating good quality images. Therefore this method has been adapted to evaluate speech quality.

This Supplement first describes the adaptation of the DCR procedure to speech. Then the sensitivity of the method is compared with that of the ACR procedure on the same circuits. Only the differences between ACR and DCR procedure are presented here. One can refer to Annex A for common points which are not covered in this Annex.

B.2 Degradation category rating procedure (DCR)

B.2.1 Speech samples

Each configuration is evaluated by means of judgements upon four talkers reading two different samples. Each sample should comprise two sentences separated by at least one second. These two samples (S1, S2), hence four different sentences, should be selected from a wider corpus composed of phonetically balanced sentences so that the mean score obtained in evaluating MNRU circuits for these four sentences is about the same as that obtained for the wider corpus. Therefore the corpus consists of eight samples defined as follows:

talker T1 reading samples S1, S2

talker T2 reading samples S1, S2

talker T3 reading samples S1, S2

talker T4 reading samples S1, S2.

This results in a repetition of the two samples during the test. But we feel that this is not so critical for the procedure where a degradation is evaluated with regard to a reference, especially for good telephone quality where the intelligibility of speech is nearly perfect. The use of different samples for each configuration as is done in ACR experiments could be one of the reasons for this procedure's lack of sensitivity.

B.2.2 Reference conditions

Reference conditions should include multiplicative noise with Q values within the range of 10 to 30 dB with a minimum of four steps. (It may also be desirable to include Q values of 5 dB and 35 dB.)

A high quality reference should be chosen to be inserted before each judgement. Usually source conditions are used, i.e. samples with no more degradation than those introduced by sending systems and limitations of frequency bandwidth. Four ``null pairs'' (A-A) are included to check the quality of anchoring of the listeners' judgements.

B.2.3 Stimulus presentation

The stimuli are presented to listeners by pairs (A-B) or repeated pairs (A-B-A-B) where A is the high quality reference sample and B the same sample processed by a codec. The purpose of the reference sample is to anchor each judgement of the listeners. Using a reference and subjective judgements with respect to that reference is quite a common procedure in psychoacoustics. It tends to result in a good sensitivity for the overall evaluation by listeners. Samples A and B should be separated by 0.75 s and in a repeated pair procedure (A-B-A-B) the separation between the two pairs should be 2 s.

It seems that the classical order effect observed in a one-sample listening test (ACR for example) is not observed with the DCR procedure. Thus, only one random order of presentation can be used. Therefore the basic test and reference conditions will be eight times (four talkers × two samples) the number of nominal conditions.

The timing for the response of listeners is the same as for the ACR test, i.e. 5 s between each presentation (pair or repeated pairs).

B.2.4 Test instructions

The subjects should be instructed to rate the conditions according to the five point degradation category scale as follows:

5 -- Degradation is inaudible

4 -- Degradation is audible but not annoying


3 -- Degradation is slightly annoying

2 -- Degradation is annoying

1 -- Degradation is very annoying.


B.3 Comparison between the sensitivity of an ACR and a DCR procedure for the same coder configurations

Tables B-1 and B-2 summarize the results obtained with ACR test and DCR test respectively for the evaluation of three 32 kbit/s ADPCM algorithms.

Figures B-1, B-2 and B-3 show the mean opinion score (MOS) and degradation mean opinion score (DMOS) obtained for the same conditions with the two procedures (ACR and DCR respectively).

From these figures one can note:

- a good agreement between the results obtained with the two procedures;

- a larger spread of the DMOS obtained for MNRU circuits with Q values ranging from 10 dB to 35 dB, and a good anchoring of the judgements of listeners (``null pairs'' have obtained a score of 4.98);

- a higher sensitivity of the DCR procedure in the range of good telephone quality (20 < Q < 35 dB).

These sensitivities can be quantified by means of a statistical multiple comparison test. When an a posteriori comparison of codecs is needed, a Tukey [7] honestly significant difference (HSD) test can be applied effectively. The HSD test is designed to make all pairwise comparisons among the means and to determine the significance of the differences in the mean values. Under identical conditions (α = 0.01, k = 2, N = 225, fixed mode) the HSD limit value q(α, k, N) is 3.70 and, since the residual errors for the ACR and DCR procedures are about the same (0.42), two means can be declared as significantly different if:

$$D = |\bar{X}_i - \bar{X}_j| > 0.21$$
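The text does not state how the limit of 0.21 is derived. One reading that reproduces it is sketched below in Python, under the following assumptions: the usual Tukey formula (limit value multiplied by the square root of the residual mean square divided by the number of observations per mean), 0.42 taken as the residual mean square, and 128 votes per mean as stated in Note 2 to Tables B-1 and B-2.

```python
import math


def hsd_threshold(q_limit, residual_mean_square, votes_per_mean):
    """Smallest difference between two means declared significant by the HSD test."""
    return q_limit * math.sqrt(residual_mean_square / votes_per_mean)


Q_LIMIT = 3.70             # HSD limit value quoted in the text
RESIDUAL_ERROR = 0.42      # assumed here to be the residual mean square
VOTES_PER_MEAN = 128       # assumed from Note 2 to Tables B-1 and B-2

threshold = hsd_threshold(Q_LIMIT, RESIDUAL_ERROR, VOTES_PER_MEAN)
print(round(threshold, 2))
```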

This difference, expressed in Q value, corresponds to:

Range in Q (dB)    ACR test D    Range in Q (dB)    DCR test D
15-20              1.48          15-20              1.07
20-25              1.87          20-25              1.14
25-30              3.00          25-30              1.36

This means that the resolution of the DCR test may be twice that of the ACR test in terms of Q value in the range of good telephone quality.
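The conversion of the 0.21 difference into Q values is not described in the text. A plausible reconstruction, sketched below in Python, divides the difference by the local slope of the score versus Q curve, using the MNRU reference scores of Tables B-1 and B-2 and assuming linear interpolation between adjacent reference points. Under these assumptions the published values are reproduced to within about 0.03 dB.

```python
HSD_DIFFERENCE = 0.21   # significant difference in MOS or DMOS units

# Scores for the MNRU reference conditions, keyed by Q in dB.
acr_mos = {15: 2.34, 20: 3.04, 25: 3.61, 30: 3.96}     # Table B-1
dcr_dmos = {15: 1.99, 20: 2.97, 25: 3.89, 30: 4.66}    # Table B-2


def q_resolution(scores, q_low, q_high, difference=HSD_DIFFERENCE):
    """Express a score difference as a Q difference (dB) over one Q range."""
    slope = (scores[q_high] - scores[q_low]) / (q_high - q_low)   # score per dB
    return difference / slope


for q_low, q_high in [(15, 20), (20, 25), (25, 30)]:
    acr = q_resolution(acr_mos, q_low, q_high)
    dcr = q_resolution(dcr_dmos, q_low, q_high)
    print(q_low, q_high, round(acr, 2), round(dcr, 2), round(acr / dcr, 1))
```

The last printed column is the ratio of the two resolutions, which exceeds 2 only in the 25-30 dB range.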

B.4 Conclusion

A good agreement between the results obtained with the two procedures (ACR and DCR) has been found. The presence of a reference before each judgement for the DCR procedure ensures a good anchoring of the listener's rating and consequently a larger spread of the degradation mean opinion score (DMOS) obtained by the coders. The evaluation of the coders based on the same speech samples leads to a better precision for the DCR procedure, at the price, of course, of reducing the importance, in the overall quality judgement, of the effort made to comprehend the samples. Therefore the degradation category rating procedure seems well adapted to evaluate good telephone quality coders.


TABLE B-1

Mean opinion scores (MOS) and 95% confidence intervals (INT) for ACR test

                        X               Y               Z
Test conditions     MOS    INT      MOS    INT      MOS    INT
PCM                 3.81   0.45     3.89   0.13     4.16   0.13
PCM 2A              3.99   0.13     4.10   0.13     3.90   0.14
PCM 4A              3.35   0.12     4.02   0.14     3.70   0.14
PCM 8A              3.39   0.14     3.48   0.14     3.46   0.12
PCM 10⁻⁴            3.31   0.15     3.55   0.14     3.15   0.16
PCM 10⁻³            1.90   0.15     1.78   0.13     2.10   0.17
PCM + 10 dB         3.94   0.15     4.02   0.12     4.14   0.11
PCM - 15 dB         3.49   0.16     3.60   0.14     3.41   0.16
ADPCM               3.60   0.15     3.41   0.13     3.65   0.12
ADPCM 2A            3.72   0.13     3.30   0.12     3.38   0.13
ADPCM 4A            3.14   0.13     2.85   0.13     2.63   0.13
ADPCM 8A            2.51   0.14     2.09   0.14     2.23   0.15
ADPCM 2T            3.77   0.12     3.33   0.13     3.42   0.13
ADPCM 4T            3.86   0.14     3.01   0.14     3.80   0.13
ADPCM 10⁻⁴          3.54   0.11     3.28   0.12     2.81   0.15
ADPCM 10⁻³          2.88   0.16     2.55   0.15     1.93   0.13
ADPCM + 10 dB       3.80   0.14     3.55   0.14     3.61   0.13
ADPCM - 15 dB       3.20   0.15     3.02   0.15     2.92   0.14
ADPCM, C 2A         2.44   0.16     2.62   0.16     2.23   0.14
ADPCM, C 4A         2.13   0.15     2.14   0.13     1.90   0.13
ADPCM, C 8A         1.98   0.14     1.84   0.13     1.59   0.12

Reference conditions    MOS    INT
S/N 40                  3.52   0.15
S/N 35                  3.18   0.17
S/N 25                  2.04   0.15
S/N 15                  1.23   0.09
Q 10                    1.41   0.10
Q 15                    2.34   0.11
Q 20                    3.04   0.10
Q 25                    3.61   0.09
Q 30                    3.96   0.09

Note 1 -- Votes combined across four speakers and two sentences.

Note 2 -- Number of votes = 128 except for Q where N = 256.

TABLE B-2

Degradation mean opinion scores (DMOS) and 95% confidence intervals (INT) for DCR test

                        X               Y               Z
Test conditions     DMOS   INT      DMOS   INT      DMOS   INT
PCM                 4.35   0.10     4.41   0.11
PCM 8A              3.48   0.16     3.33   0.15
PCM 10⁻³            2.21   0.11     2.25   0.14
ADPCM               4.33   0.11     4.22   0.11     4.05   0.12
ADPCM 8A            2.63   0.14     2.35   0.14     2.38   0.17
ADPCM 10⁻³          3.14   0.16     2.83   0.14     1.85   0.14
ADPCM 4T            4.29   0.10     3.69   0.14     4.09   0.13

Reference conditions    DMOS   INT
Q 15                    1.99   0.15
Q 20                    2.97   0.17
Q 25                    3.89   0.18
Q 30                    4.66   0.10
Q 35                    4.81   0.09
Origin                  4.98   0.03

Note 1 -- Votes combined across four speakers and two sentences.

Note 2 -- Number of votes = 128.

Figures B-1, B-2 and B-3

ANNEX C

(to Supplement No. 14)

Threshold method for direct comparison of digital encoders

with a modulated noise reference unit (MNRU)

C.1 Introduction

By directly comparing a digital system with an MNRU, it is possible to assess the Q value at which the MNRU equals the performance of the system under test. The method described here leads to a threshold of equality defined as the 50% preference level between the MNRU and the digital system.

The threshold method is expected to give stable and precise results even for high quality digital processes. For wideband digital encoders the use of a wideband MNRU as described in Annex A of Recommendation P.81 is recommended.

C.2 Testing procedure

A listening-only test procedure is used. A signal pair consisting of a reference signal and a test signal is presented to listeners, who are then asked to indicate which of the two signals in the pair they judge to have the higher quality (preference rating). The subjective equivalent SNR (Q) is defined as the reference SNR at which the regression curve of the preference scores intersects the 50% preference level. An example of Q obtained with hypothetical preference scores is shown in Figure C-1.
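The determination of Q described above can be sketched as follows. This is a minimal illustration only: the text does not specify the form of the regression curve, so a least-squares straight line is assumed here, the preference score is assumed to be the percentage of votes preferring the reference signal, and the function name `equivalent_q` and the sample scores are hypothetical.

```python
def equivalent_q(reference_snr_db, preference_percent):
    """Estimate Q as the reference SNR where a fitted line crosses 50%.

    Assumption: a straight-line least-squares fit stands in for the
    unspecified "regression curve" of the preference scores.
    """
    n = len(reference_snr_db)
    if n < 2 or n != len(preference_percent):
        raise ValueError("need at least two matching (SNR, score) points")
    mean_x = sum(reference_snr_db) / n
    mean_y = sum(preference_percent) / n
    sxx = sum((x - mean_x) ** 2 for x in reference_snr_db)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(reference_snr_db, preference_percent))
    if sxx == 0 or sxy == 0:
        raise ValueError("scores do not vary with SNR; no 50% crossing")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Solve slope * Q + intercept = 50 for Q.
    return (50.0 - intercept) / slope


# Hypothetical scores at 2 dB steps, ranging from 20% to 80%.
snr_levels = [26, 28, 30, 32, 34]
scores = [20, 35, 50, 65, 80]
q_value = equivalent_q(snr_levels, scores)
```

With these hypothetical scores the fitted line crosses 50% at a reference SNR of 30 dB.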

Figure C-1

C.3 Presentation of signals

Reference signal A and test signal B are arranged in an equal number of A-B pairs and B-A pairs, and presented in random order. Several distortion levels spaced, for example, at 2 dB intervals, are introduced into the reference signal so that the range of preference scores extends from 20% to 80%, with the 50% preference point lying in the middle of the distortion range. A timing diagram of the presentation is shown in Figure C-2.
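The balanced, randomized arrangement of pairs can be sketched as follows. This is an illustration under stated assumptions: each distortion level is assumed to appear once as an A-B pair and once as a B-A pair per replication, and the function name `build_presentation` and its parameters are hypothetical.

```python
import random


def build_presentation(reference_snr_levels_db, replications=1, seed=None):
    """Return a randomly ordered list of (reference SNR, pair order) trials.

    Assumption: every distortion level contributes one A-B pair and one
    B-A pair per replication, so both orders occur equally often.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(replications):
        for snr_db in reference_snr_levels_db:
            trials.append((snr_db, "A-B"))  # reference first
            trials.append((snr_db, "B-A"))  # test signal first
    rng.shuffle(trials)
    return trials


# Hypothetical distortion levels at 2 dB intervals.
trials = build_presentation([26, 28, 30, 32, 34], replications=2, seed=1)
```

Fixing the seed makes a presentation order reproducible, which may be convenient when the same tape has to be regenerated.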

Figure C-2

The subject is required to make a judgement and respond by saying ``A is better'' or ``B is better'' (forced choice). The responses ``A equals B'' and ``No difference'' are not permitted. The duration of the presentation should be limited to about six minutes in order not to tire the listeners. More listening samples may be presented after a suitable rest period. At least two, and preferably four or five, replications (repetitions of identical presentations) are recommended.
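The forced-choice responses can be reduced to preference scores per distortion level along the following lines. This is a sketch only: it assumes each response has already been decoded to either "reference" or "test" regardless of presentation order, and the function name `preference_scores` and the data layout are hypothetical.

```python
from collections import defaultdict


def preference_scores(responses):
    """Return {reference SNR: percentage of votes preferring the reference}.

    `responses` is an iterable of (reference_snr_db, preferred) pairs,
    where `preferred` must be "reference" or "test" (forced choice).
    """
    counts = defaultdict(lambda: [0, 0])  # SNR -> [reference votes, total]
    for snr_db, preferred in responses:
        if preferred not in ("reference", "test"):
            raise ValueError("forced choice: 'reference' or 'test' only")
        counts[snr_db][1] += 1
        if preferred == "reference":
            counts[snr_db][0] += 1
    return {snr: 100.0 * ref / total
            for snr, (ref, total) in sorted(counts.items())}


# Hypothetical votes from several listeners and replications.
votes = [(30, "reference"), (30, "test"),
         (32, "reference"), (32, "reference"),
         (32, "test"), (32, "reference")]
scores = preference_scores(votes)
```

Rejecting any other response value mirrors the rule that an "equal" or "no difference" answer is not accepted.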

Note -- If the MNRU is available in hardware and the SNR can be easily changed between presentations, a simplified procedure can be used. In this case the balancing to equally perceived quality is done by the subject. The adjustment is made during the pause between the pairs. The reference is always presented first. Presentation continues until the subject reports that the equality threshold has been reached.

C.4 Speech sources

It is necessary to use short sentences spoken by at least two males and two females, preferably four or six of each; different sentences are required for each speaker. The duration should be 2.5-5 seconds for speech and less than 10-15 seconds for music signals. Clicks at the beginning and end of the samples must be avoided. A linear microphone of sufficient bandwidth should be used to record the source signals in a sound-absorbent room having an ambient noise of less than 20 dBA and a reverberation time of less than 0.3 seconds in the band 125-8000 Hz. If digital recording equipment is used, the quantizing noise level should be less than the noise level in 14-bit linear PCM.

C.5 Listening environment

A high-fidelity sound reproduction system should be used for the listening test. When listening is carried out with loudspeakers, the reproduction equipment should be of studio-quality and the listening room should conform to CCIR Report 797. If headphones are used, diotic (binaural) listening is preferable. The bandwidth should be at least as wide as that of the digital system under test.

C.6 Listeners

The critical distance can be expressed as:

$$ D_c = 0.056 \sqrt{\frac{V}{T_R}} \ \text{meters} $$

where V is the volume of the room in cubic meters and T_R is the reverberation time in seconds (see ISO 35u).
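The critical distance expression can be evaluated as in the sketch below. It reads the expression as D_c = 0.056 sqrt(V / T_R) and assumes that T_R denotes the reverberation time in seconds, which the surviving text does not define; the function name `critical_distance_m` and its parameter names are hypothetical.

```python
import math


def critical_distance_m(room_volume_m3, reverberation_time_s):
    """Critical distance in meters, D_c = 0.056 * sqrt(V / T_R).

    Assumption: T_R is the room reverberation time in seconds.
    """
    if room_volume_m3 <= 0 or reverberation_time_s <= 0:
        raise ValueError("volume and reverberation time must be positive")
    return 0.056 * math.sqrt(room_volume_m3 / reverberation_time_s)


# Hypothetical room: 200 cubic meters, 0.5 s reverberation time.
d_c = critical_distance_m(200.0, 0.5)
```

For this hypothetical room the result is about 1.12 m.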

Conference room                        Omnidirectional   Directional   Critical
                                       microphone        microphone    distance
Small room (60-300 m2)
  moderate room treatment a)           0.3               0.5           0.6
Large room (300-1000 m2)
  some room treatment a)               0.6               0.9           1.2
  considerable room treatment a)       0.9               1.4           1.8

Japan, NTT (A-weighted, corrected)     +1.7       +3.0
Average                                +0.7 b)    +2.3    +10.3

(D-1)

where

Q_s = Q value for the codec quantization distortion,

L = Line bit rate (e.g. in kbit/s).

Conference room for 20 people    40    Quiet, satisfactory for conferences at tables 4.5 m in length
Conference room for 10 people    45    Satisfactory for conferences at tables 1.5-2.5 m in length
Conference room for 6 people     50    Satisfactory for conferences at tables 1.0-1.5 m in length

Table 1 [T1.16]