To meet the ITU-T requirements, the time alignment of PAMS was integrated with the PSQM perceptual model, including improvements such as partial frequency response equalization and a simple masking [...]. Average correlations were reported for the combined model, for eight unknown tests [...], and for an extended data set of 40 subjective tests including the training and validation sets [...].

The limitation of PEAQ to a maximum of two channels has been addressed by the development of an expert system to assist with the optimization of multichannel audio systems.

Following a competition run by ITU-T, in which several approaches were compared, a simple approach known as [...].

PESQ has been criticized, extended, or improved by several authors. The effect on PESQ of measurement conditions such as signal level has also been studied, and it has been noted that measured quality drops significantly if too high or too low a level is used or if the signal spectrum is poorly [...]. Finally, experiments have been conducted with PESQ and other models to change their input filter banks, which are all roughly based on the Bark scale, to the equivalent rectangular bandwidth (ERB) scale, though with PESQ this [...].

There is a growing need to monitor the speech quality of in-service networks, where intrusive models cannot be applied because reference speech signals uttered by end users are not controlled and may not be available to an objective model. However, the lack of a reference means that the compensation of variability in speech caused by different speakers and different utterances is quite limited in nonintrusive models compared to intrusive models.
C. Extension of Speech Quality Models

Following the standardization of PESQ, work has continued to extend the scope of intrusive assessment beyond traditional telephony speech quality. A wideband version of PESQ, replacing its telephone handset input filter with a simple high-pass filter, has recently been standardized by ITU-T for assessment of wideband speech (50 to [...] Hz). Models such as PESQ have been considered for use in assessing the quality of noise reduction algorithms, which pose an interesting problem because their complicated processing can improve or degrade quality depending on the signal conditions and subjective measurement method, and because it is not clear what reference signal should be used. New objective models have been proposed to address this by taking as inputs not only the clean noise-free original and [...] noisy original signal prior to the noise reduction process.

The first nonintrusive signal-based model in the literature was proposed by Liang and Kubichek, and the approach that it uses, to estimate the difference between the measured signal and some ideal space of speech signals, has been followed by several other authors. In this model, reference centroids are first trained from the perceptual linear prediction (PLP) coefficients of nondegraded speech signals, and then the time-averaged Euclidean distance between degraded PLP coefficients and the nearest reference centroid is calculated as an indication of speech quality degradation. Various distortion measures commonly used in vector quantization (VQ) were explored to improve the performance of the model, and an approach based on a hidden Markov model (HMM) was also proposed. Recently, the idea of measuring the deviation of degraded speech from a statistical model trained on clean speech was expanded by Falk et al. In addition to the clean reference speech signals, degraded speech [...].
These new models estimate [...].

This is important because the acoustics and signal processing in the handset can have a substantial effect on overall quality. To allow acoustic measurements to be made, a head [...]. Beerends, Berger, Goldstein, and Rix collaborated on a submission known as the acoustic assessment model (AAM), which extended PESQ with improved level, time and frequency response alignment, temporal and frequency masking, and a binaural cognitive [...]. The model offered improved performance compared to PESQ for assessment of telephone networks at digital or analog electrical interfaces, particularly in the worst case [...].

[...] mechanism of the human vocal tract. Gray considered a model based on the parameterization of a vocal tract model which is sensitive to telecommunication network distortions. Beerends and Hekstra also proposed a model based on the integration of a speech production model, for detecting signal parts that cannot be produced by human vocal tracts, and the PESQ intrusive model, for estimating the impact of those signal parts.

In contrast to the direct utilization of a speech production model, Kim proposed an auditory model for nonintrusive quality estimation (ANIQUE) in which both peripheral and [...]. The modulation spectrum is then related to the mechanical limitation of speech production systems to quantify the degree of naturalness in speech signals.

ITU-T held a competition to standardize a nonintrusive signal-based model. In terms of network conditions [...] with talkers speaking in noisy conditions. Two proposals were submitted, with the ANIQUE model being narrowly beaten by SEAM, which was based on three different models including those of Gray and of Beerends and Hekstra. In P.[...], [...] based on a linear combination of intermediate speech quality with 11 additional signal features. This is lower than the correlation of PESQ over the same data set (0.[...]) [...] the lack of reference available to SEAM, but does indicate that the model has good correlation with subjective test data. While good progress has been made in nonintrusive assessment in recent years, there is clearly still scope for improvement and for field experience of using this model.

[...] more or less simultaneously with the voice signal, impairments caused by delay, and the equipment impairment factor representing impairments caused by low bit rate codecs and errors such as [...]. A method has been standardized by ITU-T for estimating equipment impairment factors using subjective tests or objective models such as PESQ. However, several of the simplifying assumptions on which it is based, for example linearity and order independence, are known to be wrong in some circumstances. Nevertheless, by measuring certain parameters on the voice [...]. While the current E-model applies only to telephone systems, [...].

Parametric Quality Measures of Specific Network Types

For traditional telecommunications networks that are subject [...] connections, round-trip delay, noise, and changes to the speech [...]. In-service nonintrusive measurement devices (INMDs) allow these network parameters to be measured, typically at trunk or international switching centers. Two models have been standardized to allow these objective parameters to be used to estimate conversational MOS: the E-model and the call [...].
The E-model approach combines the measured parameters with a set of default assumptions, using [...]. CCI contains a functional mapping specifically derived to compute MOS-CQO from INMD measures of the speech and noise levels, echo loss, and delay.

Subjective test data is used to train a functional mapping from the objective parameters to MOS. The resultant mapping is only applicable to the specific network types [...]. It is also possible to use an intrusive model instead of subjective tests, although systematic inaccuracies in the intrusive model will be reflected in the parametric model. Examples of this approach are given in [...]. One application of recent interest is to estimate the quality of in-service wireless networks from parametric measures of [...]. A challenge here is to gather enough data to adequately model the very wide range of types of error that can occur in current mobile channels, with variation in the data rate, channel signal-to-noise ratio, forward [...].

V. [...]

Computational models have been widely used for many years for planning telecommunications networks without conducting subjective tests. The approach has more recently been [...] systems where the dominant distortions (packet loss, jitter, and the codec) can be accurately modeled by a small number of statistical measures.

A. E-Model

The E-model is a telecommunication transmission planning model that was originally developed by ETSI for predicting the [...]. The E-model presupposes that all parameters of the voice link that has to be assessed are known. In this glass box approach (see Fig. [...]), [...] affect the conversational quality. Within the telecommunication industry, a large set of commonly found contributing factors, such as loudness, background noise, low bit-rate coding distortions, packet loss, delay, and echo, [...]. The primary output of the E-model is a quality rating factor R on a 0 to [...] scale. An invertible mapping exists between R and [...].
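As an illustration of how a rating factor R can be converted to an estimated MOS, the sketch below uses the cubic mapping published in ITU-T Recommendation G.107. The constants and the clamping limits come from that Recommendation rather than from the text above, and the function name `r_to_mos` is hypothetical.

```python
def r_to_mos(r):
    """Map an E-model rating factor R to an estimated MOS (G.107-style cubic mapping)."""
    if r <= 0.0:
        return 1.0   # lower clamp
    if r >= 100.0:
        return 4.5   # upper clamp
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6


# Example: tabulate the mapping for a few rating values.
for rating in (0, 50, 60, 80, 100):
    print(rating, round(r_to_mos(rating), 3))
```

This is a sketch of the forward mapping only. An inverse, from MOS back to R, would have to be computed numerically or from a closed-form solution of the cubic and is not shown.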
[...] use of compression as low as 5.[...]. Loss of packets due to network loss and delay jitter is particularly important to network operators because it is load-dependent and difficult to characterize, even in networks that use traffic management.

Two approaches to parametric VoIP quality monitoring have been proposed. To allow real-time monitoring in low-power edge devices and on network trunks that may carry very large [...] protocol parameters. From estimates of the inputs to the E-model [...]. Clark has also proposed modifying the E-model to take account of time-varying perception of quality, which is discussed in Section VI. [...] are large variations between VoIP devices in the implementation of jitter buffers and error concealment. He has developed a proprietary model, based on multiple parameters extracted from [...]. This is achieved by making thousands of intrusive speech quality measurements of the device under test, using a network emulator to vary the operating conditions, and then training a numerical model to predict the PESQ scores for each condition.

Both of these models were presented to an ITU-T process to standardize a VoIP parametric model. No winner was selected; [...]. Previously known under the working title of P.VTQ (voice transmission quality), this is expected to recommend a method of performance assessment similar to the calibration process of [...]. Third parties could then use this performance assessment method to determine the accuracy of a VoIP objective model.

VI. Use of Objective Models

Training using large numbers of subjective tests can allow models to improve on this a little, but the cost of subjective tests means that models are usually trained with at most a few tens of tests. As a result, accuracy for a given condition [...]. In practice, this seldom presents a problem because even expert listeners struggle to distinguish differences in quality of this magnitude in an ACR context. As a result, for the most critical assessments [...] subjective testing are highly advisable.

As a result, important conditions that can cause accuracy problems with a given model are often already known, and it is advisable [...]. These problems may be due to issues in subjective test data (for example, if test results strongly conflict) or to systematic bias with a specific type of distortion or test signal, and they can result in errors as large as 0.[...]. [...] include automatic diagnosis of certain failure conditions, which can assist the operator, but this is not a substitute for reading the standard and its associated guidelines, or for listening to some of the conditions under test.

The accuracy and repeatability of intrusive models depend to a great extent on how they are used, and mistakes remain relatively common. [...] level and spectral content; for telephony, this usually involves prefiltering recordings with a send filter like the IRS. Otherwise, they may cause codecs to behave poorly due to clipping or increased coding distortion, or lift the noise floor. Speech signals should contain both speech and silent intervals and have [...]. Audio signals must be clean and [...]ical material, or a combination of both. [...] signal content. The idealized conditions of subjective tests are a long way from the real signals obtained from measurement points in live networks, terminals, [...]. In particular, care needs to be taken if signals could have nonoptimal levels or spectral content, or where the signals are captured upstream of echo cancellers and as a result may contain high levels of network or acoustic echo.

B. Time-Varying Quality

Most of the models introduced above have focused on estimating short-term MOS, typically measured in subjective tests [...]. This timescale is long enough for subjects to [...]. However, a typical duration of a phone call or audio track is on the order of 2 min, and several researchers have studied how timescales may affect quality perception.

Gray found weak evidence, using 30 s speech paragraphs, that the first part of the speech sample had greatest weight on overall MOS, and termed this the primary effect. Several more recent tests have shown weak or no evidence for either the primary or recency effects. As these tests differed substantially in their design, but none was large in terms of number of subjects, speech samples or processing conditions, it cannot be said that there is clear evidence for either primary or recency effects. This suggests that it is [...]. A recent study by Raake indicates that poor worst-case quality does have a significant impact on long-term quality perception. The use of mean and worst-case measures together avoids the possibility that periods of sustained poor quality could be outweighed by good quality, an issue [...].

[...] For example, as mentioned above, the scope [...]. Furthermore, the current E-model is being extended towards the use of wideband speech signals (50 to [...] Hz). [...] with wideband speech databases, and it was approved on the basis of near-transparent quality audio, while the mass-market use of audio codecs that has evolved in subsequent years [...]. Although at present there are no plans for a replacement to this model, recent research indicates that improvements are possible, and some of the ideas in PEAQ could prove useful for P.[...].

Other active areas of research are the quality assessment of multi-channel audio [...]. This may be due in part to the difficulty and cost of conversational subjective tests, or to the improvement of echo cancellers which can minimize [...]. This lack of research is a concern because most telephone systems are used conversationally. A concept for intrusive conversation quality [...].

Nonintrusive measurement methods, both signal-based and parametric, are relatively new, with the first generation of models designed for network assessment only emerging in the last five years. As these methods start to become widely used, it is highly desirable that experience on their strengths and weaknesses is published. The practical use of nonintrusive [...] echo canceller artifacts remains challenging.

[...] applied to video quality assessment. Initial models for combining measures of audio and video quality to give an overall audiovisual MOS have been published, but they are based on limited data, and this remains a fruitful area for further research in both subjective and objective domains.
The results show that Mel-SD performs better than Mel-CD and is robust to changes in the filter bank structure, whereas Mel-CD performs well only when the number of filters is less than [...]. In practical use of Mel-CD, the number of filters therefore must not be too large, so that the objective evaluation remains accurate and usable.

Although Mel-SD is robust to changes in the filter bank structure, its performance peaks at 10 filters, so a small number of filters should also be chosen for it in practice, both to preserve performance and to reduce computational complexity. Based on the above tests and analysis, both Mel-CD and Mel-SD should use a filter bank with a small number of filters in practice.
From the analysis of the test results, both measures perform better when the number of filters lies between [...]. For Mel-CD and Mel-SD, the average performance over the tests is best when both use 10 filters.

Relationship between Mel-SD and compression transform factor: In Mel-SD, Chen and Jin choose the cube root function to model the relationship between speech intensity and perceived loudness. This relationship is an approximation derived from static measurements in psychoacoustic experiments. Objective evaluation of speech quality, however, involves dynamically changing speech, which raises two questions: [...]. With the number of filters fixed at its optimized value, we study the relationship between the compression transform and Mel-SD performance. A power function is used as the compression function; its exponent, called the compression factor, is required to be less than one. Based on experimental knowledge and experience, we vary the compression factor, which is set at 0.[...].
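The power-law compression described above can be sketched as follows. This is a minimal illustration, assuming the Mel filter bank outputs are available as non-negative band energies and that the compression factor is positive. The function names and the root-mean-square distance are hypothetical stand-ins and do not reproduce the exact Mel-SD definition.

```python
import math


def compress(band_energies, factor=1.0 / 3.0):
    """Apply power-law compression; the default exponent is the cube root."""
    if not 0.0 < factor < 1.0:
        raise ValueError("compression factor must lie strictly between 0 and 1")
    if any(energy < 0.0 for energy in band_energies):
        raise ValueError("band energies must be non-negative")
    return [energy ** factor for energy in band_energies]


def compressed_distance(reference, degraded, factor=1.0 / 3.0):
    """Hypothetical distortion: RMS difference of the compressed band energies."""
    if len(reference) != len(degraded):
        raise ValueError("both signals need the same number of bands")
    ref = compress(reference, factor)
    deg = compress(degraded, factor)
    squared = sum((a - b) ** 2 for a, b in zip(ref, deg))
    return math.sqrt(squared / len(ref))


# Toy band energies for one frame (illustrative values only).
distance = compressed_distance([8.0, 27.0, 64.0], [8.0, 27.0, 125.0])
```

Sweeping `factor` over a set of candidate values and recomputing the distortion for each is one way to study how the compression factor affects the measure.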
The test conditions are the same as in the preceding section, and the average performance over the eight tests is used as the overall performance figure. The number of filters is 10 in Fig. [...]. For comparison, the figure also shows the performance of Mel-CD with 10 filters as a benchmark; since Mel-CD does not depend on the compression factor, it appears as a straight line in the figure.

From Fig. [...], over the tested range of compression factors the performance of Mel-SD varies between 0.[...]. When the compression factor is 0.[...], [...]. When the compression factor is less than 0.[...], [...]. When the compression factor is greater than 0.[...], [...]. When the compression factor increases to 0.[...], [...].

From the above analysis, Mel-SD has an optimal compression factor once the number of filters has been optimized. Within a certain range, the compression factor has little impact on performance, and Mel-SD always performs better than Mel-CD. The best compression factor is close to the value in the approximation obtained from static psychoacoustic measurements, which confirms that this basic intensity-loudness relationship is suitable for speech quality assessment; the best factor is 0.[...].
The optimized Mel-CD and Mel-SD measures are applied to the objective evaluation of speech quality for a communication system under interference conditions. For comparison, the evaluation results of ITU P.[...] are used as the baseline. The results are given in Table 3 and Fig. [...]. To compare the overall performance of the three objective measures, the average over the eight tests is given in Table 4.
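An average correlation over several tests can be computed along the following lines. This is a minimal sketch that assumes the performance figure is the Pearson correlation between objective scores and subjective scores within each test; the function names and the toy data are hypothetical and do not come from the tables.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least two items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def average_correlation(tests):
    """Mean of the per-test correlations; tests is a list of (objective, subjective) pairs."""
    if not tests:
        raise ValueError("at least one test is required")
    return sum(pearson(objective, subjective) for objective, subjective in tests) / len(tests)


# Two toy tests (illustrative values only).
toy_tests = [
    ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    ([1.0, 2.0, 4.0], [1.5, 2.5, 4.0]),
]
overall = average_correlation(toy_tests)
```

For a distortion measure, which grows as quality falls, the raw correlation with subjective scores is negative, so one would typically compare magnitudes.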
Table 4 gives an overall comparison of the three measures over the eight tests. The average correlation of Mel-SD is 0.[...], an increase of 0.[...] [...]. The average correlation of Mel-CD is 0.[...].

Performance comparison between optimized and non-optimized Mel-SD and Mel-CD: To assess the influence of parameter optimization on performance, the results for Mel-CD and Mel-SD without parameter optimization are given in Table 5, where the number of filters in the filter bank is 24 and the compression factor is 0.[...]. In particular, the average correlation of the optimized Mel-CD increases by 7.[...]. The optimized Mel-SD performs slightly better than the non-optimized one, but the change is small, which again shows the robustness of Mel-SD, especially to the filter bank design.

Because the Mel-domain filter bank is an important part of Mel-domain objective measures, we studied how it affects the performance of the two measures. The study shows that, in the given tests, Mel-SD is robust to changes in the filter bank structure and performs better than Mel-CD, whereas Mel-CD is more sensitive to changes in the filter structure.
Once the number of filters exceeds 13, performance degrades as the number of filters increases. Taking both performance and computational complexity into account, the two measures should use a number of filters between [...]; ten is the best number of filters for both measures. For a given number of filters, Mel-SD has an optimal compression factor. Within a certain range, the compression factor has little impact, and the performance remains better than that of Mel-CD.

The best compression factor is broadly in line with the approximation obtained from static psychoacoustic measurements, which confirms that the basic relationship between sound intensity and loudness is suitable for speech quality assessment; the best factor is 0.[...].

Comparing performance before and after parameter optimization shows that optimization significantly improves the performance of Mel-CD, and further confirms that Mel-SD is robust to parameter changes. In summary, reasonable parameter optimization of Mel-domain speech quality measures can ensure good evaluation performance while avoiding unnecessary computational complexity. With appropriately selected filter parameters, Mel-CD achieves evaluation performance equivalent to that of PESQ, and Mel-SD shows good performance and robustness to parameter variation.
References

Barnwell III, T. [...] Bush, Statistical correlation between objective and subjective measures for speech quality.
[...] Quackenbush, An analysis of objectively computable measures for speech quality testing.
[...] Objective measures for speech quality testing. [...] Acoustical Soc. [...]
[...] Correlation analysis of subjective and objective measures for speech quality.
[...] A comparison of parametrically different objective speech quality measures using correlation analysis with subjective quality results.
Chen, G. Hu, Y. Zhang and Y. Zhu, Research advance on objective measures of speech quality. Acta Electronica Sinica.
[...] Jin, Mel-spectral distortion measure based on perception model for objective speech quality assessment. [...] Southwest Jiaotong Univ.
[...] Steinberg, Factors governing the intelligibility of speech sounds.
Yi, B. Tian and Z. Zhang, One-step strategy of speech quality objective assessment.