Perceptual Evaluation and Analysis of Reverberation in Multitrack Music Production

Artificial reverberation is an important music production tool with a strong but poorly understood perceptual impact. A literature review of the relevant works concerned with the perception of musical reverberation is provided, and the use of artificial reverberation in multisource mixes is studied. The perceived amount of total artificial reverberation in a mixture is predicted using relative reverb loudness and early decay time, as extracted from the newly proposed Equivalent Impulse Response. Results indicate that both features have a significant impact on the perception of a mix and that they are closely related to the upper and lower bounds of desired amount of reverberation in a mixture.


INTRODUCTION
Reverberation is one of the most important tools at the disposal of the audio engineer. Essential in any recording studio or live sound system [1], the use of artificial reverb (simply referred to as "reverb" in this work) is widespread in most musical genres and it is among the most universal types of audio processing in music production.
Despite its prominence in music production, there are few studies on the usage and perception of artificial reverberation relevant to this context. The limited research may relate to a lack of universal parameters and interfaces, while algorithms across the available reverb units vary wildly. In comparison, typical equalization (EQ) parameters are standardized and readily translate to other implementations.
The ability to predict the desired amount of reverberation with a reasonable degree of accuracy has applications in automatic mixing and intelligent audio effects [2,3], novel music production interfaces (e.g., various mappings of low-level parameters to more perceptually relevant parameters or terms [4,5]), and compensation of listening conditions [6].
In this work, the previous studies concerned with the automation, preference, and perception of reverberation in music are critically reviewed to establish the requirements for a new methodology (Sec. 1). The problem and definitions used in the remainder of the work are established in Sec. 2. Sec. 3 presents an experiment where a dataset of mixes is perceptually evaluated to explore the relationship between perceived amount of reverberation and the underlying objective parameters. Analysis of the annotated subjective responses is discussed in Sec. 4. In Sec. 5, the ITU-R BS.1770 loudness of the reverb versus that of the "direct sound" is tested against the mix evaluations. Then, the concept of an Equivalent Impulse Response is introduced and its reverberation time is assessed as a predictor of perceived amount of reverberation (Sec. 6). Concluding remarks and a discussion of future work ensue in Secs. 7 and 8.
The focus of this study is the perception of artificial reverberation of multi-source materials taken from examples of fully-realized, professional music productions. The present case stands apart from the work cited above, where the effect of reverb parameters on the subject's preference or perception is under investigation, as applied to a single source, and typically isolated from any musical, visual, or sonic context. As reverberation is a complex and multifaceted matter, controlled experiments are often required. Several of these studies involved only a single, simple, and potentially unpleasant and unfamiliar reverberator [15,16,22,25,29], sometimes without the use of early reflections [2,17] or stereo capabilities [6,18]. In some cases, the number of reverberator parameters was limited, often taking a restricted range or set of values [19,20,21], and applied to a single (type of) source sample [22,23,24]. In [3,25] the parameter values considered were set by unskilled participants using unfamiliar tools and inferior listening environments. Finally, the results of several parameter adjustment tests are not validated through perceptual evaluation [26,27,28]. It has not yet been investigated whether the perception of reverberation amount and time of a single source in isolation has any relevance within the context of multitrack music production, inherently a multidimensional problem, where different amounts and types of reverb are usually applied to different sources, which are then combined to form a coherent mixture. Thus, while relevant for the respective studies, these works may not offer insight into how an audio professional might use reverb in a commercial music production environment.
To better understand the use, perception, and preference with regard to reverberation in music, it is deemed necessary to study its application by trained engineers using familiar, professional-grade tools in the context of a complete, representative mix. The results of such application should be subjectively evaluated to validate the engineers' choices and gain additional insight into the perceptual impact of differences in reverb. The methodology presented herein, along with the findings from a particular dataset, accommodates analysis of practice and perception of reverb in a less controlled, ecologically valid setting.

PROBLEM FORMULATION
In what follows, the perceived amount of reverberation is predicted based on objective features extracted from both the combined reverb signal and the remainder of the mix. These signals will be referred to as wet ($s_\mathrm{wet}$) and dry ($s_\mathrm{dry}$), respectively. They are not always easy to extract in practice, even when all source audio and DAW session files, including all parameter settings, are available. This is due to the following conditions: 1) Different amounts and types of reverb are applied to the different sources in the mixture; and 2) Post-reverb, nonlinear processing (dynamic range compression, fader riding, automation of parameters) as well as linear processing (weighting, EQ) are applied to the individual sources as well as the complete mix or subgroups thereof.
Omitting time arguments for readability, tracks $n = 1, \ldots, N$ carry the source signals $x_n$, which are often already processed before any reverb is applied, giving $y_n = f^{\mathrm{pre}}_n(x_n)$. Reverb (with impulse response $h_n$) can be added to the processed tracks $y_n$ using serial processing, with the reverb plug-in inserted "in-line," where the gain ratio $r_n \in [0, 1]$ between the wet and dry signal is set within the plug-in (Fig. 1a). Alternatively, reverb is added through parallel processing, with tracks scaled by a gain factor $g$ and sent to a reverb plug-in on a separate bus. Typically, several tracks $n_m = 1, \ldots, N_m$ are sent to the same reverb bus $m$ (Fig. 1b). In both cases, further processing $f^{\mathrm{post}}_n(\cdot)$ is then applied to the respective tracks and buses, i.e., post-reverb. The wet and dry parts of the mix can therefore be expressed as in Eqs. (1) and (2). With $\tilde{h}_n = r_n h_n + (1 - r_n)\,\delta$ as the total impulse response of the in-line reverb, reverberant ratio $r_n$ included, where $\delta$ is the unit impulse, the total mix $s_\mathrm{tot}$ is obtained in the same way, where $f^{\mathrm{post}}_n(\cdot)$ is applied to both the wet and dry signal in such a way that their sum still equals the original mix. Any gain changes applied by a dynamic range compressor are dependent on its side-chain signal (equal to the input signal by default). The original mixed signal is thus used for this side-chain signal when processing the dry or wet signal. In other words, in Eqs. (1) and (2), $f^{\mathrm{post}}_n(\cdot) = f^{\mathrm{post}}_n(\cdot,\ \tilde{h}_n * y_n)$, the extra argument representing the side-chain signal, so that $s_\mathrm{tot} \equiv s_\mathrm{dry} + s_\mathrm{wet}$. For simplicity, it is assumed that this post-processing is applied per track, though in reality it can be applied to groups of sources simultaneously.
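As a sketch only, one way to write out the signal model described above is given below. The per-track send gains $g_{n_m}$, the bus post-processing $f^{\mathrm{post}}_m$, the bus reverb impulse response $h_m$, and the convention that $r_n = 0$ for tracks without in-line reverb are assumptions introduced here for illustration; the side-chain argument is omitted for brevity.

```latex
\begin{align}
  s_\mathrm{dry} &= \sum_{n=1}^{N} f^{\mathrm{post}}_n\big((1 - r_n)\, y_n\big) \\
  s_\mathrm{wet} &= \sum_{n=1}^{N} f^{\mathrm{post}}_n\big(r_n\, h_n * y_n\big)
      + \sum_{m} f^{\mathrm{post}}_m\Big(h_m * \sum_{n_m=1}^{N_m} g_{n_m}\, y_{n_m}\Big) \\
  s_\mathrm{tot} &= \sum_{n=1}^{N} f^{\mathrm{post}}_n\big(\tilde{h}_n * y_n\big)
      + \sum_{m} f^{\mathrm{post}}_m\Big(h_m * \sum_{n_m=1}^{N_m} g_{n_m}\, y_{n_m}\Big)
\end{align}
```

With linear post-processing, $\tilde{h}_n * y_n = r_n\, h_n * y_n + (1 - r_n)\, y_n$, so the first two expressions sum to the third. With nonlinear post-processing, this equality relies on the shared side-chain signal described in the text.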
The interest herein is how the perceived excess or lack of reverberation amount is influenced by the difference between the loudness of the reverb signal and the dry signal (see [2,6,32]), as well as the overall reverberation time (see [2,15,24]).
The first considered feature, relative reverb loudness (RRL), is defined as
$$\mathrm{RRL} = \overline{\mathrm{ML}(s_\mathrm{wet}) - \mathrm{ML}(s_\mathrm{dry})}$$
where ML is the Momentary Loudness in loudness units (LU) as specified in [33]. The difference of the momentary loudness of the wet and dry signal is calculated for each measurement window, and the average ($\overline{x}$) is taken over all windows. It should be noted that (forward) masking and binaural dereverberation are not taken into account with this measure. More advanced partial loudness features were used in [2] to predict the perceived amount of reverb. However, such features (see https://github.com/deeuu/loudness/) were not used in this work because the authors found they did not perform well on the considered content, showing weak correlation with perception, and more work is needed to establish the applicability of multi-band loudness models [34], specifically to multisource music. Furthermore, the simple filtered RMS measure used here is far less computationally expensive and suitable for real-time applications.
The second feature, reverberation time, is usually derived from the reverberation impulse response (RIR). In the context of this study, however, the RIR is not readily defined, due to conditions (1) and (2) above. As such, the transformation between the mix without reverb and the mix with reverb is not a linear one, and it cannot be defined by an impulse response, even if the reverberator used is applying a linear transformation (which is also not always the case [35]). However, an Equivalent Impulse Response (EIR) can be estimated in which temporal and spectral aspects of the total reverb are embedded (Eq. (5)). From such an impulse response, traditional (acoustic) reverberation parameters can be extracted, which describe the overall reverberation in universally defined terms such as reverberation time, along with clarity, IR spectral centroid, and central time, which can then be translated to other reverberators [4].
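The relative reverb loudness defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes mono signals sampled at 48 kHz, uses the K-weighting filter coefficients published in ITU-R BS.1770 for that rate, and assumes 400 ms windows with a 100 ms hop and no gating. The function names are hypothetical.

```python
import numpy as np

# K-weighting biquad coefficients from ITU-R BS.1770 (valid for fs = 48 kHz only).
SHELF_B = [1.53512485958697, -2.69169618940638, 1.19839281085285]
SHELF_A = [1.0, -1.69065929318241, 0.73248077421585]
HPF_B = [1.0, -2.0, 1.0]
HPF_A = [1.0, -1.99004745483398, 0.99007225036621]


def biquad(x, b, a):
    """Direct-form I second-order filter; a[0] is assumed to be 1."""
    y = np.zeros(len(x))
    x1 = x2 = y1 = y2 = 0.0
    for n, xn in enumerate(x):
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y[n] = yn
    return y


def momentary_loudness(x, fs=48000, win_s=0.4, hop_s=0.1):
    """Loudness of each window of a mono signal (window and hop are assumptions)."""
    k = biquad(biquad(np.asarray(x, dtype=float), SHELF_B, SHELF_A), HPF_B, HPF_A)
    win, hop = int(win_s * fs), int(hop_s * fs)
    starts = range(0, len(k) - win + 1, hop)
    mean_square = np.array([np.mean(k[s:s + win] ** 2) for s in starts])
    # Small constant avoids log(0) in silent windows.
    return -0.691 + 10.0 * np.log10(mean_square + 1e-12)


def relative_reverb_loudness(wet, dry, fs=48000):
    """Mean per-window loudness difference between wet and dry signals, in LU."""
    return float(np.mean(momentary_loudness(wet, fs) - momentary_loudness(dry, fs)))
```

Because the filter is linear, a wet signal that is simply a scaled copy of the dry signal yields an RRL equal to the gain in dB, which is a convenient sanity check.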

Design
A set of mixes was created for a number of songs and subsequently compared against each other and subjectively assessed in a multiple-stimulus test. The mixes were to be rated according to "preference" as well as commented on with a free-form text response. The preference rating serves to determine the overall appreciation of the mix and how this correlates with audio features extracted from the mix and its components (see [36]). It further forces the subject to consider which mix they prefer over which, so that they reflect and comment on the aspects that have an impact on their preference.
The goal of this experiment was to uncover which mixes were spontaneously perceived as too reverberant or as not reverberant enough. Therefore, the subjects were not explicitly asked to rate the perceived amount of reverberation. Rather, analysis of the free-form comments reveals mixes in which reverberation, and the relative lack or abundance thereof, was referenced as an issue.
The independent variables of the experiment were mix (or mix engineer) and song. The dependent variables consisted of the preference rating and the free-choice profiling results.

Participants
The mixes were created by 24 master level sound recording students from the same program, all musicians with a Bachelor of Music degree. Each song was mixed by a group of eight students, where each individual student mixed between one and five songs. The average participant was 25.1 ± 1.8 years old, with 5.1 ± 1.9 years of audio engineering experience. Of the 24 participants, 5 were female and 19 were male.
For the perceptual evaluation experiment there were a total of 34 participants: 24 participants from the mix creation process and 10 instructors from the same sound recording program. For each individual song, between 12 and 16 subjects assessed the different mixes. In the context of this work, students did not evaluate any songs they had previously mixed. Each student received a small compensation for their time upon taking part in the listening test.

Materials
Multitrack recordings of 10 different songs, played by professional musicians and recorded by Grammy award-winning recording engineers, were given to the students tasked with creating a stereo mix from the source tracks. A total of 80 student mixes were created for the experiment. With a few exceptions, the students were unfamiliar with the content before the experiment. Table 2 lists all songs used in the experiment. Those which have a Creative Commons (CC) license have been made available on the Open Multitrack Testbed [37], including source tracks and mixes. A constrained but representative set of software tools was used to create the mixes, consisting of an industry standard digital audio workstation (DAW) with standard native plug-ins and additional professional reverb plug-ins. The students were familiar with all of these tools. Restricting the toolset allowed for extensive analysis of parameters and the ability to recreate the mix or its constituent tracks, with the various processing units enabled or disabled. As such, the reverb signals could be isolated from the rest of the mix.
The participants produced the different mixes in their preferred mixing location, so as to achieve a natural and representative spread of environments without a bias imposed by a specific acoustic space, reproduction system, or playback level. A limit of six hours of mixing time was imposed on the participants, but no further directions were given.
In addition to these eight mixes, the original, commercial mix was also provided in the listening test, and in some cases a machine-made mix, though these are not included in the analysis as the parameter data are not available for these versions. The songs were selected from a wide range of genres to average out differences in genre-specific mixing approaches and signal characteristics and to allow for analysis of the influence of genre.
Further analysis of the mixes (Secs. 5 and 6) was conducted using the 71 mixes where all parameters were accessible and the mix could be perfectly recreated. In the other cases, participants used more than the permitted set of tools.

Apparatus
The listening test interface (from [38,39], see Fig. 2) consisted of a single horizontal preference axis, with each mix represented by a numbered, vertical marker, and a corresponding text box for comments on that mix. An extra text box was provided for general comments on all mixes or the song as a whole. No anchors or references were included, and each fragment could be auditioned as many times as desired. Song and mix order was fully randomized, and all mixes were scaled to equal loudness according to [40]. At the end of the fragment, playback would loop to the start of that fragment. The fragments were aligned so that upon switching between fragments, the new fragment would start playing from the corresponding position. Playback could be paused and reset to the beginning by clicking the stop button.
The test took place in a professional-grade listening room with a high quality audio interface and loudspeakers [36]. Headphones were not used to avoid the sensory discrepancy between vision and hearing, as well as the expected differences in terms of preferred reverberation between headphone and speaker listening [41].

Procedure
The listening test was conducted with one participant at a time. After having been shown how to operate the interface, the participants were asked, both in writing and verbally, to audition the samples as often as desired, rate the different mixes according to their preference, and write extensive comments in support of their ratings, for instance "why they rated a fragment the way they did" and "what was particular or different about it." They were instructed to first set the listening level as they wished, since their judgments are most relevant when listening at a comfortable and familiar level [42], and since the perceived reverberation amount varies with level [6,25]. The instructions further stated participants could use the preference rating scale however they saw fit.
To reduce strain on the subjects, a fragment containing the second verse and second chorus of the song was selected from each mix, averaging one minute in length. This section was considered maximally representative as most sources were active in this part of the song. With up to 10 mixes per song, and up to 4 songs per test, the test length was well below the recommended duration limit of 90 minutes [43], and the possibility to take breaks was given to participants.

COMMENT ANALYSIS
To allow quantitative processing, every comment was split into its constituent statements. In total, 4227 separate statements were annotated from 1326 comments. Of these comments, 35.44% mention reverberation, and reverberation is not commented on by anyone in only 2 of the 80 mixes considered here. Furthermore, every subject commented on reverberation for at least 10% of the mixes they assessed. The comments were classified into three classes: "Too much reverb," "Not enough reverb," and, when unrelated to the perceived amount of reverberation, "Neither."
Participants disagreed on whether there was too much or too little reverberation in only 4 of the 525 comments that mention reverberation. This supports the idea that mix engineers have a consistent judgment on the "correct" reverberation amount for a given mix. The low variance in the results may be explained by the fact that test participants are skilled listeners [25]. In the following sections, only comments regarding the subjective excess or shortage of reverberation of the whole mix (i.e., not any particular instrument) are considered. Fig. 3 shows the mean preference ratings associated with statements from the different classes. As previously observed in [32,44], the preference rating for a mix the subject found too reverberant is significantly lower than if it was considered too dry.

RELATIVE REVERB LOUDNESS
The relative reverb loudness is shown for each mix in Fig. 4, along with the number of subjects who indicated the mix was perceived as too reverberant or not reverberant enough, divided by the total number of subjects for that song. As expected, the majority of the mixes labelled "too reverberant" have a significantly higher relative reverb loudness than those labelled "not reverberant enough." Overall, the preferred reverb loudness seems to differ significantly from [32], where the optimal reverb return loudness is estimated to be at −9 LU. In the current experiment, every mix with a relative reverb loudness of −9 LU or higher was judged to be too reverberant, and −14 LU appears to be a more desirable loudness as it is in between 95% confidence intervals of the medians of either labelled group.
The differences in reverb loudness are mostly subtle, with the just-noticeable difference (JND) of direct-to-reverberant ratio estimated at 5-6 dB [45], proof of the critical nature of the engineer's task. Despite this, there is a large level of agreement with regard to what mixes have a reverb surplus or deficit. The variance of preferred reverb level is considerably larger in [25], possibly due to the unskilled listeners.
There are some cases where despite a relatively high reverb loudness, subjects agreed that there was not enough reverberation (e.g., mix 3C or 5C in Fig. 4), or where mixes with a perceived excess of reverb did not exhibit a significantly higher-than-average measured loudness (e.g., 1B, 8P). Closer study of these outliers, through informal listening and analysis of parameter settings, revealed that mixes with a high perceived amount of reverberation but low measured reverb loudness typically have a long reverberation tail. Those marked as too dry have a strong, yet short and clear reverb signal, to the point of sounding similar to the dry input. As in [2], it would seem relative loudness of the reverb signal alone is generally insufficient to predict the perceived or preferred amount of reverberation. It is therefore believed that measuring the reverberation time will help explain the perceived amount of reverberation [21,23,31].

Process
For the practical measurement of the EIR $h_\mathrm{eq}$ (see Eq. (5)) it is not possible to use sine sweep or maximum length sequence (MLS) methods due to condition (1) from Sec.
In this case, the equivalent frequency response $H_\mathrm{eq}$ is a frequency- and gain-weighted version of the various reverb frequency responses $H_n$ and $H_m$, being dependent on the post-processing, the (pre-processed) input signals, and the wet to dry ratios. This interpretation is violated to the extent that $f^{\mathrm{post}}_n(\cdot)$ is not a linear function, see condition (2) from Sec. 2. In the case it is approximately linear but not stationary, the equivalent frequency response can describe the total reverb with reasonable accuracy as a function of time.
[Figure caption, partially recovered: ... Table 2). The box plots show the relative loudness values for mixes collectively found to be too "wet" and "dry," respectively; here, the center line denotes the median, the box extends from the 25th to the 75th percentile, the notch is the median's confidence interval, and the whiskers span from the lowest to the highest value.]
Neglecting any nonlinearities, the EIR is obtained by division of the signals ($s_\mathrm{wet}$ and $s_\mathrm{dry}$) in the spectral domain (also known as dual-channel FFT analysis) [46]. Following Welch's method, complex averaging is performed on both the dry signal's power spectrum or auto spectrum ($G^{(i)}_{\mathrm{dry,dry}}$) and the cross spectrum ($G^{(i)}_{\mathrm{dry,wet}}$), taken from signal segments $i = 1, \ldots, I$, with 50% overlap and a Hann window:
$$h_\mathrm{eq} = \mathrm{iFFT}\left(\frac{\sum_{i=1}^{I} G^{(i)}_{\mathrm{dry,wet}}}{\sum_{i=1}^{I} G^{(i)}_{\mathrm{dry,dry}}}\right)$$
where iFFT is the inverse Fast Fourier Transform. The window length has been empirically obtained to produce the impulse response with the lowest noise floor while still being sufficiently long compared to the reverberation times.
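The spectral division with Welch-style averaging described above can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes both signals are already summed to mono and time-aligned, and the function name, the default segment length, and the small regularization constant are assumptions made for this example.

```python
import numpy as np


def estimate_eir(dry, wet, nperseg=4096):
    """Estimate an equivalent impulse response from dry and wet mono signals.

    Averages the cross spectrum (dry, wet) and the dry auto spectrum over
    Hann-windowed segments with 50% overlap, divides them, and transforms
    the result back to the time domain.
    """
    dry = np.asarray(dry, dtype=float)
    wet = np.asarray(wet, dtype=float)
    window = np.hanning(nperseg)
    hop = nperseg // 2  # 50% overlap
    n_bins = nperseg // 2 + 1
    g_dd = np.zeros(n_bins)                  # accumulated auto spectrum of dry
    g_dw = np.zeros(n_bins, dtype=complex)   # accumulated cross spectrum

    n_samples = min(len(dry), len(wet))
    for start in range(0, n_samples - nperseg + 1, hop):
        d = np.fft.rfft(window * dry[start:start + nperseg])
        w = np.fft.rfft(window * wet[start:start + nperseg])
        g_dd += np.abs(d) ** 2
        g_dw += np.conj(d) * w

    # Regularization (an assumption) avoids division by zero in empty bins.
    h_freq = g_dw / (g_dd + 1e-12)
    return np.fft.irfft(h_freq, n=nperseg)
```

For a purely linear, time-invariant reverb applied to a broadband signal, the first samples of the returned response should approximate the true impulse response. In a real mix there is no such reference, as the text notes.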
In contrast to most work on impulse response estimation and room impulse response inversion, in this case there is no reference or error measure to objectively evaluate the quality of the obtained impulse response. Convolving the dry signal with the EIR will rarely approximate the wet signal, due to condition (1).
While stereo reverberation generated from a monaural source is generally defined by two impulse responses (one for each channel), and stereo reverberation of a stereo source by four ($h_{L \to L}$, $h_{L \to R}$, ...), for the purpose of this study a single impulse response is extracted from the spectral division of the wet and dry signal, each summed to mono. It has been shown that with identical reverberation times and level, mono and stereo reverberation signals are perceived as having equal loudness regardless of the source material [44].
From this impulse response, it is possible to extract reverberation time measures such as the Early Decay Time (EDT). This is a particularly suitable feature as the calculated impulse responses are noisy. Furthermore, it has been shown that the EDT is more closely related to the conscious perception of reverberation, especially while the source is still playing during the reverberation decay, as is the case here [14,31].

Equivalent Impulse Response Analysis and Results
Fig. 5 shows all mixes as a function of their reverb loudness and reverb time, labeled according to the net number of subjects who classified them as either "Too much reverb," "Not enough reverb," or "Neither." The relative reverb loudness is as computed in Sec. 5, and the EDT is calculated from the EIR using the decay method, equivalent to six times the time it takes for the decay curve to reach −10 dB, an estimation of $T_{60}$ [47]. The logarithm of the EDT is used to better visualize a few large values, and this also makes the distribution normal.
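The EDT computation described above (six times the time for the decay curve to reach −10 dB) can be sketched as follows. This is a simplified illustration: it assumes the decay curve is obtained by Schroeder backward integration and uses the first crossing of −10 dB rather than a line fit, which may differ from the procedure in [47]. The function name is hypothetical.

```python
import numpy as np


def early_decay_time(h, fs):
    """Early decay time in seconds from an impulse response h sampled at fs."""
    energy = np.asarray(h, dtype=float) ** 2
    # Schroeder backward integration: energy remaining after each sample.
    edc = np.cumsum(energy[::-1])[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-300)
    below = np.nonzero(edc_db <= -10.0)[0]
    if below.size == 0:
        raise ValueError("decay curve never reaches -10 dB")
    # Time to reach -10 dB, extrapolated to a 60 dB decay.
    return 6.0 * below[0] / fs
```

For an ideal exponential decay, the value returned matches the 60 dB decay time of the response, which gives a simple way to check the function.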
As the dependent variable is a binary classification into "too reverberant" or "not reverberant enough," a logistic regression is performed based on the measurements of relative reverb loudness and EDT, for each assignment to either category by a subject. Comparing this to a restricted model with only the relative reverb loudness (RRL) as a predictor variable, a statistically significant increase is seen in the model fit (likelihood ratio $-2\ln(L_\mathrm{RRL}/L_\mathrm{both}) = 7.749$, i.e., $p = .005$ on a $\chi^2$ distribution); that is, the EDT is indeed helpful in explaining the perception of the reverberation amount. The decision boundaries at .25, .50, and .75 are shown in Fig. 5, along with the .50 decision boundaries for the individual predictor variables.
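As a worked check of the reported significance, the p-value of the likelihood-ratio statistic can be computed directly. The sketch below assumes one degree of freedom, since the full model adds a single predictor (EDT) to the restricted model. The function names are hypothetical.

```python
import math


def likelihood_ratio_statistic(loglik_restricted, loglik_full):
    """-2 ln(L_restricted / L_full), computed from the two log-likelihoods."""
    return -2.0 * (loglik_restricted - loglik_full)


def chi2_p_value_1dof(statistic):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Statistic reported in the text.
p = chi2_p_value_1dof(7.749)
print(round(p, 3))
```

The closed form follows from the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable.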
Such a sharp transition between what is considered too reverberant and too dry again emphasizes the importance of careful adjustment of reverb parameters. This is further supported by the observation in [29] that masking causes reverberation audibility to decrease by 4 dB for every dB decrease in reverberant level. The differences in reverberation time between the different mixes are mostly of the order of the JND [18], as was the case with the differences in relative reverb loudness.

SUMMARY AND CONCLUSION
An experiment was conducted where 80 mixes were generated from 10 professional-grade music recordings by trained engineers in a familiar and commercially representative setting, which were then rated in multi-stimulus listening tests. Annotated subjective comments were analyzed to determine the importance of reverberation in the perception of mixes, as well as to classify mixes having too much or too little overall reverberation. This study is different from previous work in that it examines reverb in a relevant music production context, where reverb is applied to multiple tracks in varying degrees and types.
Although the perceptual evaluation experiment purposely did not mention reverberation as a feature to consider, it is commented on in 35% of the cases, confirming that differences in reverb use have a large impact on the perceived quality of a mix [44], as assessed by skilled listeners. Notwithstanding the less controlled nature of the study, variance in its findings is significantly narrower than in similar work, likely due in part to proficiency of participants in both the mix experiment and subsequent perceptual evaluation.
To a large extent, the relative reverb loudness gives a suitable indication of how audible or objectionable reverberation is. These subjective judgments are further predicted by considering reverb decay time, derived from a newly proposed Equivalent Impulse Response that captures reverberation characteristics for a mixture of sources with varying degrees and types of reverb. Both measures are suitable for real-time applications such as automated reverberators or assistive interfaces.
The results support the notion that a universally preferred amount of reverberation is unlikely to exist, but show that upper and lower bounds can be identified with reasonable confidence. The importance of careful parameter adjustment is evident from the limited range of acceptable feature values with regard to perceived amount of reverberation, when compared to the just-noticeable differences in both relative reverb loudness and the Equivalent Impulse Response's EDT. This study confirms previous findings that a perceived excess of reverberation typically has a more detrimental effect on subjective preference than when the reverberation level was indicated to be too low, suggesting it is better to err on the "dry" side.

FUTURE WORK
Future implementations should take into account how reverberant the "dry" signal is, particularly when the original tracks contain a significant amount of reverberation. Source separation or dereverberation could help separate the two for a more accurate estimation of the dry and wet sound.
A new dataset with mixes and perceptual evaluations from subjects of various backgrounds, locations, and levels of expertise (including laymen) is required in order to analyze the consistency of reverberation preferences across different populations.
Artificial reverberation is defined by far more attributes, objective and perceptual, than those covered in this work. Further features and parameters to consider include predelay [29], echo density [35], autocorrelation [32], and more sophisticated loudness features [2].
Finally, the data collected in this mix experiment and the subsequent perceptual evaluation can be used to study perception and use of other music production tools such as balance, EQ, and dynamic range compression. In the interest of reproducibility and to allow easy extension of this work, the source tracks, stereo mixes, DAW files, and extracted reverberant and dry signals were made available in the Open Multitrack Testbed [37] for the six songs licensed under a Creative Commons license.

ACKNOWLEDGMENTS
This work was made possible by the Engineering and Physical Sciences Research Council Grant EP/K009559/1 "Platform Grant: Digital Music," 2013-18. The authors also wish to thank Dominic Ward for a fruitful discussion on loudness models and related features.