Krisp Research Team, Author at Krisp
https://krisp.ai/blog/author/research-team/

Speech Enhancement On The Inbound Stream: Challenges
https://krisp.ai/blog/speech-enhancement-on-the-inbound-stream-challenges/ | Tue, 10 Oct 2023
Introduction

In today’s digital age, online communication is essential in our everyday lives. Many of us are engaged in numerous online meetings and telephone conversations throughout the day. Due to the nature of today’s work, some of us end up taking calls or attending meetings in less-than-ideal environments such as cafes, parks, or even cars. One persistent issue during these virtual interactions is background noise coming from the other end of the call. This interference can lead to significant misunderstandings and disruptions, impairing comprehension of the speaker’s intended message.

 

To effectively address the problem of unwanted noise in online communication, an ideal solution would apply noise cancellation on each user’s end, specifically on the signal captured by their microphone. This would eliminate background sounds for each respective speaker, enhancing intelligibility and maximizing the overall online communication experience. In reality, however, not everyone has noise cancellation software on their end, which means they may not sound clear. This is where inbound noise cancellation comes into play.

 

When delving into the topic of noise cancellation, it’s not uncommon to see terms such as speech enhancement, noise reduction, noise suppression, and speech separation used in a similar context.

 

In this article, we will go over inbound speech enhancement and the main challenges of applying it to online communication. But first, let’s clarify the difference between the inbound and the outbound streams in speech enhancement terminology.

 

The difference in speech enhancement on the inbound and the outbound streams explained

 

The terms inbound and outbound, as we use them in the context of speech enhancement, simply refer to the point in the communication pipeline at which the enhancement takes place.

 

Speech Enhancement On Outbound Streams

In the case of speech enhancement on the outbound stream, the algorithms are applied at the sending end after capturing the sound from the microphone but before transmitting the signal. In one of our previously published articles on Speech Enhancement, we focused mainly on the outbound use case. 

 

Speech Enhancement On Inbound Streams

In the case of speech enhancement on the inbound stream, the algorithms are applied at the receiving end – after receiving the signal from the network but before passing it to the actual loudspeaker/headphone. Unlike outbound speech enhancement, which preserves the speaker’s voice, inbound speech enhancement has to preserve the voices of multiple speakers, all while canceling background noise.

 

Together, these technologies revolutionize online communication, delivering an unparalleled experience for users around the world. 

 

The challenges of applying speech enhancement in online communication

 

Online communication is a diverse landscape, encompassing a wide array of scenarios, devices, and network conditions. As we strive to enhance the quality of audio in both outbound and inbound communication, we encounter several challenges that transcend these two realms.

 

Wide Range of Devices Support

Online communication takes place across an extensive spectrum of devices. The microphones in those devices range from built-in laptop microphones to high-end headphones, webcams with integrated microphones, and external microphone setups. Each of these microphones may have varying levels of quality and sensitivity. Ensuring that speech enhancement algorithms can adapt to and optimize audio quality across this device spectrum is a significant challenge. Moreover, the user experience should remain consistent regardless of the hardware in use.

 

Wide Range of Bandwidths Support

One critical aspect of audio quality is the signal bandwidth. Different devices and audio setups may capture and transmit audio at varying signal bandwidths. Some may capture a broad range of frequencies, while others may have limited bandwidth. Speech enhancement algorithms must be capable of processing audio signals across this spectrum efficiently. This includes preserving the essential components of the audio signal while adapting to the limitations or capabilities of the particular bandwidth, ensuring that audio quality remains as high as possible.
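To make this concrete, below is a minimal sketch of one way to emulate band-limited capture for training-data augmentation: the signal is downsampled and then brought back to its original rate, so it keeps its sample rate and length but loses high-frequency content. The sample rates, function name, and the SciPy-based implementation are illustrative assumptions, not a description of Krisp’s actual pipeline.

```python
import numpy as np
from scipy.signal import resample_poly

def narrowband_version(audio, orig_sr=48000, target_sr=8000):
    """Simulate a band-limited capture: downsample, then upsample back to orig_sr.

    The returned signal has (roughly) the original length and sample rate,
    but everything above target_sr / 2 (here 4 kHz) is gone.
    """
    down = resample_poly(audio, target_sr, orig_sr)  # band-limit + decimate
    return resample_poly(down, orig_sr, target_sr)   # restore the original rate
```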

 

Strict CPU Limitations

Online communication is accessible to a broad audience, including individuals with low-end PCs or devices with limited processing power. Balancing the demand for advanced speech enhancement with these strict CPU limitations is a delicate task. Engineers must create algorithms that are both computationally efficient and capable of running smoothly on a range of hardware configurations. 

 

Large Variety of Noises Support

Background noise in online communication can vary from simple constant noise, like the hum of an air conditioner, to complex and rapidly changing noises. Speech enhancement algorithms must be robust enough to identify and suppress a wide variety of noise sources effectively. This includes distinguishing between speech and non-speech sounds, as well as addressing challenges posed by non-stationary noises, such as music or babble noise.

 

The challenges specific to inbound speech enhancement

 

As we have discussed, inbound speech enhancement is a critical component of online communication, focused on improving the quality of the audio received by users. However, this task comes with its own set of intricate challenges that demand innovative solutions. Here, we delve into the unique challenges faced when enhancing incoming audio streams.

 

Multiple Speakers’ Overlapping Speech Support

One of the foremost challenges in inbound speech enhancement is dealing with multiple speakers whose voices overlap. In scenarios like group video calls or virtual meetings, participants may speak simultaneously, making it challenging to distinguish individual voices. Effective inbound speech enhancement algorithms must not only reduce background noise but also keep the overlapping speech untouched, ensuring that every participant’s voice is clear and discernible to the listener.

 

Diversity of Users’ Microphones

Online communication involves an extensive range of devices used by the different speakers on a call. Users may connect via mobile phones, car audio systems, landline phones, or a multitude of other devices. Each of these devices can have distinct characteristics, microphone quality, and signal processing capabilities. Ensuring that inbound speech enhancement works seamlessly across this diverse array of devices is a complex challenge that requires robust adaptability and optimization.

 

Wide Variety of Audio Codecs Support

Audio codecs are used to compress and transmit audio data efficiently over the internet. However, different conferencing applications and devices may employ various audio codecs, each with its own compression techniques and quality levels. Inbound speech enhancement must be codec-agnostic, capable of processing audio streams regardless of the codec in use, to ensure consistently high audio quality for users.

 

Software Processing of Various Conferencing Applications Support

Online communication occurs through a multitude of conferencing applications, each with its unique software processing and audio transmission methods. Optimally, inbound speech enhancement should be engineered to seamlessly integrate with any of these diverse applications while maintaining an uncompromised level of audio quality. This requirement is independent of any audio processing technique utilized in the application. These processes can span a wide spectrum from automatic gain control to proprietary noise cancellation solutions, each potentially introducing different levels and types of audio degradation.

 

Internet Issues and Lost Packets Support

Internet connectivity is prone to disruptions, leading to variable network conditions and packet loss. Inbound speech enhancement faces the challenge of coping with these issues gracefully. The algorithm must be capable of maintaining audio quality when audio packets are lost. The ideal solution would even mitigate the damage caused by poor networks through advanced Packet Loss Concealment (PLC) algorithms.
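As an illustration of how such conditions can be reproduced offline, here is a hedged sketch that drops random frames from a signal to mimic lost packets, with silence as the naive “concealment”. The frame length, loss rate, and function name are hypothetical choices made only for this example.

```python
import numpy as np

def simulate_packet_loss(audio, sample_rate, frame_ms=20, loss_rate=0.1, rng=None):
    """Zero out random fixed-size frames to mimic dropped network packets.

    Silencing a frame is the crudest form of concealment; a PLC algorithm
    would instead try to reconstruct the missing audio.
    """
    rng = np.random.default_rng() if rng is None else rng
    frame_len = int(sample_rate * frame_ms / 1000)
    damaged = audio.copy()
    for start in range(0, len(damaged), frame_len):
        if rng.random() < loss_rate:              # this "packet" is lost
            damaged[start:start + frame_len] = 0.0
    return damaged
```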

 

How to face Inbound use case challenges?

 

As mentioned in the Speech Enhancement blog post, creating a high-quality speech enhancement model requires deep learning methods. Moreover, in the case of inbound speech enhancement, we need to apply more diverse data augmentation reflecting real-life inbound scenarios in order to obtain high-quality training data for the deep learning model. Now, let’s delve into some data augmentations specific to inbound scenarios.

 

Modeling of multiple speaker calls

In the quest for an effective solution to inbound speech enhancement, one crucial aspect is the modeling of multi-speaker and multi-noise scenarios. In an ideal approach, the system employs sophisticated audio augmentation techniques to generate audio mixes (see the previous article) that consist of multiple voice and noise sources. These sources are thoughtfully designed to simulate real-world scenarios, particularly those involving multiple speakers speaking concurrently, common in virtual meetings and webinars.

 

Through meticulous modeling, each audio source is imbued with distinct acoustic characteristics, capturing the essence of different environments and devices. These scenarios are carefully crafted to represent the challenges of online communication, where users encounter a dynamic soundscape. By training the system with these diverse multi-speaker and multi-noise mixes, it gains the capability to adeptly distinguish individual voices and suppress background noise.
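A minimal sketch of such a multi-speaker mix, assuming NumPy and pre-aligned, equal-length source signals, might look like the following; the SNR value and helper name are illustrative and not the actual augmentation used in production.

```python
import numpy as np

def mix_multi_speaker(speech_signals, noise, snr_db=5.0):
    """Sum several equal-length speech signals and add noise at a target SNR.

    Returns (noisy mix, clean multi-speaker sum); the clean sum is the training
    target, since all speakers' voices must be preserved on the inbound stream.
    """
    clean = np.sum(np.stack(speech_signals), axis=0)      # overlapping talkers
    speech_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise, clean
```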

 

Modeling of diverse acoustic conditions

Building inbound speech enhancement requires more than just understanding the mix of voices and noise; it also involves accurately representing the acoustic properties of different environments. In the suggested solution, this is achieved through Room Impulse Response (RIR) and Infinite Impulse Response (IIR) modeling.

 

RIR modeling involves applying filters to mimic the way sound reflects and propagates in a room environment. These filters capture the unique audio characteristics of different environments, from small meeting rooms to bustling cafes. Simultaneously, IIR filters are meticulously designed to match the specific characteristics of different microphones, replicating their distinct audio signatures. By applying these filters, the system ensures that audio enhancement remains realistic and adaptable across a wide range of settings, further elevating the inbound communication experience.
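For illustration, the sketch below convolves a clean signal with a RIR and then applies an IIR filter standing in for a microphone’s frequency response. The band-pass design used for the “microphone” is an arbitrary example (any IIR design would do); the whole snippet is a simplified stand-in rather than the exact augmentation chain.

```python
import numpy as np
from scipy.signal import butter, fftconvolve, lfilter

def simulate_room_and_mic(clean, rir, sample_rate=16000):
    """Convolve speech with a room impulse response, then color it with an IIR
    filter standing in for a microphone's frequency response."""
    reverberant = fftconvolve(clean, rir)[: len(clean)]        # room acoustics
    # Example "microphone": a 2nd-order band-pass between 300 Hz and 3.4 kHz
    b, a = butter(2, [300, 3400], btype="bandpass", fs=sample_rate)
    return lfilter(b, a, reverberant)
```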

 

Versatility Through Codec Adaptability

An essential aspect of an ideal solution for inbound speech enhancement is versatility in codec support. In online communication, various conferencing applications employ a range of audio codecs for data compression and transmission. These codecs can vary significantly in compression efficiency and audio quality, from classic narrowband VoIP codecs like G.711 and G.729 to modern high-quality codecs such as Opus and SILK.

 

To offer an optimal experience, the solution should be codec-agnostic, capable of seamlessly processing audio streams regardless of the codec in use. To meet this goal, we pass raw audio signals through various audio codecs as a codec augmentation of the training data.
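One simple way to approximate this codec augmentation offline is to round-trip audio through a real encoder. The sketch below shells out to ffmpeg with the libopus encoder and reads the decoded result back; it assumes ffmpeg (built with libopus) and the soundfile package are available, and the bitrate is an arbitrary example value.

```python
import os
import subprocess
import tempfile

import soundfile as sf  # any WAV reader/writer would do


def opus_roundtrip(wav_in, bitrate="16k", sample_rate=16000):
    """Encode a WAV file with Opus via ffmpeg and decode it back to PCM,
    emulating what a VoIP codec does to the signal before it reaches the listener."""
    with tempfile.TemporaryDirectory() as tmp:
        encoded = os.path.join(tmp, "enc.ogg")
        decoded = os.path.join(tmp, "dec.wav")
        subprocess.run(
            ["ffmpeg", "-y", "-i", wav_in, "-c:a", "libopus", "-b:a", bitrate, encoded],
            check=True, capture_output=True)
        subprocess.run(
            ["ffmpeg", "-y", "-i", encoded, "-ar", str(sample_rate), decoded],
            check=True, capture_output=True)
        audio, sr = sf.read(decoded)
        return audio, sr
```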

 

Summary

Inbound speech enhancement refers to the software processing that occurs at the listener’s end in online communication. This task comes with several challenges, including handling multiple speakers in the same audio stream, adapting to diverse acoustic environments, and accommodating various devices and software preferences used by users. In this discussion, we explored a series of augmentations that can be integrated into a neural network training pipeline to address these challenges and offer an optimal solution.

 

Try next-level audio and voice technologies  

Krisp licenses its SDKs to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.


References

  1. Speech Enhancement Review: Krisp Use Case. Krisp Engineering Blog
  2. Microphones. Wiley Telecom
  3. Bandwidth (signal processing). Wikipedia
  4. Babble Noise: Modeling, Analysis, and Applications. Nitish Krishnamurthy, John H. L. Hansen
  5. Audio codec. Wikipedia
  6. Deep Noise Suppression (DNS) Challenge: Datasets
  7. Can You Hear a Room?. Krisp Engineering Blog
  8. Infinite impulse response. Wikipedia
  9. Pulse code modulation (PCM) of voice frequencies. ITU-T Recommendations
  10. Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP). ITU-T Recommendations
  11. Opus Interactive Audio Codec
  12. SILK codec(v1.0.9). Ploverlake

The article is written by:

  • Ruben Hasratyan, MBA, BSc in Physics, Staff ML Engineer, Tech Lead
  • Stepan Sargsyan, PhD in Mathematical Analysis, ML Architect, Tech Lead

Applying the Seven Testing Principles to AI Testing
https://krisp.ai/blog/applying-the-seven-testing-principles-to-ai-testing/ | Mon, 28 Aug 2023
Introduction

With the constant growth of artificial intelligence, numerous algorithms and products are being introduced into the market on a daily basis, each with its unique functionality in various fields. Like software systems, AI systems also have their distinct lifecycle, and quality testing is a critical component of it. However, in contrast to software systems, where quality testing is focused on verifying the desired behavior generated by the logic, the primary objective in AI system quality testing is to ensure that the logic is in line with the expected behavior. Hence, the testing object in AI systems is the logic itself rather than the desired behavior. Also, unlike traditional software systems, AI systems necessitate careful consideration of tradeoffs, such as the delicate balance between overall accuracy and CPU usage.

Despite their differences, the seven principles of software testing outlined by the ISTQB (International Software Testing Qualifications Board) can readily be applied to AI systems. This article will examine each principle in depth and explore its relevance to the testing of AI systems. We will base this exposition on Krisp’s Voice Processing algorithms, which incorporate a family of pioneering technologies aimed at enhancing voice communication. Each of Krisp’s voice-related algorithms, built on machine learning paradigms, is amenable to the general testing principles we discuss below. However, for the sake of illustration and as our reference point, we will use Krisp’s Noise Cancellation technology (referred to as NC). This technology is based on a cutting-edge algorithm that operates on the device in real time and is renowned for its ability to enhance the audio and video call experience by effectively eliminating disruptive background noises. The algorithm’s efficacy can be influenced by diverse factors, including usage conditions, environments, device specifications, and encountered scenarios. Rigorous testing of all these parameters is therefore indispensable, accomplished by employing a diverse range of test datasets and conducting both subjective and objective evaluations, all while keeping in mind the seven testing principles discussed below.

1. Testing shows the presence of defects, not their absence

This principle underscores that despite the effort invested in finding and fixing errors in software, it cannot guarantee that the product will be entirely defect-free. Even after comprehensive testing, defects can still persist in the system. Hence, it is crucial to continually test and refine the system to minimize the risk of defects that may cause harm to users or the organization.


This is particularly significant in the context of AI systems, as AI algorithms are in general not expected to achieve 100% accuracy. For instance, testing may reveal that an underlying AI-based algorithm, such as the NC algorithm, has issues when canceling out white noises. Even after addressing the problem for observed scenarios and improving accuracy, there is no guarantee that the issue will never resurface in other environments and be detected by users. Testing only serves to reduce the likelihood of such occurrences.

2. Exhaustive testing is impossible

The second testing principle highlights that achieving complete test coverage by testing every possible combination of inputs, preconditions, and environmental factors is not feasible. This challenge is even more profound in AI systems since the number of possible combinations that the algorithm needs to cover is often infinite.

For instance, in the case of the NC algorithm, testing every microphone device, in every condition, with every existing noise type would be an endless endeavor. Therefore, it is essential to prioritize testing based on risk analysis and business objectives. For example, testing should focus on ensuring that the algorithm does not cause voice cutouts with the most popular office microphone devices in typical office conditions.

By prioritizing testing activities based on identified risks and critical functionalities, testing efforts can be optimized to ensure that the most important defects are detected and addressed first. This approach helps to balance the need for thorough testing with the practical realities of finite time, resources, and costs associated with testing.

3. Early testing saves time and money

This principle emphasizes the importance of conducting testing activities as early as possible in the development lifecycle, including AI products. Early testing is a cost-effective approach to identifying and resolving defects in the initial phases, reducing expenses and time, whereas detecting errors at the last minute can be the most expensive to rectify.

Ideally, testing should start during the algorithm’s research phase, way before it becomes an actual product. This can help to identify possible undesirable behavior and prevent it from propagating to the later stages. This is especially crucial since the AI algorithm training process can take weeks if not months. Detecting errors in the late stages can lead to significant release delays, impacting the project’s timeline and budget.


Testing the algorithm’s initial experimental version can help to identify stability concerns and reveal any limitations that require further refinement. Furthermore, early testing helps to validate whether the selected solution and research direction are appropriate for addressing the problem at hand. Additionally, analyzing and detecting patterns in the algorithm’s behavior assists in gaining insights into the research trajectory and potential issues that may arise.

In the context of the NC project, early testing can help evaluate whether the algorithm behaves consistently across different scenarios, such as whether it retains considerable noise after a prolonged period of silence. Finding and addressing such patterns at an early stage ensures a smoother and more efficient resolution.

4. Defects cluster together

The principle of defect clustering suggests that software defects and errors are not distributed randomly but instead tend to cluster or aggregate in certain areas or components of the system. This means that if multiple errors are found in a specific area of the software, it is highly likely that there are more defects lurking in that same region. This principle also applies to AI systems, where the discovery of logically undesired behavior is a red flag that future tests will likely uncover similar issues.

For example, suppose a noise cancellation algorithm struggles to cancel TV fuzz noise when presented alongside speech. In that case, further tests may reveal the same problem when the algorithm is faced with background noise from an air conditioner. Although both noises may sound alike, careful observation of their visual frequency representations (spectrograms) clearly highlights the noticeable distinction.

Spectrograms of TV fuzz noise and air conditioner (AC) noise. Credits: https://www.youtube.com/

 

Identifying high-risk areas that could potentially damage user experience is another way that the principle of defect clustering can be beneficial. During the testing process, these areas can be given top priority. Using the same noise cancellation algorithm as an example, the highest-risk area is the user’s speech quality, and voice-cutting issues are considered the most critical defects. If the algorithm exhibits voice-cutting behavior in one scenario, it is possible that it may also cause voice degradation in other scenarios.

5. Beware of the pesticide paradox

The pesticide paradox is a well-known phenomenon in pest control, where applying the same pesticide to pests can lead to immunity and resistance. The same principle applies to software and AI testing, where the “pesticide” is testing, and the “pests” are software bugs. Running the same set of tests repeatedly may lead to the diminishing effectiveness of those tests as both AI and software systems adapt and become immune to them.


To avoid the pesticide paradox in software and AI testing, it’s essential to use a range of testing techniques and tools and to update tests regularly. When existing test cases no longer reveal new defects, it’s time to implement new cases or incorporate new techniques and tools.

Similarly, AI systems must be constantly monitored for their own version of the pesticide paradox, which manifests as “overfitting” to the existing test data. Although an AI system can practically never be entirely free of bugs, it is essential to update test datasets and cases, and possibly introduce new metrics or tools, once the system can easily handle the existing tests.

Take the example of a noise cancellation algorithm that initially worked almost flawlessly on audio with low-intensity noise. To enhance the algorithm’s quality and discover new bugs, new datasets were created, featuring higher-intensity noises that reached speech loudness.

By avoiding the pesticide paradox in both software and AI testing, we can protect our systems and ensure they remain effective in their respective fields.

6. Testing is context dependent

Testing methods must be tailored to the specific context in which they will be implemented. This means that testing is always context-dependent, with each software or AI system requiring unique testing approaches, techniques, and tools.

In the realm of AI, the context includes a range of factors, such as the system’s purpose, the characteristics of the user base, whether the system operates in real-time or not, and the technologies utilized.

Context is of paramount importance when it comes to developing an effective testing strategy, techniques, and tools for the NC algorithms. Part of this context involves understanding the needs of the users, which can assist in selecting appropriate testing tools such as microphone devices. As a result, an understanding of context enables the reproduction and testing of users’ real-life scenarios.

By considering context, it becomes possible to create better-designed tests that accurately represent actual use cases, leading to more reliable and accurate results. Ultimately, this approach can ensure that testing is more closely aligned with real-world situations, increasing the overall effectiveness and utility of the NC algorithm.

7. Absence-of-errors is a fallacy

The notion that the absence of errors can be achieved through complete test coverage is a fallacy. In some cases, there may be unrealistic expectations that every possible test should be run and every potential case should be checked. However, the principles of testing remind us that achieving complete test coverage is practically impossible. Moreover, it is incorrect to assume that complete test coverage guarantees the success of the system. For example, the system may function flawlessly but still be difficult to use and fail to meet the needs and requirements of users from the product perspective. This applies to AI systems as well.

Consider, for instance, an NC algorithm that has excellent quality: it cancels out noise and maintains speech clarity, making it virtually free of defects. However, if it is too complex and too large, it may be useless as a product.

Conclusion

To sum up, AI systems have unique lifecycles and require a different approach to testing than traditional software systems. Nevertheless, the seven principles of software testing outlined by the ISTQB can be applied to AI systems. Applying these principles can help identify critical defects, prioritize testing efforts, and optimize testing activities to ensure that the most important defects are detected and addressed first. Across all of Krisp’s algorithms, adhering to these principles has led to improved accuracy, higher quality, and increased reliability and safety of the AI systems in various fields.

 

Try next-level audio and voice technologies  

Krisp licenses its SDKs to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.


References

  1. ISTQB CTFL Syllabus v4.0

The article is written by:

  • Tatevik Yeghiazaryan, BSc in Software Engineering, Senior QA Engineer

Can You Hear a Room?
https://krisp.ai/blog/can-you-hear-a-room/ | Thu, 06 Jul 2023
Introduction

Sound propagation has distinctive features associated with the environment where it happens. Human ears can often clearly distinguish whether a given sound recording was produced in a small room, a large room, or outdoors. One can even get a sense of the direction of, or the distance from, the sound source by listening to a recording. These characteristics are defined by the objects around the listener or the recording microphone, such as the size and material of the walls in a room, furniture, people, etc. Every object has its own sound reflection, absorption, and diffraction properties, and together they define the way a sound propagates, reflects, attenuates, and reaches the listener.

 

In acoustic signal processing, one often needs a way to model the sound field in a room with certain characteristics, in order to reproduce a sound in that specific setting, so to speak. Of course, one could simply go to that room, reproduce the required sound and record it with a microphone. However, in many cases, this is inconvenient or even infeasible.

 

For example, suppose we want to build a Deep Neural Net (DNN)-based voice assistant in a device with a microphone that receives pre-defined voice commands and performs actions accordingly. We need to make our DNN model robust to various room conditions. To this end, we could arrange many rooms with various conditions, reproduce/record our commands in those rooms, and feed the obtained data to our model. Now, if we decide to add a new command, we would have to do all this work once again. Other examples are Virtual Reality (VR) applications or architectural planning of buildings where we need to model the acoustic environment in places that simply do not exist in reality.    

 

In the case of our voice assistant, it would be beneficial to be able to encode and digitally record the acoustic properties of a room in some way so that we could take any sound recording and “embed” it in the room by using the room “encoding”. This would free us from physically accessing the room every time we need it. In the case of VR or architectural planning applications, the goal then would be to digitally generate a room’s encoding only based on its desired physical dimensions and the materials and objects contained in it.

 

Thus, we are looking for a way to capture the acoustic properties of a room in a digital record, so that we can reproduce any given audio recording as if it was played in that room. This would be a digital acoustic model of the room representing its geometry, materials, and other things that make us “hear a room” in a certain sense.

 

What is RIR?

 

Room impulse response (RIR for short) is something that does capture room acoustics, to a large extent. A room with a given sound source and a receiver can be thought of as a black-box system. Upon receiving a sound signal emitted by the source at its input, the system transforms it and outputs whatever is received at the receiver. The transformation corresponds to the reflections, scattering, diffraction, attenuation, and other effects that the signal undergoes before reaching the receiver. Impulse response describes such systems under the assumption of time-invariance and linearity. In the case of RIR, time-invariance means that the room is in a steady state, i.e., the acoustic conditions do not change over time. For example, a room with people moving around, or a room where outside noise can be heard, is not time-invariant, since the acoustic conditions change with time. Linearity means that if the input signal is a scaled superposition of two other signals, x and y, then the output signal is the similarly scaled superposition of the output signals corresponding to x and y individually. Linearity holds with sufficient fidelity in most practical acoustic environments (while time-invariance can be achieved in a controlled environment).

 

Let us take a digital approximation of a sound signal. It is a sequence of discrete samples, as shown in Fig. 1.


Fig. 1 The waveform of a sound signal.

 

Each sample is a positive or negative number that corresponds to the degree of instantaneous excitation of the sound source, e.g., a loudspeaker membrane, as measured at discrete time steps. It can be viewed as an extremely short sound, or an impulse. The signal can thus be approximately viewed as a sequence of scaled impulses. Now, given time-invariance and linearity of the system, some mathematics shows that the effect of a room-source-receiver system on an audio signal can be completely described by its effect on a single impulse, which is usually referred to as  an impulse response. More concretely, impulse response is a function h(t) of time t > 0 (response to a unit impulse at time t = 0) such that for an input sound signal x(t), the system’s output is given by the convolution between the input and the impulse response. This is a mathematical operation that, informally speaking, produces a weighted sum of the delayed versions of the input signal where weights are defined by the impulse response. This reflects the intuitive fact that the received signal at time t is a combination of delayed and attenuated values of the original signal up to time t, corresponding to reflections from walls and other objects, as well as scattering, attenuation and other acoustic effects.
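In discrete time this convolution reads y[n] = sum_k h[k] x[n-k], and it can be computed directly with off-the-shelf tools. The snippet below is a small illustration using SciPy’s FFT-based convolution; the normalization step is just a practical safeguard against clipping, not part of the mathematical definition.

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_rir(dry, rir):
    """Place a dry (anechoic) recording into a room by convolving it with the
    room's impulse response: y[n] = sum_k rir[k] * dry[n - k]."""
    wet = fftconvolve(dry, rir)                 # FFT-based convolution
    return wet / (np.max(np.abs(wet)) + 1e-12)  # normalize to avoid clipping
```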

 

For example, in the recordings below, one can see the RIR recorded by a clapping sound (see below), an anechoic recording of singing, and their convolution.

 

Audio samples: the RIR (recorded with a clapping sound), an anechoic singing recording, and their convolution (the singing as heard in the room).

 

It is often useful to consider sound signals in the frequency domain, as opposed to the time domain. It is known from Fourier analysis that every well-behaved periodic function can be expressed as a sum (infinite, in general) of scaled sinusoids. The sequence of the (complex) coefficients of the sinusoids within the sum, the Fourier coefficients, provides another, yet equivalent, representation of the function. In other words, a sound signal can be viewed as a superposition of sinusoidal sound waves or tones of different frequencies, and the Fourier coefficients show the contribution of each frequency in the signal. For finite sequences such as digital audio, which are of practical interest, such decompositions into periodic waves can be efficiently computed via the Fast Fourier Transform (FFT).

 

For non-stationary signals such as speech and music, it is more instructive to do analysis using the short-time Fourier transform (STFT). Here, we split the signal into short equal-length segments and compute the Fourier transform for each segment. This shows how the frequency content of the signal evolves with time (see Fig. 2). That is, while the signal waveform and Fourier transform give us only time and only frequency information about the signal (although one being recoverable from another), the STFT provides something in between.

 


Fig. 2 Spectrogram of a speech signal.

 

A visual representation of an STFT, such as the one in Fig. 2, is called a spectrogram. The horizontal and vertical axes show time and frequency, respectively, while the color intensity represents the magnitude of the corresponding Fourier coefficient on a logarithmic scale (the brighter the color, the larger is the magnitude of the frequency at the given time).
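For illustration, a spectrogram like the one above can be computed in a few lines with SciPy’s STFT. The window length, overlap, and the synthetic test signal below are arbitrary example choices; real speech would be loaded from a file instead.

```python
import numpy as np
from scipy.signal import stft

fs = 16000                                   # sample rate in Hz
rng = np.random.default_rng(0)
x = rng.standard_normal(2 * fs)              # stand-in for a speech signal

# 512-sample (32 ms) windows with 75% overlap
freqs, times, Z = stft(x, fs=fs, nperseg=512, noverlap=384)
spectrogram_db = 20 * np.log10(np.abs(Z) + 1e-10)   # log-magnitude, as plotted in Fig. 2
print(spectrogram_db.shape)                  # (frequency bins, time frames)
```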

 

Measurement and Structure of RIR

 

In theory, the impulse response of a system can be measured by feeding it a unit impulse and recording whatever comes at the output with a microphone. Still, in practice, we cannot produce an instantaneous and powerful audio signal. Instead, one could record RIR approximately by using short impulsive sounds. One could use a clapping sound, a starter gun, a balloon popping sound, or the sound of an electric spark discharge.


Fig. 3 The spectrogram and waveform of a RIR produced by a clapping sound.

 

The results of such measurements (see, for example, Fig. 3) may not be sufficiently accurate for a particular application, due to the error introduced by the structure of the input signal. An ideal impulse, in the mathematical sense, has a flat spectrum, that is, it contains all frequencies with equal magnitude. The impulsive sounds above usually deviate significantly from this property. Measurements with such signals may also be poorly reproducible. Alternatively, a digitally created impulsive sound with the desired characteristics could be played with a loudspeaker, but the power of the signal would still be limited by the speaker characteristics. Among other limitations of measurements with impulsive sounds are: particular sensitivity to external noise (from outside the room), sensitivity to nonlinear effects of the recording microphone or emitting speaker, and the directionality of the sound source.

 

Fortunately, there are more robust methods of measuring room impulse response. The main idea behind these techniques is to play a transformed impulsive sound with a speaker, record the output, and apply an inverse transform to recover impulse response. The rationale is, since we cannot play an impulse as it is with sufficient power, we “spread” its power across time, so to speak, while maintaining the flat spectrum property over a useful range of frequencies. An example of such a “stretched impulse” is shown in Fig. 4.

 


Fig. 4 A “stretched” impulse.

 

Other variants of such signals are Maximum Length Sequences and the Exponential Sine Sweep. An advantage of measurement with such non-localized and reproducible test signals is that ambient noise and microphone nonlinearities can be effectively averaged out. There are also some technicalities that need to be dealt with, for example, the need to synchronize the emitting and recording ends, to ensure that the test signal covers the whole length of the impulse response, and to perform deconvolution, that is, to apply an inverse transform for recovering the impulse response.

 

The waveform in Fig. 5 shows another measured RIR. The initial spike at 0-3 ms corresponds to the direct sound that has arrived at the microphone along a direct path. The smaller spikes following it, starting from about 3-5 ms after the first spike, clearly show several early specular reflections. After about 80 ms there are no distinctive specular reflections left, and what we see is the late reverberation, or the reverberant tail, of the RIR.

 


Fig. 5 A room impulse response. Time is shown in seconds.

 

While the spectrogram of a RIR may not seem very insightful beyond the remarks so far, there is some information one can extract from it. It shows, in particular, how the intensity of different frequencies decreases with time due to losses. For example, it is known that intensity loss due to air absorption (attenuation) is stronger for higher frequencies. At low frequencies, the spectrogram may exhibit distinct persistent frequency bands, room modes, that correspond to standing waves in the room. This effect can be seen below a certain frequency threshold that depends on the room geometry, the Schroeder frequency, which for most rooms is 20–250 Hz. Those modes are visible due to the lower density of resonant frequencies of the room near the bottom of the spectrum, with wavelengths comparable to the room dimensions. At higher frequencies, modes overlap more and more and are no longer distinctly visible.

 

RIR can also be used to estimate certain parameters associated with a room, the most well-known of them being the reverberation time, or RT60. When an active sound source in a room is abruptly stopped, it takes a longer or shorter time for the sound intensity to drop to a certain level, depending on the room’s geometry, materials, and other factors. In the case of RT60, the question is how long it takes for the sound energy density to decrease by 60 decibels (dB), that is, to one millionth of its initial value. As noted by Schroeder (see the references), the average signal energy at time t used for computing the reverberation time is proportional to the tail energy of the RIR, that is, the total energy after time t. Thus, we can compute RT60 by plotting the tail energy level of the RIR on a dB scale (with respect to the total energy). For example, the plot corresponding to the RIR above is shown in Fig. 6:


Fig. 6 The RIR tail energy level curve.

 

In theory, the RIR tail energy decay should be exponential, that is, linear on a dB scale; but, as can be seen here, it drops irregularly starting at about -25 dB. This is due to RIR measurement limitations. In such cases, one restricts attention to the linear part, normally between -5 dB and -25 dB, and obtains RT60 by fitting a line to this part of the decay curve, by linear regression, for example.
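The procedure just described can be sketched in a few lines: backward-integrate the squared RIR to get the decay curve (Schroeder integration), convert it to dB, and fit a line over the -5 dB to -25 dB range. The function below is an illustrative, simplified implementation that assumes the decay curve actually reaches -25 dB.

```python
import numpy as np

def rt60_from_rir(rir, fs, fit_range_db=(-5.0, -25.0)):
    """Estimate RT60 from an impulse response via Schroeder backward integration."""
    energy = np.asarray(rir, dtype=np.float64) ** 2
    edc = np.cumsum(energy[::-1])[::-1]              # tail energy after each sample
    edc_db = 10 * np.log10(edc / edc[0] + 1e-12)     # normalized decay curve in dB
    hi, lo = fit_range_db
    idx = np.where((edc_db <= hi) & (edc_db >= lo))[0]
    t = idx / fs
    slope, _ = np.polyfit(t, edc_db[idx], 1)         # decay rate in dB per second
    return -60.0 / slope                             # time to fall by 60 dB
```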

 

RIR Simulation

 

As mentioned in the introduction, one often needs to compute a RIR for a room with given dimensions and material specifications without physically building the room. One way of achieving this would be by actually building a scaled model of the room. Then we could measure the RIR by using test signals with accordingly scaled frequencies, and rescale the recorded RIR frequencies. A more flexible and cheaper way is through computer simulations, by building a digital model of the room and modeling sound propagation. Sound propagation in a room (or other media) is described with differential equations called wave equations. However, the exact solution of these equations is out of reach in most practical settings, and one has to resort to approximate methods for simulations.

 

While there are many approaches for modeling sound propagation, the most common simulation algorithms are based on either a geometrical simplification of sound propagation or element-based methods. Element-based methods, such as the Finite Element Method, rely on the numerical solution of the wave equations over a discretized space. For this purpose, the room space is approximated with a discrete grid or mesh of small volume elements. Accordingly, the functions describing the sound field (such as the sound pressure or density) are defined down to the level of a single volume element. The advantage of these methods is that they are more faithful to the wave equations and hence more accurate. However, the computational complexity of element-based methods grows rapidly with frequency, as higher frequencies require a higher mesh resolution (a smaller volume element size). For this reason, for wideband applications like speech, element-based methods are often used to model sound propagation only for low frequencies, say, up to 1 kHz.

 

Geometric methods, on the other hand, work in the time domain. They model sound propagation in terms of sound rays or particles whose intensity decreases with the squared path length from the source. As such, wave-specific interference between rays is abstracted away. Thus, rays effectively become sound energy carriers, with the sound energy at a point being computed as the sum of the energies of the rays passing through that point. Geometric methods give plausible results for not-too-low frequencies, e.g., above the Schroeder frequency. Below that, wave effects are more prominent (recall the remarks on room modes above), and geometric methods may be inaccurate.

 

The room geometry is usually modeled with polygons. Walls and other surfaces are assigned absorption coefficients that describe the fraction of incident sound energy that is reflected back into the room by the surface (the rest is “lost” from the simulation perspective). One may also need to model air absorption and sound scattering by rough surfaces whose features are not too small compared to the sound wavelengths.

 

Two well-known geometric methods are stochastic Ray Tracing and Image Source methods. In Ray Tracing, a sound source emits a (large) number of sound rays in random directions, also taking into account directivity of the source. Each ray has some starting energy. It travels with the speed of sound and reflects from the walls while losing energy with each reflection, according to the absorption coefficients of walls, as well as due to air absorption and other losses.

 


Fig. 7 Ray Tracing (only wall absorption shown).

 

The reflections are either specular (incident and reflected angles are equal) or scattering happens, the latter usually being modeled by a random reflection direction. The receiver registers the remaining energy, time and angle of arrival of each ray that hits its surface. Time is tracked in discrete intervals. Thus, one gets an energy histogram corresponding to the RIR with a bucket for each time interval. In order to synthesize the temporal structure of the RIR, a random Poisson-distributed sequence of signed unit impulses can be generated, which is then scaled according to the energy histogram obtained from simulation to give a RIR. For psychoacoustic reasons, one may want to treat different frequency bands separately. In this case, the procedure of scaling the random impulse sequence is done for band-passed versions of the sequence, then their sum is taken as the final RIR.
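A much-simplified sketch of that last synthesis step is shown below: a sparse sequence of random signed impulses is generated (a Bernoulli-per-sample approximation of a Poisson process) and scaled by the square root of the per-bin energy from the simulation. The per-band processing and other refinements mentioned above are omitted, and all parameter values are illustrative.

```python
import numpy as np

def rir_from_energy_histogram(energy_hist, fs, mean_pulses_per_sec=10000, rng=None):
    """Shape a random impulse sequence with a ray-tracing energy histogram.

    energy_hist holds one simulated energy value per output sample. A Poisson
    process of signed unit impulses is approximated here by a Bernoulli draw
    per sample; each surviving impulse is scaled by the square root of the
    energy in its time bin.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(energy_hist)
    impulses = (rng.random(n) < mean_pulses_per_sec / fs).astype(float)
    impulses *= rng.choice([-1.0, 1.0], size=n)
    return impulses * np.sqrt(np.asarray(energy_hist, dtype=float))
```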

The Image Source method models only specular reflections (no scattering). In this case, a reflected ray from a source towards a receiver can be replaced with rays coming from “mirror images” of the source with respect to the reflecting wall, as shown in Fig. 8.


Fig. 8 The Image Source method.

 

This way, instead of keeping track of reflections, we construct images of the source relative to each wall and consider straight rays from all sources (including the original one) to the receiver. These first order images cover single reflections. For rays that reach the receiver after two reflections, we construct the images of the first order images, call them second order images, and so on, recursively. For each reflection, we can also incorporate material absorption losses, as well as air absorption. The final RIR is constructed by considering each ray as an impulse that undergoes scaling due to absorption and distance-based energy losses, as well as a distance-based phase shift (delay) for each frequency component. Before that, we need to filter out invalid image sources for which the image-receiver path does not intersect the image reflection wall or is blocked by other walls.
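To make the geometry concrete, here is a toy sketch for an axis-aligned (“shoebox”) room: it builds the six first-order image sources and accumulates delayed, attenuated impulses into a crude RIR. Higher-order images, visibility checks, and frequency-dependent absorption are deliberately left out, and the constants are illustrative assumptions.

```python
import numpy as np

def first_order_images(src, room_dims):
    """Mirror a source across each wall of an axis-aligned room (6 images in 3-D)."""
    src = np.asarray(src, dtype=float)
    images = []
    for axis in range(3):
        lo = src.copy(); lo[axis] = -src[axis]                       # wall at coordinate 0
        hi = src.copy(); hi[axis] = 2 * room_dims[axis] - src[axis]  # opposite wall
        images += [lo, hi]
    return images

def crude_rir(sources, receiver, fs, length, c=343.0, absorption=0.3):
    """Accumulate delayed, attenuated impulses from (image) sources into an RIR."""
    h = np.zeros(length)
    receiver = np.asarray(receiver, dtype=float)
    for s in sources:
        dist = np.linalg.norm(np.asarray(s, dtype=float) - receiver)
        n = int(round(dist / c * fs))                     # propagation delay in samples
        if n < length:
            h[n] += (1.0 - absorption) / max(dist, 1e-3)  # 1/r spreading + one reflection
    return h
```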

While the Image Source method captures specular reflections, it does not model scattering, which is an important aspect of the late reverberant part of a RIR. It does not model wave-based effects either. More generally, each existing method has its advantages and shortcomings. Fortunately, the shortcomings of different approaches are often complementary, so it makes sense to use hybrid models that combine several of the methods described above. For modeling late reverberation, stochastic methods like Ray Tracing are more suitable, while they may be too imprecise for modeling the early specular reflections in a RIR. One could further rely on element-based methods like the Finite Element Method for modeling the RIR below the Schroeder frequency, where wave-based effects are more prominent.

Summary

Room impulse response (RIR) plays a key role in modeling acoustic environments. Thus, when developing voice-related algorithms, be it for voice enhancement, automatic speech recognition, or something else, here at Krisp we need to keep in mind that these algorithms must be robust to changes in acoustic settings. This is usually achieved by incorporating the acoustic properties of various room environments, as briefly discussed here, into the design of the algorithms. This provides our users with a seamless experience, largely independent of the room from which Krisp is being used: they don’t hear the room.

 

Try next-level audio and voice technologies  

Krisp licenses its SDKs to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.


References

  1. [Overview of room acoustics techniques] M. Vorländer, Auralization: Fundamentals of Acoustics, Modelling, Simulation, Algorithms and Acoustic Virtual Reality. Springer, 2008.
  2. [Overview of room acoustics techniques] H. Kuttruff, Room Acoustics (5th ed.). CRC Press, 2009.
  3. [Signals and systems, including some Fourier analysis] K. Deergha Rao, Signals and Systems. Birkhäuser Cham, 2018.
  4. [Exposition of simulation methods] D. Schröder, Physically Based Real-Time Auralization of Interactive Virtual Environments. PhD thesis, RWTH Aachen, 2011.
  5. [Maximum Length Sequences for RIR measurement] M. R. Schroeder, “Integrated-impulse Method for Measuring Sound Decay without using Impulses”. The Journal of the Acoustical Society of America, vol. 66, pp. 497–500, 1979.
  6. [Stretched impulse method for RIR measurement] N. Aoshima, “Computer-generated Pulse Signal applied for Sound Measurement”. The Journal of the Acoustical Society of America, vol. 69, no. 5, pp. 1484–1488, 1981.
  7. [Exponential Sine Sweep technique for RIR measurement] A. Farina, “Simultaneous Measurement of Impulse Response and Distortion with a Swept-sine Technique”. In Audio Engineering Society Convention 108, 2000.
  8. [Comparison of RIR measurement techniques] G. B. Stan, J. J. Embrechts, and D. Archambeau, “Comparison of different Impulse Response Measurement Techniques”. Journal of the Audio Engineering Society, vol. 50, pp. 249–262, 2002.
  9. [Schroeder Integration for RT60 calculation] M. R. Schroeder, “New Method of Measuring Reverberation Time”. The Journal of the Acoustical Society of America, vol. 37, no. 3, pp. 409–412, 1965.
  10. Room impulse response (RIR) datasets

 


The article is written by:

  • Tigran Tonoyan, PhD in Computer Science, Senior ML Engineer II
  • Hayk Aleksanyan, PhD in Mathematics, Architect, Tech Lead
  • Aris Hovsepyan, MS in Computer Science, Senior ML Engineer I

Active Noise Cancellation Technology vs AI-based Noise Cancellation Algorithms
https://krisp.ai/blog/active-noise-cancellation-technology-vs-ai-based-noise-cancellation-algorithms/ | Mon, 05 Jun 2023
With the rise of urbanization and technological advancements, noise pollution has become a prevalent issue in our modern society. However, technology has also provided us with solutions to tackle this problem. Two commonly used approaches are Active Noise Cancellation (ANC) technology, primarily used to suppress the sound of the surrounding environment so the wearer can hear better (think headphones on airplanes), and AI-based noise cancellation algorithms, which excel at filtering noise from microphones (the outbound stream) and headphones (the inbound stream) for real-time communications. These two solutions differ significantly. ANC is perceived directly in the ear, while microphone noise cancellation filters out noise from the surrounding environment that passes through a microphone, enabling clear voice communication during calls or recordings.

ANC typically involves physical mechanisms to block or reduce external noises. These mechanisms can be found in various devices such as noise-canceling headphones, earplugs, and even architectural designs of buildings. They utilize microphones to capture external sounds and generate sound waves that are 180 degrees out of phase with the incoming noise, effectively canceling out the unwanted sounds.

 

It’s important to note that ANC often requires specific hardware, while microphone noise-cancellation algorithms are software-based, making them adaptable to a variety of digital devices and applications. AI-based algorithms, in particular, are highly efficient, increasing their scalability and accessibility across different devices like wearables, smart speakers, and smartphones.

 

On the other hand, AI-based microphone noise-cancellation algorithms leverage the power of artificial intelligence and machine learning to digitally analyze and process sound waves in real time, effectively removing noise from the stream. These algorithms are commonly used in software applications like voice assistants, video conferencing tools and audio editing software.

 

One key difference between ANC and AI-based noise-cancellation algorithms lies in the flexibility and adaptability they offer. ANC systems typically have a fixed design optimized for specific environments or types of noise. For example, active noise-canceling headphones excel at reducing low-frequency noises such as airplane engine sounds or traffic noise but may be less effective at canceling out higher-frequency noises like human voices or barking dogs.

 

In contrast, AI-based noise-cancellation algorithms are highly adaptable and can be continuously updated and optimized through learning processes. They can adjust to different types of noises and environments, making them more versatile and effective in reducing a wider range of noises. For instance, AI-based noise-cancellation algorithms in voice assistants can adapt to different accents, background noises, and speaking styles, ensuring clear and accurate voice recognition.

Another significant difference is the real-time processing capability of AI-based noise-cancellation algorithms, making them particularly valuable in applications where immediate noise reduction is crucial. In contrast, active noise cancellation (ANC) technologies may introduce a slight time delay when generating anti-noise signals.

Human and noise touchpoints

 

To better differentiate noise cancellation tools, it’s essential to understand the types of distortion that can affect a listener’s sound experience.

 

Noises are categorized into two types: noises from the real world directly reaching our ears from the surrounding environment and noises from virtual environments that reach our ears through additional devices, such as mobile phones or headphones during conference calls.

 

Noise from the real world

 

There are currently two types of headphone noise cancellation available, which can be used individually or in combination.

 

ANC has become a common feature in the headphone industry. However, not all ANC implementations are created equal. The quality of active noise cancellation varies based on differences in reference microphones and post-processing techniques.

 

In simple terms, active noise cancellation adds an inverted copy of the incoming noise to the original sound wave so that the two cancel each other out, resulting in no audible signal. This works perfectly when both signals are identical except for one being inverted, leading to complete silence.
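This idealized cancellation is easy to demonstrate numerically: adding a wave to its inverted copy yields zeros. The example below uses a synthetic 120 Hz hum; in a real ANC system the anti-noise must be estimated and played back with near-zero latency, which is where the engineering difficulty lies.

```python
import numpy as np

fs = 48000
t = np.arange(fs) / fs                       # one second of audio
noise = 0.5 * np.sin(2 * np.pi * 120 * t)    # a 120 Hz hum (e.g., engine rumble)
anti_noise = -noise                          # same wave, 180 degrees out of phase
residual = noise + anti_noise                # what the ear would hear
print(np.max(np.abs(residual)))              # 0.0 in this idealized case
```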

 

ANC systems utilize microphones, which can be divided into two categories: feedforward and feedback microphones. 

 

Feedforward microphones are positioned on the outer part of the ear cups or earbuds and capture the sound waves before they enter the ear canal. They pick up environmental sounds and send them to the ANC processor, which produces a sound wave that is 180 degrees out of phase with the incoming sound, canceling it out.


Feedback microphones are positioned inside the ear cups or earbuds and capture sound that reaches the ear canal. They cancel out any sound waves that may have leaked through the seal of the ear cup or earbud. The ANC processor uses feedback microphones to adjust the sound wave produced by the feedforward microphones, enhancing the cancellation effect.

ANC systems can also utilize multiple microphones to further improve noise cancellation. For example, high-end ANC headphones may employ up to eight microphones, allowing for better distinction between desired and unwanted sounds, resulting in more accurate cancellation.

 

In summary, active noise cancellation systems rely on microphones to detect and cancel unwanted sounds. Multiple microphones enhance noise cancellation and provide a more enjoyable listening experience.


Here is a chart depicting different types of ANC microphones and their respective pros and cons:

Nowadays, ANC is not only used in headphones but also in cars and various home appliances. Vehicle systems capture noises using microphones or sensors placed at different parts of the car, such as acceleration and vibration sounds from the vehicle body. The captured noise is then processed by a Digital Signal Processor (DSP), which generates sound waves with opposite amplitudes to actively cancel out the noise.

Passive Noise Cancellation (Noise Isolation) is the simplest way to block noise that enters headphones from the surrounding world. It is the most widely used technology in headphone design due to its low cost. Passive cancellation techniques, also known as acoustic isolation systems, aim to prevent external sound from entering our auditory system. This is achieved by interposing insulating materials between the casing or external part of the earphone and the part attached to the ear.

Noise from virtual environments


Virtual environments are prone to background noise from various sources, including microphone feedback, background chatter, environmental sounds and audio artifacts from the virtual environment itself. These noises can disrupt the quality of virtual experiences, making it difficult for users to communicate effectively or fully immerse themselves in the virtual environment. However, advancements in artificial intelligence (AI) have opened up new possibilities for addressing this issue through AI-based noise cancellation technologies.


AI-based noise cancellation, such as Krisp’s Noise Cancellation, also referred to as AI-powered noise reduction or AI-enhanced noise suppression, utilizes AI algorithms to reduce or eliminate unwanted noise from audio signals in real time. Traditional noise cancellation methods may not be as effective in virtual environments due to the dynamic nature of virtual audio, which passes through networks and undergoes various filters and codecs. Virtual environments often involve multiple speakers with different types of microphones, sounds coming from different directions, and rapid changes.


Conclusion


In conclusion, both Active Noise Cancellation technologies and AI-based noise cancellation algorithms are effective solutions for tackling noise pollution. However, they differ in terms of flexibility, adaptability, real-time processing capabilities, and integration into different devices and applications. Having both solutions in place can make the workday more comfortable. As technology continues to evolve, we can expect further advancements in both approaches, providing us with more effective ways to reduce noise pollution and improve our quality of life.


Try next-level audio and voice technologies  

Krisp licenses its SDKs to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.




This article was written by Andranik Alajajyan, MSc in Nuclear Physics, Senior ML QA Engineer at Krisp.

The post Active Noise Cancellation Technology vs AI-based Noise Cancellation Algorithms appeared first on Krisp.

]]>
https://krisp.ai/blog/active-noise-cancellation-technology-vs-ai-based-noise-cancellation-algorithms/feed/ 0
Speech Enhancement Review: Krisp Use Case https://krisp.ai/blog/speech-enhancement-review-krisp-use-case/ https://krisp.ai/blog/speech-enhancement-review-krisp-use-case/#respond Mon, 13 Feb 2023 18:24:22 +0000 https://krisp.ai/blog/?p=9833 Imagine you have an important online meeting, and there is a lot of noise around you. Kids are playing, the dog is barking, the washing machine is running, a fan is turned on, there is construction happening nearby, and you need to join a call. More often than not, it is nearly impossible to stop […]

The post Speech Enhancement Review: Krisp Use Case appeared first on Krisp.

]]>
Imagine you have an important online meeting, and there is a lot of noise around you. Kids are playing, the dog is barking, the washing machine is running, a fan is turned on, there is construction happening nearby, and you need to join a call. More often than not, it is nearly impossible to stop the noise or find a quiet place. In such situations, we need special audio processing technology that can remove background noises to improve the quality of online meetings. 

This is one of the best applications of speech enhancement. Here we will discuss speech enhancement technology, give some historical background, review existing approaches, cover the challenges surrounding real-time communication, and explore how Krisp’s speech enhancement algorithm is an ideal solution.     

First, let’s define Speech Enhancement (SE). It improves the quality of a noisy speech signal by reducing or removing background noises (see Figure 1). The main goal is to improve the perceptual quality and intelligibility of speech distorted by noise.


Figure 1. Speech enhancement.

We sometimes find other terms used interchangeably with speech enhancement, such as noise cancellation (NC), noise reduction, noise suppression, and speech separation. 

There are lots of applications for speech enhancement algorithms, including:

  1. Voice communication, such as in conferencing apps, mobile phones, voice chats, and others. SE algorithms improve speech intelligibility for speakers in noisy environments, such as restaurants, offices, or crowded streets. 
  2. Improving other types of audio processing algorithms by making them more noise-robust. For instance, we can apply speech enhancement prior to passing a signal to systems like speech recognition, speaker identification, speech emotion recognition, voice conversion, etc. 
  3. Hearing aids. For those with hearing impairments, speech may be completely inaudible in noisy environments. Reducing noise increases intelligibility.

Traditional approaches

The first results of research centered around speech enhancement were obtained in the 1970s. Traditional approaches were based on statistical assumptions and mathematical modeling of the problem. Their solutions also depend largely on the application, noise types, acoustic conditions, signal-to-noise ratio, and the number of available microphones (channels). Let’s discuss them in more detail.

In general, we can divide speech enhancement algorithms into two types: multi-channel and single-channel (mono). 

The multi-channel case involves two or more microphones (channels). In this case, the extra channel(s) contain information on the noise signal and can help to reduce the noise signal in the primary channel. An example of such a method is adaptive noise filtering.

This technique uses a reference signal from the auxiliary (secondary) microphone as an input to an adaptive digital filter, which estimates the noise in the primary signal and cancels it out (see Figure 2). Unlike a fixed filter, the adaptive filter automatically adjusts its impulse response. The adjustment is based on the error in the output. Therefore, with the proper adaptive algorithm, the filter can smoothly readjust itself under changing conditions to minimize the error. Examples of adaptive algorithms are least mean squares (LMS) and recursive least squares (RLS).
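
As a rough illustration of this idea (a textbook-style sketch under simplified assumptions, not Krisp's technology), the following code implements a basic LMS adaptive noise canceller on a synthetic primary and reference signal:

```python
import numpy as np

# Illustrative LMS adaptive noise canceller.
def lms_cancel(primary, reference, num_taps=32, mu=0.01):
    """primary = speech + filtered noise, reference = raw noise pickup."""
    w = np.zeros(num_taps)                           # adaptive filter taps
    out = np.zeros(len(primary))
    for n in range(num_taps - 1, len(primary)):
        x = reference[n - num_taps + 1:n + 1][::-1]  # latest reference samples
        e = primary[n] - np.dot(w, x)                # error = enhanced sample
        w += mu * e * x                              # LMS coefficient update
        out[n] = e
    return out

# Synthetic demo: a 300 Hz tone stands in for speech, white noise for the noise.
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(2 * fs) / fs
speech = np.sin(2 * np.pi * 300 * t)
noise = rng.standard_normal(len(t))
primary = speech + np.convolve(noise, [0.6, 0.3, 0.1])[:len(t)]  # noise path
enhanced = lms_cancel(primary, noise)
```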


Figure 2. SE with adaptive noise filtering block diagram.

Another example of multi-channel speech enhancement is beamforming, which uses a microphone array to cancel out signals coming from directions other than the preferred source. Multi-channel speech enhancement can lead to promising results, but it requires several microphones and is technically difficult.
On the other hand, single-channel, or monaural speech enhancement, has a significant advantage because we don’t need to set up extra microphone(s). The algorithm takes input from only one microphone, which is a noisy audio signal representing a mixture of speech and noise, in order to remove unwanted noise. 

The rest of the article is devoted to the monaural case.

One of the first results in the monaural case is the spectral subtraction method. There are various methods for this approach, but this is the idea behind the original method: 

  • Take the noisy input signal and apply a short-time Fourier transform (STFT) algorithm 
  • Estimate background noise by averaging the spectral magnitudes of audio segments (frames) without speech 
  • Subtract noise estimation from spectral magnitudes of noisy frames 
  • Then, by using the original phases of the noisy frame spectrums, apply an inverse short-time Fourier transform (ISTFT) to get an approximated signal of clean speech (see Figure 3).

Figure 3: Spectral subtraction block diagram.
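
The sketch below illustrates these steps with SciPy's STFT, assuming (purely for illustration) that the first few frames of the recording contain noise only; real implementations use voice activity detection and smoother noise tracking:

```python
import numpy as np
from scipy.signal import stft, istft

# Illustrative spectral subtraction, assuming the first `noise_frames`
# STFT frames are speech-free. Not a production algorithm.
def spectral_subtraction(noisy, fs, noise_frames=10):
    f, t, spec = stft(noisy, fs, nperseg=512)
    mag, phase = np.abs(spec), np.angle(spec)

    # Estimate the noise magnitude spectrum from speech-free frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract and floor at zero to avoid negative magnitudes.
    clean_mag = np.maximum(mag - noise_mag, 0.0)

    # Recombine with the noisy phase and invert the STFT.
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs, nperseg=512)
    return enhanced
```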

Another classical solution is the minimum mean-square error (MMSE) algorithm introduced by Ephraim and Malah. 

With the rise of machine learning (ML), several solutions have also been proposed using classical ML-based approaches such as Hidden Markov Models (HMM), non-negative matrix factorization (NMF), and the wavelet transform.

To understand the limitations of these traditional approaches, note that we can divide noise signals into two categories: stationary and non-stationary. Stationary noises have a simpler structure: their characteristics are mainly constant over time, as with fan noise, white noise, wind noise, and river sounds. Non-stationary noises have time-varying characteristics and are more widespread in real life. They include traffic noises, construction noises, keyboard typing, cafeteria sounds, crowd noises, babies crying, clapping, animal sounds, and more. Traditional algorithms can effectively suppress stationary noises, but they have little to no effect on the more challenging non-stationary noises.

Recent advances in computer hardware and ML have led to increased research and industrial applications of algorithms based on deep learning methods, such as artificial neural networks (NN). Starting in the 2010s, neural network algorithms made tremendous progress in natural language, image, and audio processing. These systems outperform traditional approaches in terms of evaluation scores. 2015 saw the first results of speech enhancement via deep learning.

Deep learning approach

Figure 4 is a typical block diagram representing monaural speech enhancement using deep learning methods. 

The goal is generally as follows: 

Given an arbitrary noisy signal consisting of arbitrary noise and speech signals, create a deep learning model that will reduce or entirely remove the noise signal while preserving the speech signal without any audible distortion. 


Figure 4: SE using deep learning, a block diagram.

Let’s go over the main steps of this approach. 

  1. Training data: deep learning is a data-driven approach, so the end quality of the model greatly depends on the quality and amount of training data. In the case of speech enhancement, the raw training data is an audio set consisting of noisy and clean speech samples. To obtain such data we need to collect a clean speech dataset and a noise dataset. Then, by mixing clean speech and noise signals, we can artificially generate noisy/clean speech pairs as the model’s input/output data points. These are the most important aspects of data quality:
    • A clean speech dataset should not contain any audible background noises
    • Training voices and noises should be diverse to help the model generalize on unseen voices and noises
    • It’s preferable that samples are from high-quality microphones because this gives more flexibility in data augmentations

2. Feature extraction: A common choice is the audio spectrogram, a time and frequency representation of the signal, or spectrogram-based features like Mel-frequency cepstral coefficients (MFCCs), which reflect the human auditory system's response. As shown in Figure 5, we can visualize spectrograms as a color map of power spectrum values across the time and frequency dimensions, where lighter colors indicate higher power values and darker colors lower ones.


Figure 5: Example of speech spectrogram.

3. Neural Network: We can tune almost any type of neural network architecture for speech enhancement. We can treat spectrograms as images and apply image processing techniques, such as convolutional networks. We can also represent audio as sequential data, meaning that recurrent neural networks can be a proper choice, particularly gated recurrent units (GRU) and long short-term memory units (LSTM).

4. Training: During the training stage, the model “learns” generic patterns of clean speech spectrums and noise spectrums to distinguish between speech and noise. This ultimately enables it to recover the speech spectrum from the noisy/corrupted input. 

After the training stage, we can use the model for inference. It takes noisy audio input, extracts features, passes it to the neural network, obtains the clean speech features, and, during post-processing, recovers the clean speech signal in the output. Studies show that speech enhancement models based on deep learning are superior to traditional approaches and show significant noise reduction, not only in the case of stationary noises but also in non-stationary ones.  
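
Krisp's architecture is custom and not shown here, but as a generic illustration of the mask-based formulation described above, a deliberately tiny recurrent model in PyTorch might look like this:

```python
import torch
import torch.nn as nn

# A deliberately small mask-estimation sketch (not Krisp's architecture).
# Input: batch of magnitude spectrograms shaped (batch, time, freq_bins).
class TinyMaskNet(nn.Module):
    def __init__(self, freq_bins=257, hidden=128):
        super().__init__()
        self.gru = nn.GRU(freq_bins, hidden, batch_first=True)
        self.out = nn.Linear(hidden, freq_bins)

    def forward(self, noisy_mag):
        h, _ = self.gru(noisy_mag)
        mask = torch.sigmoid(self.out(h))   # per-bin mask in [0, 1]
        return mask * noisy_mag             # estimated clean magnitude

# Training would minimize e.g. an L1 or MSE loss between the estimated and
# reference clean magnitudes over pairs generated by mixing speech and noise.
model = TinyMaskNet()
dummy = torch.rand(4, 100, 257)             # 4 clips, 100 frames, 257 bins
clean_estimate = model(dummy)
```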

Krisp Noise Cancellation

Each use case dictates the SE algorithm’s specific requirements. In the case of Krisp, our mission is to provide an on-device, real-time experience to users all over the world. That’s why the model works on small chunks of the audio signal without introducing any noticeable latency and has small enough FLOPs to consume a reasonable amount of computational resources. To achieve this goal, we use custom neural network architecture and digital signal processing algorithms in the pre and post-processing stages. Our training dataset includes several thousand hours of clean speech and noise. During the training stage, we also apply various data augmentations to cover microphone diversity, acoustic conditions, signal-to-noise ratios (SNR), bandwidths, and other factors. 

We’ve achieved algorithmic latency of less than 20 ms, much less than the recommended maximum real-time latency of 200ms. Our evaluations and comparisons between our algorithms and other speech enhancement technologies show superior results, both in terms of the quality of preserved voice and the amount of eliminated noise.

Try next-level audio and voice technologies  

Krisp licenses its SDKs to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.



This article was written by:
Dr. Stepan Sargsyan, PhD in Mathematical Analysis. Dr. Sargsyan is an ML Architect at Krisp.


References

[1] Lim, J. and Oppenheim, A. V. (1979), Enhancement and bandwidth compression of noisy speech, Proc. IEEE, 67(12), 1586–1604.

[2] B. Widrow et al., “Adaptive noise cancelling: Principles and applications,” in Proceedings of the IEEE, vol. 63, no. 12, pp. 1692-1716

[3] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113-120, April 1979

[4] Y. Ephraim and D. Malah, “Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator,” in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109-1121, December 1984

[5] M. E. Deisher and A. S. Spanias, “HMM-based speech enhancement using harmonic modeling,” 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1175-1178 vol.2

[6] N. Mohammadiha, T. Gerkmann and A. Leijon, “A new approach for speech enhancement based on a constrained Nonnegative Matrix Factorization,” 2011 International Symposium on Intelligent Signal Processing and Communications Systems (ISPACS), 2011, pp. 1-5

[7] R. Patil, “Noise Reduction using Wavelet Transform and Singular Vector Decomposition”, Procedia Computer Science, vol. 54, 2015, pp. 849-853.

[8] Y. Xu, J. Du, L. -R. Dai and C. -H. Lee, “A Regression Approach to Speech Enhancement Based on Deep Neural Networks,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, Jan. 2015

[9] Y. Zhao, B. Xu, R. Giri, and T. Zhang, “Perceptually guided speech enhancement using deep neural networks,” in 2018 IEEE Int. Conf. Acoustics Speech and Signal Processing Proc., 2018, pp. 5074-5078.

[10] T. Gao, J. Du, L. R. Dai, and C. H. Lee, “Densely connected progressive learning for LSTM-Based speech enhancement,” in 2018 IEEE Int. Conf. Acoustics Speech and Signal Processing Proc., 2018, pp. 5054-5058.

[11] S. R. Park and J. W. Lee, “A fully convolutional neural network for speech enhancement,” in Proc. Annu. Conf. Speech Communication Association Interspeech 2017.

[12] A. Pandey and D. Wang, “A new framework for CNN-Based speech enhancement in the time domain,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, July. 2019.

[13] F. G. Germain, Q. Chen, and V. Koltun, “Speech denoising with deep feature losses,” in Proc. Annu. Conf. Speech Communication Association Interspeech, 2019.

[14] D. Baby and S. Verhulst, “Sergan: Speech enhancement using relativistic generative adversarial networks with gradient penalty,” in 2019 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2019.

[15] H. Phan et al., “Improving GANs for speech enhancement,” IEEE Signal Process. Lett., vol. 27, 2020.

[16] P. Karjol, M. A. Kumar, and P. K. Ghosh, “Speech enhancement using multiple deep neural networks,” in 2018 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2018.

[17] H. Zhao, S. Zarar, I. Tashev, and C. -H. Lee, “Convolutional-Recurrent Neural Networks for Speech Enhancement,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2401-2405.

[18] ITU-T G.114: https://www.itu.int/rec/T-REC-G.114-200305-I/en 


The post Speech Enhancement Review: Krisp Use Case appeared first on Krisp.

]]>
https://krisp.ai/blog/speech-enhancement-review-krisp-use-case/feed/ 0
The Hard Side of Noise Reduction – Hardware Based Approach via Beamforming https://krisp.ai/blog/hardware-beamforming-noise-reduction/ https://krisp.ai/blog/hardware-beamforming-noise-reduction/#respond Mon, 30 Jan 2023 10:54:00 +0000 https://krisp.ai/blog/?p=9810 Speech and noise From an important conversation with a friend to a professional presentation, speech is essential to human nature. Above all else, it’s how we communicate. So it’s hard to overstate the importance of communicating in a clear and effective way. One of the most critical aspects of effective communication is speech comprehensibility. To […]

The post The Hard Side of Noise Reduction – Hardware Based Approach via Beamforming appeared first on Krisp.

]]>
Speech and noise

From an important conversation with a friend to a professional presentation, speech is essential to human nature. Above all else, it’s how we communicate. So it’s hard to overstate the importance of communicating in a clear and effective way.

One of the most critical aspects of effective communication is speech comprehensibility. To assess it, we use a well-known measure called speech intelligibility. Some factors that can harm intelligibility include background noise, other people talking near the speaker, and reverberation, also called room echo. Studies show that a decline in communication efficiency due to background noise and reverberation can be substantial (see for example [6] and [8], also [7] for an overview of speech communication).

Being a problem of paramount importance, myriad works are devoted to enhancing the quality of speech recorded in noisy conditions. The solutions and approaches to this problem can be roughly partitioned into two categories:  software-based and hardware-based, without precluding the combination of both.

It all starts with hardware when we use a microphone to record a sound. However, under typical environmental conditions, a single microphone may perform poorly in capturing the sound of interest (e.g. the speaker’s voice) because the microphone will also pick up nearby sounds,  adversely affecting speech intelligibility. To address this problem, many consumer electronic devices, such as laptops, smartphones, tablets, and headsets, may use multiple microphones to capture the same audio stream (see [1] and [2] for practical examples).

Interfering Waves

To show how several microphones might become helpful for voice enhancement, we need to understand the physics that govern sound. In its basic definition, sound is a vibration propagating through a transmission medium (e.g. air or water) as an acoustic wave. The principle of superposition states that in linear systems the net response at a given point of several stimuli is the sum of the responses of each stimulus as if they were acting separately. This principle applied to sound signals coming from several sources implies (under linear time invariance of a microphone) that the net effect recorded by a microphone will be the sum of individual sources. Essentially, this means that sound waves can combine to either amplify or reduce each other.

An interesting observation here is that, depending on the microphone’s position, the signal from some of the sources might be reduced (attenuated). Just think about two stones dropped into a pond: the two waves that are formed will add to create a wave of greater or lower amplitude, depending on where in the pond we look. This phenomenon, when two or more waves are combined, is called wave interference. We then distinguish between constructive and destructive interference, depending on whether the amplitude of the combined wave is larger or smaller than the constituent waves.


Figure 1. The orange-colored curve is the sum (the result of interference) of two sine waves of 100 Hz where one of the waves is slightly shifted in time. Observe that the amplitude of the combined wave is smaller than the sum of the amplitudes of the individual waves.

A common application of wave interference is active noise cancellation (ANC) in headsets. Here, a signal of the same amplitude as the captured noise but with an inverted phase is emitted by a noise-canceling speaker to reduce the unwanted noise. The interference between the original signal (the noise) and the generated one (with anti-phase) is destructive, resulting in noise reduction.

Enter Beamforming

By leveraging wave interference, one may then try to use several microphones to reinforce the signal coming from the direction of the main speaker (constructive interference) while suppressing signals from other directions (destructive interference). This is what beamforming does.

Audio beamforming, also known as spatial filtering, is a commonly used method for noise reduction. In a nutshell, it is a digital signal processing (DSP) technique that processes and combines sound received by a microphone array (see Figure 2) to enable the preferential capture of sound coming from specific directions. In other words, depending on which direction the microphone receives the sound from, the signal will experience constructive or destructive interference with its time-shifted replicas (see Figure 1).

The Delay and Sum Beamforming Technique

There are many beamforming algorithms. For the exposition, we will concentrate on the most basic form, which, despite its simplicity, conveys the main ideas at the heart of the method.

Assume we have a linear uniform array of microphones, i.e. all of the microphones are arranged in a straight line with an equal distance between each other. For a source located sufficiently far from the array, we can model a sound emitting from it and reaching the microphones as a planar wave. Indeed, the sound waves propagate in spherical fronts, and the sphere’s curvature decreases as its radius extends. Thanks to this and the assumption of a distant source, we get an almost flat surface near the receiver (see Figure 2).



Figure 2. This schematic view represents a linear microphone array, where M-s stand for microphones, and d is the distance between them. The signal source, indicated in green, is located perpendicular (at 90 degrees) to the microphone array, where for the reference point on the array we take the leftmost microphone M0. The noise source, depicted in red, is seen from the array at a different angle denoted by θ. Observe that the source will reach all microphones simultaneously, while the noise arrives at the rightmost microphone first before getting to the rest in the array.

Recall that beamforming aims to capture signals from specific preferential directions and suppress signals from other directions. Let’s assume that the preferential direction is perpendicular to the microphone array. We will see later that this results in no loss of generality.

Since we assumed that the audio source is located far away from the microphone array, we can say that the waves traveling from the source to individual microphones are almost parallel, just as lines from either Los Angeles or San Diego to New York would likewise be virtually parallel. This implies that the angle between these rays and the line on which the array is positioned does not change from microphone to microphone.

Next, depending on the position of the source, the sound waves emitting from it might reach the microphones at different times. This means that while each microphone in the array records the same signal, they will be slightly shifted in time. Taking the leftmost microphone on the array as our reference, we can easily calculate the time delays with respect to the reference microphone, by using our assumption of parallel rays and relying on the fact that the speed of sound in air is constant. We then sum the reference signal (the one recorded by the leftmost microphone) with its time-shift replicas and divide the sum by the number of microphones in the array to normalize the net signal. Depending on the size of the delays and wavelength of the signal, we can get destructive interference. Note that for our direction of interest, which was fixed at 90 degrees (see Figure 2), there is no time shift in the captured signals; hence the same summing followed by the normalization procedure results in the reference signal recorded at the leftmost microphone.

The process just described is called delay and sum beamforming, and it is the simplest form of the beamforming method. Our assumption that the source is at 90 degrees can be replaced by any angle. Indeed, for a given angle we can compute the expected delays described above and apply the delay corresponding to a given microphone to the signal recorded by it. This will align the signals as if they arrived simultaneously, similar to the perpendicular case. This procedure of adjusting the directionality of the array is called steering.


The microphone array’s topology also directly affects the result of the delay and sum method. To illustrate how the delay and sum algorithm works, we perform a numerical simulation for 20 microphones with an inter-microphone distance of 0.08 meters, where the input signal is a sine wave of 1000 Hz. The images below show how the array responds to this source, depending on the direction of arrival. We transform the amplitude of the resulting wave to decibel scale (dB), referring to it as the gain of the array.



Figure 3. The first image shows the gain of an array of 20 microphones placed 0.08 meters away from each other on a sine wave of 1000 Hz for angles in the range of 0 to 180. Notice that we have a peak at 90 degrees, while the rest of the angles experience attenuation. The second image depicts the same graph as above in polar coordinates, with the original amplitudes without passing to a dB scale. Here we see a beam forming at 90 degrees (cf. beamforming), and smaller beams at other degrees.
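
For readers who want to reproduce this kind of plot, here is a compact sketch of the broadside (90-degree) delay-and-sum response of a uniform linear array, using the same parameters as the simulation above (20 microphones, 0.08 m spacing, 1000 Hz tone):

```python
import numpy as np

# Delay-and-sum response of a uniform linear array steered at broadside.
c = 343.0          # speed of sound in air, m/s
d = 0.08           # inter-microphone spacing, m
num_mics = 20
freq = 1000.0      # source frequency, Hz

angles = np.deg2rad(np.linspace(0, 180, 361))
m = np.arange(num_mics)

# Relative arrival delays of a plane wave at each microphone, per angle.
delays = np.outer(np.cos(angles), m * d / c)            # shape (angles, mics)

# Sum the time-shifted replicas and normalize by the number of microphones.
response = np.exp(-1j * 2 * np.pi * freq * delays).mean(axis=1)
gain_db = 20 * np.log10(np.abs(response) + 1e-12)

print("gain at 90 degrees: %.1f dB" % gain_db[180])     # ~0 dB (no attenuation)
print("gain at 45 degrees: %.1f dB" % gain_db[90])      # attenuated
```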

The gain of the array varies with frequency. The graphs below show gains for frequencies ranging from 0 to 8000 Hz for arrays of 10, 20, and 40 microphones respectively, arranged in a line with 0.08 meters of inter-microphone distance. Remarkably, the array cannot suppress certain high-frequency signals even when the arrival angle differs from the preferred one. This phenomenon, called spatial aliasing, happens when a linear array detects signals whose wavelengths are not long enough compared to the distance between array elements.


Figure 4. The gain of a source signal for frequencies ranging from 0 to 8000 Hz for a linear array of microphones with 10, 20, and 40 elements respectively, with a 0.08-meter distance between microphones. The lighter the color, the more significant the gain, while darker colors indicate attenuation. Notice that there is a light-colored sector around 90 degrees. However, there are also lighter sections at other angles for higher frequencies due to spatial aliasing. Observe also that the larger the number of microphones, the stronger the concentration around the preferred direction.
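
A common rule of thumb is that a uniform linear array is free of spatial aliasing only when the microphone spacing does not exceed half the wavelength, i.e. for frequencies below c / (2d). A quick back-of-the-envelope check for the array above:

```python
# Spatial-aliasing limit for a uniform linear array: spacing must stay below
# half a wavelength, so aliasing-free operation requires f < c / (2 * d).
c = 343.0   # speed of sound in air, m/s
d = 0.08    # inter-microphone spacing, m
print("aliasing-free up to roughly %.0f Hz" % (c / (2 * d)))  # ~2144 Hz
```

This is consistent with the aliasing that appears at the higher frequencies in Figure 4.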

Spatial Filtering in Human Auditory System

One of the applications of beamforming technology is sound source localization, which estimates the position of one or many sound sources relative to a given reference point. Remarkably, this happens all the time in our ears.

We’re able to tell where a sound is coming from when we hear it because of a sound localization process that is partially based on binaural cues. More precisely, it’s because of the differences in the arrival time or the intensity of the sounds at the left and right ears. Such differences are used mainly for left-right localization in space.

Beamforming Beyond Sound

Beamforming methods are abundant. Some variations of the method discussed above include amplitude scaling of each signal before the summation stage, setting up microphones in a rather complex arrangement instead of a linear placement, adapting the response of a beamformer automatically (adaptive beamforming), etc.

Beamforming methods also work in the reverse mode, i.e. for signal transmitting instead of receiving. A wide range of applications goes beyond acoustics, as beamforming is employed in radar, sonar, seismology, wireless communications, and more. The interested reader may consult [3], [4], [5], and the references therein for more information.

Audio Enhancement Beyond Beamforming

Although audio beamforming may be effective at reducing noise and increasing speech intelligibility, it relies on specialized hardware. Plus, beamforming alone does not guarantee the complete elimination of background noises, so the output of spatial filtering usually undergoes further processing before transmission.

At Krisp, we approach the problem of speech enhancement in noisy conditions from a pure software perspective. While we recognize the benefits that spatial filtering can offer, we prioritize the flexibility of deploying and utilizing our solutions on any device, no matter whether it has many microphones or just one.

Try next-level audio and voice technologies  

Krisp licenses its SDKs to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.


References

[1] Dusan, S.V., Paquier, B.P. and Lindahl, A.M., Apple Inc, 2016. System and method of improving voice quality in a wireless headset with untethered earbuds of a mobile device. U.S. Patent 9,532,131.

[2] Iyengar, V. and Dusan, S.V., Apple Inc, 2018. Audio noise estimation and audio noise reduction using multiple microphones. U.S. Patent 9,966,067.

[3] Kellermann, W. (2008). Beamforming for Speech and Audio Signals. In: Havelock, D., Kuwano, S., Vorländer, M. (eds) Handbook of Signal Processing in Acoustics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-30441-0_35

[4] Li, J. and Stoica, P., 2005. Robust adaptive beamforming. John Wiley & Sons.

[5] Liu, W. and Weiss, S., 2010. Wideband beamforming: concepts and techniques. John Wiley & Sons.

[6] Munro, M. (1998). THE EFFECTS OF NOISE ON THE INTELLIGIBILITY OF FOREIGN-ACCENTED SPEECH. Studies in Second Language Acquisition, 20(2), 139-154. doi:10.1017/S0272263198002022 https://www.cambridge.org/core/journals/studies-in-second-language-acquisition/article/abs/effects-of-noise-on-the-intelligibility-of-foreignaccented-speech/67D7F1E4248B019B77DA46825DB70E74#

[7] O’Shaughnessy. Speech communications: Human and machine. 1999.

[8] Puglisi, G.E., Warzybok, A., Astolfi, A. and Kollmeier, B., 2021. Effect of reverberation and noise type on speech intelligibility in real complex acoustic scenarios. Building and Environment, 204, p.108-137.

https://www.sciencedirect.com/science/article/abs/pii/S0360132321005382


The article is written by:

  • Hayk Aleksanyan, PhD in Mathematics, Architect, Tech Lead
  • Hovhannes Shmavonyan, PhD in Physics, Senior ML Engineer I
  • Tigran Tonoyan, PhD in Computer Science, Senior ML Engineer II
  • Aris Hovsepyan, BSc in Computer Science, ML Engineer II

The post The Hard Side of Noise Reduction – Hardware Based Approach via Beamforming appeared first on Krisp.

]]>
https://krisp.ai/blog/hardware-beamforming-noise-reduction/feed/ 0
Noise Cancellation Quality Evaluation: How We Test Krisp’s Technology https://krisp.ai/blog/noise-cancellation-quality-evaluation/ https://krisp.ai/blog/noise-cancellation-quality-evaluation/#respond Fri, 23 Sep 2022 12:47:00 +0000 https://krisp.ai/blog/?p=9654 The power of Krisp lies in its AI-based noise-cancellation algorithm—which we are constantly perfecting. Noise cancellation comes with distinct technical challenges: Namely, how do we remove distracting background noise while preserving human voice? If the app overcorrects on noise removal, the speaker’s voice will sound robotic. If it overcorrects on voice preservation, unwanted sounds will […]

The post Noise Cancellation Quality Evaluation: How We Test Krisp’s Technology appeared first on Krisp.

]]>
The power of Krisp lies in its AI-based noise-cancellation algorithm—which we are constantly perfecting. Noise cancellation comes with distinct technical challenges: Namely, how do we remove distracting background noise while preserving human voice? If the app overcorrects on noise removal, the speaker’s voice will sound robotic. If it overcorrects on voice preservation, unwanted sounds will remain.

In each node of the audio pipeline, there are standardized methods for measuring quality. In this blog, we’re going to dive deep into the nuances and challenges of evaluating the quality of software-based, real-time Noise Cancellation (NC). We’ll show you how we do algorithm testing here at Krisp. But first, let’s begin with the basics.

What affects sound quality?

Our perception

Hearing is a physiological process and is based on human perception. Sound waves reach your ears and create an auditory perception that varies depending on the medium through which it traveled and the way your particular ears are structured. For example, you will perceive the same sound differently in the air versus in water. Another example is computer-generated audio that some people hear as “yanny” and others as “laurel.” Check it out below. What do you hear?

So, sound quality is based on our perception, which leads to a strong subjectivity factor. 

Microphones and speakers

Another factor affecting overall sound quality is the audio recording and reproduction device used. Microphones can introduce specific audible and inaudible distortions like clipping, choppy voice, or suppressed frequency ranges. Moreover, they capture background noises and voices along with the main speaker’s voice. Though there are devices that have built-in noise cancellation, these kinds of hardware solutions are often not affordable for everyone or don’t eliminate the background noise completely. 

In contrast, Krisp provides an AI-based software solution that increases the call quality with any device by effectively identifying and removing unwanted sounds. Developing and perfecting this app has unique challenges. Let’s dive into how we at Krisp test our noise-cancellation algorithm.

3 testing challenges of noise-cancellation algorithms

1. AI-based system testing is different from non-AI-based system testing.

The first challenge of NC algorithm quality assurance is that we’re dealing with an AI-based system. AI testing, in general, is different from non-AI-based software testing. In the case of non-AI-based software testing, the test subject is the predefined desired behavior. In the case of AI testing, however, the test subject is the logic of the system. See Figure 1 below.


Figure 1: Testing flow of Software and AI systems

2. Noise cancellation has tradeoffs.

The second challenge is that we always need to consider the possible tradeoffs of the system. 

NC aggressiveness

One such tradeoff is the NC aggressiveness, which is the level of noise cancellation we apply. If the level is too low, distracting sounds still come through. If the level is too high, the speaker’s voice sounds robotic. The NC algorithm must find the golden mean where it eliminates all of the background noises while preserving the speaker’s voice.

Resource usage vs. quality

Another tradeoff is between resource usage and quality. Normally, the more extensive the neural network, the better it performs. But the machines that run real-time NC algorithms typically have limited resources. So we need to assess consumed resources, such as CPU usage, memory allocation, and power consumption, to verify that the algorithm works on-device with the expected quality. In addition, SDKs doing real-time noise cancellation need to be verified on multiple platforms, such as Mac, Windows, and iOS.

3. Audio quality can be affected by microphones and at multiple points along the journey.

The third challenge of NC algorithm quality assurance is that microphone devices can add their own distortions, and there are other factors impacting audio quality as well. Before reaching the other side of the conferencing call, audio makes a long journey (see Figure 2), and each step can add its own degradations. A network can cause packet loss and other artifacts, which may lead to degradations in the output signal. On top of that, the playback device at the final point must be capable of reproducing the audio with high fidelity.


Figure 2: Audio Signal Processing Pipeline

Speech quality testing: Subjective evaluations

The straightforward approach to Noise Cancellation quality evaluation is to listen to its output—in technical terms, to conduct subjective evaluations. People of different ages, with different ear structures, or even in different moods may perceive the same piece of audio differently. No doubt, this leads to subjective bias. To decrease that bias as much as possible, standardized processes for subjective tests have been introduced.

ITU-R BS.1284-2 establishes the recommended standardized setup for conducting subjective tests. To get accurate results, you need as diverse a group of listeners as possible: more opinions, lower bias. The listeners rate the quality of audio on a scale of 1 to 5, where 5 corresponds to the highest possible quality and 1 to the lowest.

In our case, the score represents a healthy average of intelligibility of the voice and audibility of residual noise (if any). The mean opinion score (MOS) comes from the arithmetic mean of all of the listeners’ scores. 

Quality of the Audio     MOS Score
Excellent                5
Good                     4
Fair                     3
Poor                     2
Bad                      1

Figure 3: MOS score rating description

Speech quality testing: Objective evaluations

Though subjective tests are precise, they can be very costly and time-consuming, as they require a lot of human, time, and financial resources. Consider a situation with many variants of an NC algorithm, or two or more NC algorithms under test. Conducting subjective evaluations for all of them may delay early feedback, making it impossible to deliver the algorithms promptly.

To overcome these issues, we’re also considering objective evaluation metrics. Some of these metrics tend to be highly correlated with subjective scores. Unlike subjective scores, objective evaluations are repeatable. No matter how many times you evaluate the same algorithm, you will get the same scores—something that is not guaranteed with subjective metrics. Objective evaluations make it possible to evaluate many NC algorithms in the same conditions without wasting extra effort and time. 

There are a few objective evaluation metrics designed for different use cases. Each has its own logic for audio assessment. Let’s review some of the metrics that we’re currently considering. 

PESQ (Perceptual Evaluation of Speech Quality) is the most widely used metric in research, though it was designed not explicitly for NC quality evaluation but for measuring speech quality after network transmission and codec-related distortions. It is standardized as Recommendation ITU-T P.862. Its result represents the mean opinion score (MOS) on a scale from 1 (bad) to 5 (excellent).
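
As an illustration, a PESQ score can be computed with the open-source `pesq` Python package (one publicly available implementation of ITU-T P.862; the file names below are placeholders, and this is not necessarily the tooling used at Krisp):

```python
import soundfile as sf
from pesq import pesq  # open-source ITU-T P.862 implementation (pip install pesq)

# Hypothetical file names; both files are assumed to be 16 kHz mono recordings
# of the same utterance (clean reference vs. processed/degraded output).
ref, fs = sf.read("clean_reference.wav")
deg, _ = sf.read("enhanced_output.wav")

score = pesq(fs, ref, deg, "wb")   # wideband mode; returns a MOS-like score
print("PESQ score:", score)
```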

POLQA (Perceptual Objective Listening Quality Analysis) is an upgraded version of PESQ that provides an advanced level of benchmarking accuracy and adds significant new capabilities for super-wideband (HD) and full-band speech signals. It’s standardized as Recommendation ITU-T P.863. POLQA has the same MOS scoring scale as its predecessor PESQ, though it’s for the same use case of evaluating quality related to codec distortions.

3QUEST (3-fold Quality Evaluation of Speech in Telecommunications) was designed to assess the background noise separately in a transmitted signal. Thus, it returns Speech-MOS (S-MOS), Noise-MOS (N-MOS), and General-MOS (G-MOS) values on a scale of 1 (bad) to 5 (excellent). G-MOS is a weighted average of the other two. It’s standardized as Recommendation ITU-T P.835. Based on our experience, 3QUEST is the most suitable objective metric for NC evaluations.

NISQA (Non-intrusive Objective Speech Quality Assessment) is a relatively new metric. Unlike the above-listed metrics, it doesn’t require the reference clean speech audio file. It’s standardized from ITU-T Rec. P.800 series. Besides MOS score, NISQA provides a prediction of the four speech quality dimensions: Noisiness, Coloration, Discontinuity, and Loudness. 

How we generate test datasets

A straightforward approach to generating test datasets is to mix the clean voice and noise with the desired SNR (Signal-to-Noise Ratio) using an audio editor or other tools. Though these recordings don’t simulate the real use case and are artificial, we use such recordings in the initial testing phases as they are easy to collect and can catch obvious quality issues in the early stages of development.
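
A minimal sketch of such artificial mixing, assuming equal-length float arrays for speech and noise, could look like this:

```python
import numpy as np

# Mix clean speech with noise at a target SNR (in dB) to create an
# artificial noisy/clean test pair.
def mix_at_snr(speech, noise, snr_db):
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```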

To have more accurate recordings for qualitative tests and evaluations, we’re collaborating with well-equipped audio labs to get recordings with predefined use cases. Real use cases are simulated in ETSI rooms and recorded with high-quality microphones, where:

  • The spoken sentences are phonetically balanced, intended to cover all possible sounds in a certain language.
  • There are a few speakers in the same recordings, intended to cover speaker-independency tests.
  • Recordings are done with different languages, intended to cover language independency tests.
  • Various noises are simulated at different noise levels and recorded along with the voice using the same mic.

As you can see, a lot of use cases are covered with the above-mentioned datasets. However, to ensure the Noise Cancellation algorithmic quality, we still need to consider some test scenarios that are outside of these recordings’ scope. To fill that gap, we also perform in-house recordings.

To ensure Krisp works with any device, we need to test it with a lot of devices. However, testing with all possible devices is practically impossible and may not be necessary. Instead, we’ve identified the top devices used by our users and targeted the testing on those devices. Currently, we have an in-house “device farm” with almost 50 mics, and we’re continuously adding the latest available mics. 


Figure 4: Some microphones from our test set

As a part of the NC algorithm, Krisp also removes the room echo, or in more technical terms, reverberation. Hence we’re considering a lot of rooms with different acoustic setups to guarantee meaningful coverage of reverberant cases. 

Noise cancellation and speech quality: The eternal tradeoff 

Again, when evaluating the quality of Krisp’s NC algorithm, we must strike the right balance between removing background noise and preserving the speaker’s voice. There is an eternal tradeoff between these two criteria. 

To gain a complete picture, we’re considering different test scenarios/datasets along with corresponding applicable metrics. There is no single metric that best represents quality on its own. Each evaluation metric is designed for certain use cases; they complement each other rather than replace one another. Testing an NC algorithm with several evaluation metrics allows us to build a more comprehensive picture and assess the audio quality from different standpoints.

Try next-level audio and voice technologies  

Krisp licenses its SDKs to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.




This article is written by:

Ani Chilingaryan, MS in Computer Science, CTFL | QA Manager, Research

The post Noise Cancellation Quality Evaluation: How We Test Krisp’s Technology appeared first on Krisp.

]]>
https://krisp.ai/blog/noise-cancellation-quality-evaluation/feed/ 0
On-Device Meeting Transcription and Speech Recognition Testing https://krisp.ai/blog/speech-recognition-testing/ https://krisp.ai/blog/speech-recognition-testing/#respond Mon, 12 Sep 2022 18:51:55 +0000 https://krisp.ai/blog/?p=9582 We are inevitably going to have even more online meetings in the future. So it’s important to not get lost or lose context with all the information around us. Just think about one of last week’s meetings. Can your team really recall everything discussed on a call?? When no one remembers exactly what was discussed […]

The post On-Device Meeting Transcription and Speech Recognition Testing appeared first on Krisp.

]]>
We are inevitably going to have even more online meetings in the future. So it’s important to not get lost or lose context with all the information around us.

Just think about one of last week’s meetings. Can your team really recall everything discussed on the call?

When no one remembers exactly what was discussed during an online meeting, the effectiveness of these calls is reduced materially.

Many meeting leaders take hand-written notes to memorialize what was discussed to share and assign action items. However, the act of manually taking notes causes the note-taker to not be fully present during the call, and even the best note-takers miss key points and important dialogue.

Fixing meeting knowledge loss

In 1885, Hermann Ebbinghaus claimed that humans begin to lose recently acquired knowledge within hours of learning it. This means that you will recall the details of a conversation during the meeting, but, a day later, you’ll only remember about 60% of that information. The following day, this drops to 40% and keeps getting lower until you recall very little of the specifics of the discussion.

The solution?

Automatically transcribe meetings so that they can be reviewed and shared after the call. This approach helps us access important details discussed during a meeting, allowing for accurate and timely follow-up and preventing misunderstandings or missed deadlines due to miscommunication.

Many studies have shown that people are generally more likely to comprehend and remember visual information than information shared through audio alone in events or meetings.

Meeting transcriptions provide participants with a visual format of any spoken content. It also allows attendees to listen and read along at the same time. This makes for increased focus during meetings or events, and improved outcomes post-meeting.

On-device processing

Having meeting transcript technology available to work seamlessly with all online meeting applications allows for unlimited transcriptions without having to utilize expensive cloud services.

At Krisp, we value privacy and keep the entire speech processing on-device, so no voice or audio is ever processed or stored in the cloud. This is very important from a security perspective, as all voice and transcribed data stay on-device, under the user’s control.

It’s quite a big challenge to make transcription technologies work on-device due to the constrained compute resources available compared with cloud-based servers. Most transcription solutions are cloud-based and don’t deliver the accuracy and privacy of an on-device solution. On-device technologies need to be optimized to operate smoothly, with specific attention to:

  • Package size
  • Memory footprint
  • CPU usage (calculated using the real-time factor (RTF), which is the ratio of the technology response time to the meeting duration)

So every time we test a new underlying speech model, we first ensure that it is able to operate within the limited resources available on most desktop and laptop computers.
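
For example, the real-time factor (RTF) mentioned above can be estimated with a few lines; `process_fn` below is a placeholder for whatever model or pipeline is under test:

```python
import time

# Real-time factor: processing time divided by audio duration.
# Values well below 1.0 leave headroom for other on-device work.
def real_time_factor(process_fn, audio_samples, sample_rate):
    start = time.perf_counter()
    process_fn(audio_samples)            # run the model/pipeline under test
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio_samples) / sample_rate)
```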

Technologies behind meeting transcripts

At first glance, it looks like the only technology behind having readable meeting transcriptions is simply converting audio into text.

But there are two main challenges with this simplistic approach:

  1. First, we don’t explicitly include punctuation when speaking like we do when writing. So we can only guess punctuation from the acoustic information. Studies have found that transcripts without punctuation are even more detrimental to understanding the information than a word error rate of 15 or 20%. So we need a separate solution for adding punctuation and capitalization to text. 
  2. The second challenge is to distinguish between texts spoken by different people. This distinction improves the readability and understanding of the transcript. Differentiating between separate speakers is typically performed with a separate technology from core ASR, as there are hidden text-dependent acoustic features inside speech recognition models. In this case, text-independent speaker features are needed.

To summarize, meeting transcription technology consists of three different technologies:

  • ASR (Automatic Speech Recognition): Enables the recognition and translation of spoken language into text. Applying this technology on an input audio stream gives us lowercase text with only apostrophes as punctuation marks.
  • Punctuation and capitalization of the text: Enables the addition of  capitalization, periods, commas, and question marks.
  • Speaker diarization: Enables the partitioning of an input audio stream into homogeneous segments according to every speaker’s identity.

The diagram below represents the process of generating meeting transcripts from an audio stream:

As mentioned above, all of these technologies should work on-device, even when CPU usage must stay low. For good results, a proper testing mechanism is needed for each of the models. Metrics and datasets are the key components of this type of testing methodology. Each of the models has further testing nuances that we’ll discuss below.

Datasets and benchmarks for speech recognition testing 

To test our technologies, we use custom datasets along with publicly available datasets such as Earnings 21. This gives us a good comparison with both open source benchmarks and those provided through research papers.

For gathering the test data, we first define the use cases and collect data for each one. Let’s take online meetings as the main use case. Here we need conversational data for testing purposes. We also perform a competitor evaluation on the same data to see our advantages and identify possible improvement areas for Krisp’s meeting transcription technology.

Testing the ASR model

Metrics

The main testing metric for the ASR model is the WER (Word Error Rate), which is computed from the reference (labeled) text and the recognized text in the following way:
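
In standard terms, WER is the number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference text, divided by the number of words in the reference. A small illustrative implementation:

```python
import numpy as np

# Illustrative WER: (substitutions + deletions + insertions) / reference words.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution / match
    return d[len(ref), len(hyp)] / max(len(ref), 1)

print(wer("turn the noise off please", "turn noise of please"))  # 0.4
```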

Datasets

After gathering custom conversational data for the main test, we augment it by adding:

  • Noises at signal-to-noise ratios (SNR) of 0, 5, and 10 dB.
  • Reverberation with reverberation times from 100 ms to 900 ms.
  • Low-pass and high-pass filters to simulate low-quality microphones.
    • For this scenario, we’re also using the Earnings 21 dataset because its utterances have very low bandwidth
  • Speech pace modifications.

We also want our ASR to support accents such as American, British, Canadian, Indian, Australian, etc.

We’ve collected data for each of those accents and calculated the WER, comparing the results with competitors.

Testing the punctuation and capitalization model

The main challenge of testing punctuation is the subjectivity factor. There can be multiple ways of rendering punctuation, and all of them can be valid. For instance, adding commas and even deciding on the length of a sentence depends on the grammar rules you choose to follow.

Metrics

The main metrics for measuring accuracy here are Precision, Recall, and the F1 score. These are calculated for each punctuation mark and capitalization instance.

  • Precision: The number of true predictions of a mark divided by the total number of all predictions of the same mark.
  • Recall: The number of true predictions of a mark divided by the total number of occurrences of that mark in the reference.
  • F1 score: The harmonic mean of precision and recall.

Datasets

Since we use the punctuation and capitalization model on top of ASR, we have to evaluate it on a text with errors. Taking this into account, we run our ASR algorithm on the meeting data we collected. Then, linguists manually punctuate and capitalize the output texts. Using these as labels, we’re ready to calculate the three above-mentioned metrics [Precision, Recall, F1 score].
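
As an illustration, per-mark precision, recall, and F1 can be computed along these lines, assuming the reference and predicted texts have already been aligned token by token (the alignment itself is a separate step):

```python
# Illustrative per-mark precision/recall/F1 over aligned token sequences.
def punctuation_scores(reference_tokens, predicted_tokens, mark=","):
    # True positives: positions where both texts carry the mark.
    tp = sum(r == mark and p == mark
             for r, p in zip(reference_tokens, predicted_tokens))
    predicted = sum(p == mark for p in predicted_tokens)
    actual = sum(r == mark for r in reference_tokens)

    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```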

Testing the speaker diarization model

Metrics

The main metrics of the speaker diarization model testing are:

  • Diarization error rate (DER), which is the sum of the following error rates
    • Speaker error: The percentage of scored time when a speaker ID is assigned to the wrong speaker. This type of error doesn’t account for speakers when the overlap isn’t detected or if errors from non-speech frames occur.
    • False alarm speech: The percentage of scored time when a hypothesized speaker is labeled as non-speech.
    • Missed speech: The percentage of scored time when a hypothesized non-speech segment corresponds to a reference speaker segment.
    • Overlap speaker: The percentage of scored time when some of the multiple speakers in a segment don’t get assigned to any speaker.
  • Word diarization error rate (see the Joint Speech Recognition and Speaker Diarization via Sequence Transduction paper), which is calculated as follows:
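
Roughly following that paper's definition, the rate counts recognized words that end up with the wrong speaker label:

\[ \mathrm{WDER} = \frac{S_{\mathrm{IS}} + C_{\mathrm{IS}}}{S + C} \]

where C and S are the numbers of correctly recognized and substituted words, and C_IS and S_IS are those among them that received an incorrect speaker label.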

Datasets

We used the same custom datasets as for the ASR model, making sure that the number of speakers varies considerably from sample to sample in this test data. We also applied the same augmentations as in the ASR testing.

Conclusions on speech recognition testing

On-device meeting transcription combines three different technologies. All of them require extensive testing, especially since they need to run on-device.

The biggest challenges are choosing the right datasets and the right metrics for each use case as well as ensuring that all technologies run on the device without impacting other running processes.

Try next-level audio and voice technologies 

Krisp licenses its SDKs to developers to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.



This article is written by:

Vazgen Mikayelyan, PhD in Mathematics | Machine Learning Architect, Tech Lead

The post On-Device Meeting Transcription and Speech Recognition Testing appeared first on Krisp.

Speech Quality Measurement Algorithms and Testing Technology https://krisp.ai/blog/speech-quality-measurement/ Mon, 29 Aug 2022 19:11:45 +0000
Estimating the quality of speech is a central part of quality assurance in any system for generating or transforming speech. 

Such systems include telecommunication networks and speech processing or generation software. The speech signal at their output suffers from various degradations inherent to the particular system.

With background noise cancellation, an algorithm could leave remnants of noise in an audio snippet or partially suppress speech (see Fig. 1). A telecommunication system could also suffer from packet loss and delays. Meanwhile, audio codecs can introduce unwanted coloration, and text-to-speech systems can deliver unnatural-sounding speech.

A more straightforward approach to speech quality testing is conducting listening sessions, resulting in subjective quality results. 

A group of people listens to recordings under controlled settings. They’re then asked to provide normalized feedback (e.g. on a scale from 1 to 5). Responses are then aggregated into a single quality value, called Mean Opinion Score (MOS). 

Aggregation is necessary to avoid subject bias. There are standard guidelines for conducting such listening tests and for further statistical analysis of the results that yield the MOS values (see ITU-T P.830, ITU-R BS.1116, ITU-T P.835).
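
As a small illustration of the aggregation step, the sketch below turns per-listener ratings for one recording into a MOS with an approximate 95% confidence interval; the ratings are invented for the example, and the ITU recommendations just cited prescribe a more careful statistical treatment (outlier screening, per-condition analysis, and so on).

```python
import statistics

ratings = [4, 5, 3, 4, 4, 5, 3, 4, 2, 4]  # hypothetical 1-5 scores from ten listeners

mos = statistics.mean(ratings)
stdev = statistics.stdev(ratings)
# 1.96 approximates the two-sided 95% interval under a normal assumption.
ci95 = 1.96 * stdev / len(ratings) ** 0.5

print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```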


Fig. 1 An example of a degraded speech signal and its reference

In some circumstances, conducting listening sessions for collecting MOS is infeasible, laborious, or costly due to the large volume of the data to be tested. 

This is where automated perceptual speech quality measures come into play. The aim is to replicate the way humans evaluate speech quality while avoiding the subject bias inherent in subjective listening panels.

In short, the goal is to objectively approximate or predict voice quality in the presence of noise. Perceptual speech quality is estimated in two settings, depending on whether clean reference speech is available for comparison with the system output or not (depicted in Fig. 2).


Fig. 2 Schematic view of intrusive and non-intrusive measures (source: ITU-T P.563)

Double-ended (intrusive) speech quality assessment measures

The model, in this case, has access to both the reference and output audio of the speech processing system. Its score is given based on the differences between the two. Please note that model and measure are used interchangeably in this article. 

To mimic a human listener, automated methods should “catch” speech impairments that are detectable by the human ear/brain and assign them a quantitative value. It’s not enough to compute only mathematical differences between the audio samples (waveforms). 

Such algorithms need to model the human psychoacoustic system. In particular, they’re expected to capture the ear functionality as well as certain cognitive effects. 

To mimic human hearing capability, the model obtains an “internal” representation of the signals that mimics the transformations happening within the human ear. Signals of different frequencies are detected by separate areas of the cochlea (see Fig. 3). This “coverage” of frequencies isn’t linear. As a consequence, the audible spectrum can be partitioned into frequency bands of varying widths that the ear perceives as equally wide.

This phenomenon is modeled by psychoacoustic scales such as Bark and Mel scales (e.g. PESQ, NISQA double-ended, and PEAQ basic version), or by filter banks like the Gammatone filter bank (e.g. PEMO-Q, ViSQOL, PEAQ advanced version).
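
As a small illustration, here is the widely used Hz-to-Mel mapping and how it compresses high frequencies into fewer perceptual units; actual measures use full psychoacoustic filter banks rather than a single conversion function.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Common Mel-scale mapping: equal Mel steps roughly match equal perceived pitch steps."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# A 1 kHz step covers far fewer Mel units at high frequencies,
# mirroring the ear's coarser resolution there.
for f in (500, 1500, 4000, 5000, 8000):
    print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} Mel")
```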


Fig. 3 Schematic view of auditory system (source)

Other voice quality aspects to consider are:

  • The way loudness is perceived – this depends on the frequency structure of the sound, as well as its duration. 
  • The absolute hearing threshold – another time/frequency-dependent element, which is lowest between roughly 2 and 3 kHz and higher at other frequencies.
  • Masking effects – a loud sound may make a weaker sound inaudible when the latter occurs simultaneously or shortly after the former.

For audio quality testing in the case of telecommunication networks, it’s important to take into account missing signal fragments due to lost packets. Several models (e.g., PESQ, POLQA, NISQA) incorporate intricate alignment mechanisms that may take up the bulk of their computation. 

Take PESQ's alignment procedure as an example: it's based on cross-correlation between speech fragments, whereas NISQA uses attention networks for this purpose.

Having computed an internal representation of reference speech and output signals, the model then works out the value of an appropriate function. The latter is designed to measure the difference between the two representations, mapping the result to a MOS value. 

Quantifying the difference between internal representations is one of the main distinguishing factors of various quality measures. This may include further modeling of cognitive effects or resorting to pre-trained deep learning models (e.g. NISQA). 

As an example, when evaluating loudness differences between signal fragments, PESQ gives higher penalty to missing fragments of speech than to additive noise. That’s due to the former being perceived as more disturbing to the listener.
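
To make this pipeline concrete, here is a deliberately simplified intrusive measure: both signals are mapped to log-Mel "internal" representations and their distance is converted into a score. It's only a toy sketch (using librosa as an assumed dependency); PESQ, POLQA, and NISQA implement far richer auditory and cognitive models, and the final MOS mapping below is an arbitrary placeholder rather than a fitted function.

```python
import numpy as np
import librosa  # assumed for the sketch; real measures implement their own auditory models

def toy_intrusive_score(reference: np.ndarray, degraded: np.ndarray, sr: int) -> float:
    """Toy intrusive measure: log-Mel distance between signals mapped to a 1-5 score."""
    def internal(x):
        mel = librosa.feature.melspectrogram(y=x, sr=sr, n_mels=40)
        return librosa.power_to_db(mel)

    ref_repr, deg_repr = internal(reference), internal(degraded)
    frames = min(ref_repr.shape[1], deg_repr.shape[1])  # no real time alignment here
    distance = np.mean(np.abs(ref_repr[:, :frames] - deg_repr[:, :frames]))

    # Arbitrary monotone mapping from distance to a MOS-like value in (1, 5];
    # real measures fit this mapping to subjective MOS data.
    return float(1.0 + 4.0 / (1.0 + distance / 10.0))
```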

Single-ended (non-intrusive) speech quality measures

In this case, the test algorithm only evaluates the output of the system without access to the reference speech. These metrics check if the given sound fragment is indeed human speech and whether it’s of good quality. 

To achieve this, the way speech is produced by the human vocal tract should be modeled (see Fig. 4), along with the modeling of the auditory system. This appears to be a much more complex task than modeling the auditory system alone, as it involves more components and parameters. Additionally, the model needs to detect noise and missing speech fragments as well.


Fig. 4 Schematic view of human speech production system (source)

Some speech quality assessment methods do such modeling explicitly. For instance, the algorithm from the ITU-T P.563 standard estimates parameters of the human speech production system from a recording. 

Using this, the algorithm generates a clean version of the reference speech signal. The result is then compared with the output signal using an intrusive quality measure like PESQ.

As mentioned earlier, such objective algorithmic models rely on a considerable number of hand-crafted parameters. This means this problem is a good candidate for machine learning methods. 

There have been quite a few recent works approaching single-ended audio quality estimation using Deep Neural Nets (DNN) and other machine learning techniques. The DNN usually consists of a convolution-based feature extraction phase and a subsequent higher-level analysis phase where LSTM or Attention-based modules are used (e.g. NISQA, SESQA, MOSNet) or simply dense layers or pooling (e.g. CNN-ELM, WAWEnets). The result is then mapped to a MOS prediction value. 
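
Below is a minimal PyTorch sketch of that pattern (convolutional feature extraction followed by pooling and a dense head that outputs a MOS prediction); the layer sizes are illustrative only and don't correspond to any of the published models.

```python
import torch
import torch.nn as nn

class ToyNonIntrusiveMOS(nn.Module):
    """Toy single-ended quality predictor: log-Mel input -> CNN features -> pooled -> MOS."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool over both time and frequency
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32, 1))

    def forward(self, log_mel: torch.Tensor) -> torch.Tensor:
        # log_mel: (batch, 1, n_mels, frames); output: (batch, 1) MOS estimate
        return self.head(self.features(log_mel))

model = ToyNonIntrusiveMOS()
fake_batch = torch.randn(8, 1, 40, 300)  # hypothetical batch of log-Mel spectrograms
print(model(fake_batch).shape)           # torch.Size([8, 1])
```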

Another approach that’s been used with machine learning was, similar to ITU-T P.563, to synthesize a clean pseudo-reference version of the system output speech using machine learning (e.g. by using Gaussian Mixture Models to derive noise statistics from the noisy output of the system and compensate for the noise) and then compare it with the output speech using an intrusive method (see ref. 16).

Data generation for DNN models involves introducing speech degradations typical of the target use scenario – for example, applying various audio codecs to speech samples, mixing with noise, simulating packet loss, and applying low/high-pass filters.
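
For instance, two of these degradations could be simulated as in the sketch below; the frame length, loss rate, and cutoff are arbitrary choices for illustration, and real data generation pipelines are considerably richer.

```python
import numpy as np
from scipy.signal import butter, lfilter

def simulate_packet_loss(signal: np.ndarray, sr: int, loss_rate: float = 0.05,
                         frame_ms: float = 20.0) -> np.ndarray:
    """Zero out random fixed-size frames to mimic lost packets (no concealment)."""
    out = signal.copy()
    frame = int(sr * frame_ms / 1000)
    for start in range(0, len(out) - frame, frame):
        if np.random.rand() < loss_rate:
            out[start:start + frame] = 0.0
    return out

def low_pass(signal: np.ndarray, sr: int, cutoff_hz: float = 3400.0) -> np.ndarray:
    """Crude low-quality-channel simulation with a Butterworth low-pass filter."""
    b, a = butter(4, cutoff_hz / (sr / 2), btype="low")
    return lfilter(b, a, signal)
```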

For supervised training, the training data needs to be annotated. The two main approaches here are either to annotate the data using known intrusive quality measures (e.g. SESQA, Quality-Net, WAWEnets) or to conduct listening sessions for collecting MOS scores (MOSNet, NISQA, AutoMOS, CNN-ELM). In fact, models from the former group annotate the data using several intrusive measures, with the aim of smoothing out the shortcomings of any particular measure.

Applicability for speech quality measures

Each speech quality measure was fitted to some limited data in the design stage and developed with usage scenarios and audio distortions in mind. 

The specific application scenarios of perceived audio quality measures range considerably across different fields, such as telecommunication networks, VoIP, source speech separation or noise cancellation, speech-to-text algorithms, and audio codec design. This is true for both algorithmic and machine learning-based models, raising the question of cross-domain applicability of models. 

From this point of view, DNN models crucially depend on the data distribution they’re trained on. On the other hand, algorithmic models are based on psychoacoustic research that doesn’t directly rely on any dataset. 

The parameters of algorithmic measures are tuned to fit the measure output to MOS values of some dataset, but this dependence seems less crucial than for DNN models. 

There are algorithmic measures that have been designed to be general quality measures, such as PEMO-Q. This is well reflected in a recent study (see ref. 18) that examines the domain dependence of intrusive models with respect to the audio coding and source separation domains. Among other things, the authors found that standards like PESQ and PEAQ fare very well across these domains, even though they weren't designed for source separation. For PEAQ, one needs to take a re-weighted combination of only a subset of its multiple output values to achieve good results.

Another aspect of usability is bandwidth dependence. This limitation is often specific to algorithmic measures. 

While it’s rather easy to simulate audio instances of various bandwidths during data generation in DNNs, algorithmic models need explicit parameter tuning to give dependable outputs for different bandwidths. For example, the original ITU-T standard for PESQ was designed specifically for narrowband speech (then extended to support wideband), while its successor, POLQA, supports full-band speech. 

A final consideration for speech and audio quality testing is performance, which matters most for large-scale, regular testing. DNN models can benefit from optimized batch processing, while multiprocessing can be applied with other measures.

Performance can be further improved if models are modular, so that functionality that isn't necessary for a given application can be turned off. For instance, we could improve the performance of the intrusive NISQA model in a noise cancellation application by removing its alignment layer (which isn't needed in this situation).

This was a short glimpse into the key points of objective speech quality measurement and prediction. It's an active research area with many facets that can't be fully covered in a brief post. Please review the publications below for a more detailed description of signal-to-noise ratio measures and other speech quality assessment algorithms.

Try next-level voice and audio technologies

Krisp rigorously tests its voice technologies utilizing both objective and subjective methodologies. Krisp licenses its SDKs to developers to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.



The article is written by:

Tigran Tonoyan, PhD in Computer Science, Senior ML Engineer II
Aris Hovsepyan, BSc in Computer Science, ML Engineer II
Hovhannes Shmavonyan, PhD in Physics, Senior ML Engineer I
Hayk Aleksanyan, PhD in Mathematics, Principal ML Engineer, Tech Lead


References:

  1. ITU-T P.830: https://www.itu.int/rec/T-REC-P.830/en 
  2. ITU-R BS.1116: https://www.itu.int/rec/R-REC-BS.1116
  3. ITU-T P.835: https://www.itu.int/rec/T-REC-P.835/en 
  4. PESQ: ITU-T P.862, https://www.itu.int/rec/T-REC-P.862 
  5. NISQA: G. Mittag et al., NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets, INTERSPEECH 2021 (see also the first author’s PhD thesis)
  6. PEAQ: ITU-R BS.1387, https://www.itu.int/rec/R-REC-BS.1387/en 
  7. PEMO-Q: R. Huber and B. Kollmeier, PEMO-Q – A New Method for Objective Audio Quality Assessment Using a Model of Auditory Perception, IEEE Transactions on Audio, Speech, and Language Processing, 14 (6)
  8. ViSQOL: A. Hines et al., ViSQOL: The Virtual Speech Quality Objective Listener, IWAENC 2012
  9. ITU-T P.563: https://www.itu.int/rec/T-REC-P.563/en 
  10. POLQA: ITU-T P.863, https://www.itu.int/rec/T-REC-P.863 
  11. ANIQUE: D.-S. Kim, ANIQUE: an auditory model for single-ended speech quality estimation, IEEE Transactions on Speech and Audio Processing 13 (5)
  12. SESQA: J. Serrà et al., SESQA: Semi-Supervised Learning for Speech Quality Assessment, ICASSP 2021
  13. MOSNet: C.-C. Lo et al., MOSNet: Deep Learning-Based Objective Assessment, INTERSPEECH 2019
  14. CNN-ELM: H. Gamper et al., Intrusive and Non-Intrusive Perceptual Speech Quality Assessment Using a Convolutional Neural Network, WASPAA 2019
  15. WAWEnets: A. Catellier and S. Voran, Wawenets: A No-Reference Convolutional Waveform-Based Approach to Estimating Narrowband and Wideband Speech Quality, ICASSP 2020
  16. Y. Shan et al., Non-intrusive Speech Quality Assessment Using Deep Belief Network and Backpropagation Neural Network, ISCSLP 2018
  17. Quality-Net: S. Fu et al., Quality-Net: An end-to-end non-intrusive speech quality assessment model based on BLSTM, INTERSPEECH 2018
  18.  M. Torcoli et al., Objective Measures of Perceptual Audio Quality Reviewed: An Evaluation of Their Application Domain Dependence, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29
  19. ETSI standard EG 202 396-3: https://www.etsi.org 

The post Speech Quality Measurement Algorithms and Testing Technology appeared first on Krisp.
