Elevate Your Contact Center Experience with Krisp Background Voice Cancellation (BVC)
Krisp BVC is an advanced AI noise-canceling technology that eliminates all background noise and competing voices nearby, including the voices of other agents. This breakthrough technology is enabled as soon as an agent plugs in their headset, without requiring individual voice enrollment or training. The solution integrates smoothly with both native applications and browser-based calling applications via WebAssembly JavaScript (WASM JS), ensuring high performance and efficiency.
Customers often struggle to understand agents when there is background chatter, leading to frustration and reduced satisfaction. With Krisp BVC, all extraneous voices and noises are filtered out, allowing customers to focus solely on the agent they are speaking with. This ensures a smooth and professional interaction every time, which directly contributes to higher CSAT scores.
In a contact center, the risk of customers overhearing personal information from other calls is a significant concern, especially for financial and healthcare customers. Krisp BVC addresses this by completely isolating the agent’s voice from the background, ensuring that sensitive information remains confidential.
While headsets and other hardware solutions provide some noise reduction, they do not eliminate background voices. Krisp BVC works independently of hardware, offering superior noise and background voice cancellation without the need for additional devices or complicated setups.
Once the agent’s headset is plugged in, Krisp BVC is activated automatically. There’s no need for agents to enroll their voice or go through any training process, making it an effortless solution that saves time and resources.
Krisp BVC is uniquely available for both native applications and browser-based calling applications through WASM JS. This means it can be integrated effortlessly into various platforms, ensuring consistent performance and reliability.
Krisp BVC is designed to run efficiently in the browser, making it an ideal solution for Contact Center as a Service (CCaaS) platforms. Its high-performance capabilities ensure minimal latency and a smooth user experience.
With the enhanced clarity of communication provided by Krisp BVC, customers are more likely to have positive interactions with agents. This leads to increased satisfaction, as reflected in improved CSAT metrics reported to us by a number of customers. Clear and effective communication is crucial in resolving issues promptly and accurately, which in turn boosts customer loyalty and satisfaction.
Integrating Krisp BVC into your contact center application is straightforward. Here’s a sample code snippet to demonstrate how simple it is to get started:
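The sketch below shows the general shape of a browser-side integration with a BVC-enabled JS SDK. The KrispSDK import, the init and createNoiseFilter calls, and the model path are illustrative assumptions rather than the SDK's confirmed API; the Developer Portal documentation has the authoritative snippet.

```js
// Illustrative sketch only: KrispSDK, init, createNoiseFilter and the model path
// are placeholder names, not the confirmed Krisp JS SDK API.
import KrispSDK from "./assets/krisp-sdk.mjs"; // hypothetical module hosted with your app

async function enableKrispBVC() {
  // Capture the agent's microphone; BVC needs no voice enrollment or training.
  const rawStream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // Initialize the SDK with a BVC model (placeholder configuration).
  const audioContext = new AudioContext();
  const sdk = new KrispSDK({ models: { bvc: "./assets/model_bvc.kw" } });
  await sdk.init();

  // Create a filter node that removes background noise and competing voices.
  const krispNode = await sdk.createNoiseFilter(audioContext);

  // Route the cleaned audio into a new MediaStream for the calling application.
  const source = audioContext.createMediaStreamSource(rawStream);
  const destination = audioContext.createMediaStreamDestination();
  source.connect(krispNode).connect(destination);

  return destination.stream; // hand this stream to your softphone / WebRTC stack
}
```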
The graphical representation above illustrates the clarity and focus achieved by using Krisp BVC. Notice how the agent’s speech is clear and distinct, free from background distractions.
Experience the transformative power of Krisp BVC with this audio comparison:
Without BVC – Competing Agent Voices
With BVC – Clear communication
Integrating Krisp BVC into your contact center solutions can significantly enhance the quality of interactions and customer satisfaction. Its ease of integration, combined with superior performance and versatility, makes Krisp BVC a must-have feature for modern contact centers. Upgrade your communication systems today with Krisp Background Voice Cancellation and experience the difference it makes, including improved CSAT metrics.
Ready to get started? Visit Krisp’s Developer Portal for more information and comprehensive integration guides.
Enhancing Browser App Experiences: Krisp JS SDK Pioneers In-browser AI Voice Processing for Desktop and Mobile
In today's connected world, where web browsers serve as gateways to an assortment of online experiences, ensuring a seamless and productive user experience is paramount. One crucial aspect often overlooked in browser-based communication applications is voice quality, especially in scenarios where clarity of communication is essential.
From virtual meetings and online classes to contact center operations, the demand for clear audio communications has become ever more important, making AI Voice processing with noise and background voice cancellation an expected and highly sought-after feature. While standalone applications have provided this functionality, integrating this directly into browser-based applications has proven to be a challenge.
The need for noise and background voice cancellation extends beyond conventional communication platforms. In telehealth, for instance, where accurate communication is vital for call-based diagnosis and consultation, background noise and voices can hinder effective communication. Another interesting example is insurance companies taking calls from customers at the scene of an incident. Eliminating background noise ensures that critical information is accurately conveyed, leading to smoother claims processing and customer satisfaction. These, and many other use cases, often involve one-click web sessions for the calls.
The growing demand for quality communications in browser-based applications extends to both desktop and mobile devices. Up until recently, achieving compatibility with mobile devices, particularly with iOS Safari, posed significant difficulties. Limitations within Apple’s WebKit framework and the inherently CPU-intensive nature of JavaScript solutions hindered bringing the power of Krisp’s technologies to mobile browser applications.
The introduction of Single Instruction, Multiple Data (SIMD) support marked a significant opening for Krisp to deliver its market-leading technology into Safari specifically, and mobile browsers generally. SIMD enables parallel processing of data, significantly boosting performance and efficiency, particularly on mobile devices with limited computational resources.
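As an illustration, an application can check for WASM SIMD support at runtime before choosing which build to load. The sketch below uses the open-source wasm-feature-detect package; the bundle file names are assumptions.

```js
import { simd } from "wasm-feature-detect";

// Pick a SIMD-optimized WASM build on browsers that support it (including
// recent iOS Safari), and fall back to a scalar build elsewhere.
// The bundle file names are illustrative placeholders.
async function pickWasmBundle() {
  const hasSimd = await simd();
  return hasSimd ? "./krisp-audio.simd.wasm" : "./krisp-audio.wasm";
}
```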
By leveraging SIMD, the Krisp JS SDK achieves low CPU usage, making its market-leading noise cancellation available to users of mobile browser applications. This breakthrough not only enhances the user experience but also opens up new possibilities for web-based applications across various industries.
As Krisp’s technologies continue to evolve and extend into new territories, the ability to make AI Voice features available for all users across desktop and mobile browser-based applications is fundamental and allows users to have seamless access to the best voice processing technologies in the market.
Krisp licenses its SDKs to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.
Krisp Delivers AI-Powered Voice Clarity to Symphony's Trader Voice Products
BERKELEY, Calif., April 17, 2024 – Krisp, the world's leading AI-powered voice productivity software, announced today a new integration with Symphony's trader voice platform, Cloud9, to enhance voice audio clarity. The partnership delivers Krisp's advanced AI Noise Cancellation, enabling Cloud9 users to experience clear audio in challenging environments, such as trading floors and busy offices.
Through the integration, Symphony will also be able to enhance its built-for-purpose financial markets voice analytics in partnership with Google Cloud. Creating a space of uninterrupted audio between counterparties, both on and off Cloud9, enables more efficient communication with fewer disagreements, and it improves the accuracy of audio-recording transcription with compliance review in mind.
“We are thrilled to partner with Symphony and integrate our Voice AI technology into their products,” said Robert Schoenfield, EVP of Licensing and Partnerships at Krisp. “Symphony’s customers operate in difficult noisy environments, dealing with high value transactions. Improving their communication is of real value.”
Symphony’s chief product officer, Michael Lynch, said: “Cloud9 SaaS approach to trader voice gives our users the flexibility to work from anywhere, whether it’s a bustling trading desk or their living room, and KRISP improves our already best-in-class audio quality to help our users make the best possible real-time decisions while also improving post-trade analytics and compliance.”
This collaboration not only brings improved communication quality but also aligns with Krisp’s and Symphony’s commitment to bringing industry-leading solutions to their customers.
About Krisp
Founded in 2017, Krisp pioneered the world's first AI-powered Voice Productivity software. Krisp's Voice AI technology enhances digital voice communication through audio cleansing, noise cancellation, accent localization, and call transcription and summarization. Offering full privacy, Krisp works on-device, across all audio hardware configurations and applications that support digital voice communication. Today, Krisp processes over 75 billion minutes of voice conversations every month, eliminating background noise, echoes, and voices in real-time, helping businesses harness the power of voice to unlock higher productivity and deliver better business outcomes.
Learn more about Krisp’s SDK for developers.
Press contact:
Shara Maurer
Head of Corporate Marketing
[email protected]
About Symphony
Symphony is the most secure and compliant markets’ infrastructure and technology platform, where solutions are built or integrated to standardize, automate and innovate financial services workflows. It is a vibrant community of over half a million financial professionals with a trusted directory and serves over 1000 institutions. Symphony is powering over 2,000 community built applications and bots. For more information, visit www.symphony.com.
Press contact:
Odette Maher
Head of Communications and Corporate Affairs
+44 (0) 7747 420807 / [email protected]
Twilio Partners with Krisp to Provide AI Noise Cancellation to All Twilio Voice Customers
Trusted by more than 100 million users to process over 75 billion minutes of calls monthly, Krisp's Voice AI SDKs are designed to identify the human voice and cancel all background noise and voices to eliminate distractions on calls. Krisp SDKs are available for browsers (WASM JS), desktop apps (Win, Mac, Linux) and mobile apps (iOS, Android). The Krisp Audio Plugin for Twilio Voice is a lightweight audio processor that can run inside your client application and create crystal clear audio.
The plugin needs to be loaded alongside the Twilio SDK and runs as part of the audio pipeline between the microphone and audio encoder in a preprocessing step. During this step, the AI-based noise cancellation algorithm removes unwanted sounds like barking dogs, construction noises, honking horns, coffee shop chatter and even other voices.
After the preprocessing step, the audio is encoded and delivered to the end user. Note that all of these steps happen on device, with near zero latency and without any media sent to a server.
Krisp’s AI Noise Cancellation requires you to host and serve the Krisp audio plugin for Twilio Voice as part of your web application. It also requires browser support of the WebAudio API (specifically Worklet.addModule). Krisp has a team ready to support your integration and optimization for great voice quality.
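A minimal capability check along these lines can run before the plugin is loaded; it uses only standard Web Audio APIs and assumes nothing Krisp-specific.

```js
// Returns true if the browser exposes the Web Audio pieces the plugin relies on,
// in particular AudioWorklet and its addModule method.
function supportsAudioWorklet() {
  const Ctx = window.AudioContext || window.webkitAudioContext;
  if (typeof Ctx !== "function" || typeof AudioWorkletNode === "undefined") {
    return false;
  }
  const ctx = new Ctx();
  const ok = !!ctx.audioWorklet && typeof ctx.audioWorklet.addModule === "function";
  ctx.close();
  return ok;
}
```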
Learn more about Krisp here and apply for access to the world’s best Voice AI SDKs.
Visit the Krisp for Twilio Voice Developers page to request access to the Krisp SDK Portal. Once access is granted, download Krisp Audio JS SDK and place it in the assets of your project. Use the following code snippet to integrate the SDK with your project. Read the comments inside the code snippets for additional details.
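As a rough sketch of that integration, the general pattern with the Twilio Voice JS SDK v2 AudioProcessor interface looks like the following. The Krisp plugin import path, class name, and options are placeholders, and the exact plugin API may differ; the authoritative snippet lives in the Krisp SDK Portal.

```js
// Sketch only: the Krisp plugin import path, class name and options are
// illustrative placeholders. The Device / audio.addProcessor usage follows the
// Twilio Voice JS SDK v2 AudioProcessor pattern.
import { Device } from "@twilio/voice-sdk";
import KrispAudioProcessor from "./assets/krisp-audio-plugin.mjs"; // hosted with your web app

async function createDeviceWithKrisp(accessToken) {
  const device = new Device(accessToken);

  // The processor sits between the microphone and the audio encoder and runs
  // entirely on-device, removing noise and background voices before encoding.
  const krispProcessor = new KrispAudioProcessor({
    modelPath: "./assets/krisp-model.kw", // placeholder model location
  });
  await device.audio.addProcessor(krispProcessor);

  await device.register(); // start listening for incoming calls
  return device;
}
```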
Visit Krisp for Twilio Voice Developers and get started today.
On-Device STT Transcriptions: Accurate, Secure and Less Expensive
The on-device requirement has in many ways shaped the technical specifications of the technology and posed a series of challenges that the team has been able to tackle head-on, working through various iterations. The path to achieving high-quality on-device STT continues: the Krisp app has now transcribed over 15 million hours of calls, and the company is making this technology available to its license partners via on-device SDKs. Let's dive into the specific challenges Krisp worked through to bring this technology to market.
Without diving into the specifics of on-device STT technology and its architecture, one of the first and most obvious constraints guiding development was computational resources. On-device STT systems operate within the confines of limited resources, including CPU, memory, and power. Unlike cloud-based solutions, which can leverage expansive server infrastructure, on-device systems must deliver comparable performance with significantly restricted resources. This constraint necessitates the optimization of algorithms, models, and processing pipelines to ensure efficient resource utilization without compromising accuracy and responsiveness. In many use cases, STT needs to run alongside Noise Cancellation and other technologies, which further constrains the overall resource budget.
The effectiveness of STT models hinges on their complexity and size, with larger models generally exhibiting superior accuracy and robustness. However, deploying large models on-device presents a formidable challenge, as it exacerbates memory and processing overheads. Balancing model complexity and size becomes paramount, requiring developers to employ techniques like model pruning, quantization, and compression to achieve optimal trade-offs between performance and resource utilization.
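For example, post-training quantization, one of the techniques mentioned above, stores each 32-bit floating-point weight w as a b-bit integer. In the simplest uniform scheme:

Δ = (w_max - w_min) / (2^b - 1),  q = round((w - w_min) / Δ),  ŵ = w_min + q·Δ

This shrinks the model by roughly a factor of 32/b (for example, about 4x at 8 bits) in exchange for a small, controllable loss in accuracy.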
In order to achieve high quality transcripts and feature-rich speech-to-text systems, there is a need to build complex network architectures consisting of a number of AI models and algorithms. Such models include language models, punctuation and text normalization, speaker diarization and personalization (custom vocabulary) models, each presenting unique technical challenges and performance considerations.
The technology that Krisp employs both in its app and SDKs includes a combination of all of the above-mentioned technologies, as well as other adjacent algorithms to ensure readability and grammatical coherence of the final output.
The language model enhances transcription accuracy by predicting the likelihood of word sequences based on contextual and syntactic information. It helps in disambiguating words and improving the coherence of transcribed text. The Punctuation & Capitalization Model predicts the appropriate punctuation marks and capitalization based on speech patterns and semantic cues, enhancing the readability and comprehension of transcribed text. The Inverse Text Normalization model standardizes and formats transcribed text to adhere to predefined conventions, such as converting numbers to textual representations or vice versa, expanding abbreviations, and correcting spelling errors. For cases where customers have domain-specific terminology or proper names that are not widely recognized by the standard models, Krisp also provides Custom Vocabulary support.
Apart from the features ensuring text readability and accuracy, another key technology included in Krisp's on-device STT is Speaker Diarization. This model segments speech into distinct speaker segments, enabling the identification and differentiation of multiple speakers within a conversation or audio stream. It is crucial for speaker-dependent processing and improving transcription accuracy in multi-speaker scenarios.
Depending on the use case, on-device STT technology may have to deliver real-time or near-real-time processing to enable seamless user interactions across diverse applications. Achieving low-latency speech recognition necessitates streamlining inference pipelines, minimizing computational overheads, and optimizing signal processing algorithms. Moreover, the heterogeneity of device architectures and hardware accelerators further complicates real-time performance optimization, requiring tailored solutions for different platforms and configurations. Krisp's developers have struck a delicate balance here: managing latency, selecting optimal model combinations, ensuring processing synergy, and addressing the scalability and flexibility of the pipeline to accommodate various use cases.
With a global and multi-domain user-base, there is an inherent variability of speech arising from diverse accents, vocabularies, environments, and speaking styles. Our on-device STT technology must exhibit robustness to such variability to ensure consistent performance across disparate contexts. This entails training models on diverse datasets, augmenting training data to encompass various scenarios, and implementing robust feature extraction techniques capable of capturing salient speech characteristics while mitigating noise and various device or network-dependent distortions.
In addition to addressing resource constraints and optimizing algorithms for on-device STT, Krisp prioritizes rigorous speech recognition testing to ensure its technology’s robustness across diverse accents, environments, and speaking styles.
Along with being on-device, the technologies underlying the Krisp app AI Meeting Assistant are also designed with embeddability in mind. Integrating on-device STT technology into communication applications and devices presents a range of additional challenges, all of which Krisp has tackled. Resources must be carefully allocated to ensure optimal performance without compromising existing customer infrastructure. Customization and configuration options are essential to meet the diverse needs of end-users while maintaining scalability and performance across large-scale deployments. Security and compliance considerations demand robust encryption and privacy measures to protect sensitive data. Seamless integration with existing infrastructure, including telephony systems and collaboration tools, requires interoperability standards, codec support and integration frameworks.
One prevailing requirement for communication services is for on-device STT technology to be functional on the web. This presents a new set of challenges in terms of further resource optimization, as well as compatibility across diverse web platforms, browsers, frameworks and devices.
While the integration of on-device STT technology into communication applications and devices presents challenges and requires meticulous resource utilization, customization, and seamless interoperability, Krisp has addressed these challenges and today delivers embedded STT solutions that enhance the functionality and value proposition for applications and their end-users.
Krisp licenses its SDKs to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.
Deep Dive: AI's Role in Accent Localization for Call Centers
Accent refers to the distinctive way in which a group of people pronounce words, influenced by their region, country, or social background. In broad terms, English accents can be categorized into major groups such as British, American, Australian, South African, and Indian, among others.
Accents can often be a barrier to communication, affecting the clarity and comprehension of speech. Differences in pronunciation, intonation, and rhythm can lead to misunderstandings.
While the importance of this topic goes beyond call centers, our primary focus is this industry.
The call center industry in the United States has experienced substantial growth, with a noticeable surge in the creation of new jobs from 2020 onward, both onshore and globally.
In 2021, many US based call centers expanded their footprints thanks to the pandemic-fueled adoption of remote work, but growth slowed substantially in 2022. Inflated salaries and limited resources drove call centers to deepen their offshore operations, both in existing and new geographies.
There are several strong incentives for businesses to expand call center operations to offshore locations, including lower operating costs and access to a wider talent pool.
However, offshore operations come with a cost. One major challenge offshore call centers face is decreased language comprehension. Accents, varying fluency levels, cultural nuances and inherent biases lead to misunderstandings and frustration among customers.
According to Reuters, as many as 65% of customers have cited difficulties in understanding offshore agents due to language-related issues. Over a third of consumers say working with US-based agents is most important to them when contacting an organization.
While the world celebrates global and diverse workforces at large, research shows that misalignment of native language backgrounds between speakers leads to a lack of comprehension and inefficient communication.
Accent neutralization training is used as a solution to improve communication clarity in these environments. Call Centers invest in weeks-long accent neutralization training as part of agent onboarding and ongoing improvement. Depending on geography, duration, and training method, training costs can run $500-$1500 per agent during onboarding. The effectiveness of these training programs can be limited due to the inherent challenges in altering long-established accent habits. So, call centers may find it necessary to temporarily remove agents from their operational roles for further retraining, incurring additional costs in the process.
Call centers limit their site selection to regions and countries where the accents of the available talent pool are considered more neutral to the customer's native language, sacrificing locations that would be more cost-effective.
Recent advancements in Artificial Intelligence have introduced new accent localization technology. This technology leverages AI to translate a source accent to a target accent in real-time, with the click of a button. While the technologies in production don't support multiple accents in parallel, over time this will be solved as well.
Below is the evolution of Krisp’s AI Accent Localization technology over the past 2 years.
| Version | Demo |
|---|---|
| v0.1: First model | (audio demo) |
| v0.2: A bit more natural sound | (audio demo) |
| v0.3: A bit more natural sound | (audio demo) |
| v0.4: Improved voice | (audio demo) |
| v0.5: Improved intonation transfer | (audio demo) |
This innovation is revolutionary for call centers as it eliminates the need for difficult and expensive training and increases the talent pool worldwide, providing immediate scalability for offshore operations.
It’s also highly convenient for agents and reduces the cognitive load and stress they have today. This translates to decreased short-term disability claims and attrition rates, and overall improved agent experience.
There are various ways AI Accent Localization can be integrated into a call center’s tech stack.
It can be embedded into a call center’s existing CX software (e.g. CCaaS and UCaaS) or installed as a separate application on the agent’s machine (e.g. Krisp).
Currently, there are no CX solutions on the market with accent localization capabilities, leaving the latter as the only possible path forward for call centers looking to leverage this technology today.
Applications like Krisp have accent localization built into their offerings.
These applications are on-device, meaning they sit locally on the agent’s machine. They support all CX software platforms out of the box since they are installed as a virtual microphone and speaker.
AI runs on an agent’s device so there is no additional load on the network.
The deployment and management can be done remotely, and at scale, from the admin dashboard.
At a fundamental level, speech can be divided into 4 parts: voice, text, prosody and accent.
Accents can be divided into 4 parts as well – phoneme, intonation, stress and rhythm.
In order to localize or translate an accent, three of these parts must be changed – phoneme pronunciation, intonation, and stress. Doing this in real-time is an extremely difficult technical problem.
While there are numerous technical challenges in building this technology, we will focus on eight major ones.
Let’s discuss them individually.
Collecting accented speech data is a tough process. The data must be highly representative of different dialects spoken in the source language. Also, it should cover various voices, age groups, speaking rates, prosody, and emotion variations. For call centers, it is preferable to have natural conversational speech samples with rich vocabulary targeted for the use case.
There are two options: buy ready data or record and capture the data in-house. In practice, both can be done in parallel.
An ideal dataset would consist of thousands of hours of speech where source accent utterance is mapped to each target accent utterance and aligned with it accurately.
However, getting precise alignment is exceedingly challenging due to variations in the duration of phoneme pronunciations. Nonetheless, improved alignment accuracy contributes to superior results.
The speech synthesis part of the model, which is sometimes referred to as the vocoder algorithm in research, should produce a high-quality, natural-sounding speech waveform. It is expected to sound closer to the target accent, have high intelligibility, be low-latency, convey natural emotions and intonation, be robust against noise and background voices, and be compatible with various acoustic environments.
As studies by the International Telecommunication Union show (G.114 recommendation), speech transmission maintains acceptable quality during real-time communication if the one-way delay is less than approximately 300 ms. Therefore, the latency of the end-to-end accent localization system should be within that range to ensure it does not impact the quality of real-time conversation.
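As a rough illustration (the per-stage figures here are assumptions, not measurements), the 300 ms budget has to absorb every stage of the one-way path:

t_capture + t_AL + t_encode + t_network + t_jitter + t_playout ≤ 300 ms

So if capture framing, encoding, network transit, and the receiver's jitter buffer together consume on the order of 150-200 ms, only about 100-150 ms remains for the accent localization processing itself.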
There are two ways to run this technology: locally or in the cloud. While both have theoretical advantages, in practice, more systems with similar characteristics (e.g. AI-powered noise cancellation, voice conversion, etc.) have been successfully deployed locally. This is mostly due to hard requirements around latency and scale.
To be able to run locally, the end-to-end neural network must be small and highly optimized, which requires significant engineering resources.
Having a sophisticated noise cancellation system is crucial for this Voice AI technology. Otherwise, the speech synthesizing model will generate unwanted artifacts.
Not only should it eliminate the input background noise but also the input background voices. Any sound that is not the speaker’s voice must be suppressed.
This is especially important in call center environments where multiple agents sit in close proximity to each other, serving multiple customers simultaneously over the phone.
Detecting and filtering out other human voices is a very difficult problem. As of this writing, to our knowledge, there is only one system doing it properly today – Krisp’s AI Noise Cancellation technology.
Acoustic conditions differ for call center agents. The sheer volume of combinations of device microphones and room setups (responsible for room echo) makes it very difficult to design a system that is robust against such input variations.
Not transferring the speaker's intonation to the generated speech will result in robotic speech that sounds worse than the original.
Krisp addressed this issue by developing an algorithm capturing input speaker’s intonation details in real-time and leveraging this information in the synthesized speech. Solving this challenging problem allowed us to increase the naturalness of the generated speech.
It is desirable to maintain the speaker’s vocal characteristics (e.g., formants, timbre) while generating output speech. This is a major challenge and one potential solution is designing the speech synthesis component so that it generates speech conditioned on the input speaker’s voice ‘fingerprint’ – a special vector encoding a unique acoustic representation of an individual’s voice.
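Schematically (this is a generic formulation rather than a description of Krisp's internal architecture), the synthesizer G is conditioned on a speaker embedding extracted from the input speech x:

e = E(x),  ŷ = G(c, e)

where E is the speaker encoder producing the voice 'fingerprint' e, c is the accent-converted content representation, and ŷ is the synthesized output that retains the original speaker's timbre.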
Mispronounced words can be difficult to correct in real-time, as the general setup would require separate automatic speech recognition and language modeling blocks, which introduce significant algorithmic delays and fail to meet the low latency criterion.
One approach to accent localization involves applying Speech-to-Text (STT) to the input speech and subsequently utilizing Text-to-Speech (TTS) algorithms to synthesize the target speech.
This approach is relatively straightforward and involves common technologies like STT and TTS, making it conceptually simple to implement.
STT and TTS are well-established, with existing solutions and tools readily available.
Integration into the algorithm can leverage these technologies effectively. These represent the strengths of the method, yet it is not without its drawbacks. There are three of them: the end-to-end latency of chaining STT and TTS is too high for real-time conversation; regenerating speech from recognized text discards the speaker's intonation, emotion, and vocal characteristics; and any recognition errors made in the STT step carry over into the synthesized output.
First, let’s define what a phoneme is. A phoneme is the smallest unit of sound in a language that can distinguish words from each other. It is an abstract concept used in linguistics to understand how language sounds function to encode meaning. Different languages have different sets of phonemes; the number of phonemes in a language can vary widely, from as few as 11 to over 100. Phonemes themselves do not have inherent meaning but work within the system of a language to create meaningful distinctions between words. For example, the English phonemes /p/ and /b/ differentiate the words “pat” and “bat.”
The objective is to first map the source speech to a phonetic representation, then map the result to the target speech’s phonetic representation (content), and then synthesize the target speech from it.
This approach enables comparatively smaller delays than Approach 1. However, it faces the challenge of generating natural-sounding speech output, since relying solely on phoneme information is insufficient for accurately reconstructing the target speech. To address this issue, the model should also extract additional features such as speaking rate, emotions, loudness, and vocal characteristics. These features should then be integrated with the target speech content to synthesize the target speech based on these attributes.
Another approach is to create parallel data using deep learning or digital signal processing techniques. This entails generating a native target-accent sounding output for each accented speech input, maintaining consistent emotions, naturalness, and vocal characteristics, and achieving an ideal frame-by-frame alignment with the input data.
If high-quality parallel data are available, the accent localization model can be implemented as a single neural network algorithm trained to directly map input accented speech to target native speech.
The biggest challenge of this approach is obtaining high-quality parallel data. The quality of the final model directly depends on the quality of the parallel data.
Another drawback is the lack of integrated explicit control over speech characteristics, such as intonation, voice, or loudness. Without this control, the model may fail to accurately learn these important aspects.
High-quality output of accent localization technology should be highly intelligible, phonetically accurate in the target accent, natural-sounding, and faithful to the speaker's voice.
To evaluate these quality features, we use the following objective metrics: Word Error Rate (WER) and Phoneme Error Rate (PER), complemented by naturalness assessments based on the Mean Opinion Score (MOS).
WER is a crucial metric used to assess STT systems’ accuracy. It quantifies the word level errors of predicted transcription compared to a reference transcription.
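In its standard form, WER is computed from the minimum-edit-distance alignment between the predicted and reference transcripts:

WER = (S + D + I) / N

where S, D, and I are the numbers of substituted, deleted, and inserted words, and N is the number of words in the reference.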
To compute WER we use a high-quality STT system on generated speech from test audios that come with predefined transcripts.
The evaluation process is the following: the accent localization model is run on test audios that come with reference transcripts, the generated speech is transcribed with the high-quality STT system, and the WER of the predicted transcription is computed against the reference.
The assumption in this methodology is that a model demonstrating better intelligibility will have a lower WER score.
The AL model may retain some aspects of the original accent in the converted speech, notably in the pronunciation of phonemes. Given that state-of-the-art STT systems are designed to be robust to various accents, they might still achieve low WER scores even when the speech exhibits accented characteristics.
To identify phonetic mistakes, we employ the Phoneme Error Rate (PER) as a more suitable metric than WER. PER is calculated in a manner similar to WER, focusing on phoneme errors in the transcription, rather than word-level errors.
For PER calculation, a high-quality phoneme recognition model is used, such as the one available at https://huggingface.co/facebook/wav2vec2-xlsr-53-espeak-cv-ft. The evaluation process mirrors the WER methodology: phoneme sequences are extracted from the generated speech and compared against the reference, and the phoneme-level error rate is computed.
This method addresses the phonetic precision of the AL model to a certain extent.
To assess the naturalness of generated speech, one common method involves conducting subjective listening tests. In these tests, listeners are asked to rate the speech samples on a 5-point scale, where 1 denotes very robotic speech and 5 denotes highly natural speech.
The average of these ratings, known as the Mean Opinion Score (MOS), serves as the naturalness score for the given sample.
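For a sample rated by N listeners with scores r_1, …, r_N, this is simply the arithmetic mean:

MOS = (r_1 + r_2 + … + r_N) / N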
In addition to subjective evaluations, obtaining an objective measure of speech naturalness is also feasible. This is a distinct research direction: predicting the naturalness of generated speech using AI. Models in this domain are developed using large datasets of subjective listening assessments of the naturalness of generated speech (obtained from various speech-generating systems such as text-to-speech and voice conversion).
These models are designed to predict the MOS score for a speech sample based on its characteristics. Developing such models is a great challenge and remains an active area of research. Therefore, one should be careful when using these models to predict naturalness. Notable examples include the self-supervised learned MOS predictor and NISQA, which represent significant advances in this field.
In addition to the objective metrics mentioned above, we conduct subjective listening tests and calculate objective scores using MOS predictors. We also manually examine the quality of these objective assessments. This approach enables a thorough analysis of the naturalness of our AL models, ensuring a well-rounded evaluation of their performance.
The following diagrams show how the training and inference are organized.
AI Training
AI Inference
In navigating the complexities of global call center operations, AI Accent Localization technology is a disruptive innovation, primed to bridge language barriers and elevate customer service while expanding talent pools, reducing costs, and revolutionizing CX.
The Power of On-Device Transcription in Call Centers
With advancements in speech-to-text AI and on-device AI, the call center industry is approaching a transformative change. We should start rethinking the traditional approach of cloud-based transcriptions, bringing the process directly onto the agents' devices.
Let’s dive into why this makes sense for modern call centers and BPOs.
Speech Enhancement On The Inbound Stream: Challenges
In today's digital age, online communication is essential in our everyday lives. Many of us find ourselves engaged in numerous online meetings and telephone conversations throughout the day. Due to the nature of work in today's world, some find themselves taking calls or attending meetings in less-than-ideal environments such as cafes, parks, or even cars. One persistent issue during virtual interactions is the prevalence of background noise coming from the other end of the call. This interference can lead to significant misunderstandings and disruptions, impairing comprehension of the speaker's intended message.
To effectively address the problem of unwanted noise in online communication, an ideal solution would involve applying noise cancellation on each user’s end, specifically on the signal captured by their microphone. This technology would effectively eliminate background sounds for each respective speaker, enhancing intelligibility and maximizing the overall online communication experience. Unfortunately, this is not the case, and not everyone has noise cancellation software, meaning they may not sound clear. This is where the inbound noise cancellation comes into play.
When delving into the topic of noise cancellation, it’s not uncommon to see terms such as speech enhancement, noise reduction, noise suppression, and speech separation used in a similar context.
In this article, we will go over inbound speech enhancement and the main challenges of applying it to online communication. But first, let's dive in and understand the difference between the inbound and outbound streams in speech enhancement terminology.
The terms inbound and outbound, in the context of speech enhancement, refer to the point in the communication pipeline at which the enhancement takes place.
Speech Enhancement On Outbound Streams
In the case of speech enhancement on the outbound stream, the algorithms are applied at the sending end after capturing the sound from the microphone but before transmitting the signal. In one of our previously published articles on Speech Enhancement, we focused mainly on the outbound use case.
Speech Enhancement On Inbound Streams
In the case of speech enhancement on the inbound stream, the algorithms are applied at the receiving end – after receiving the signal from the network but before passing it to the actual loudspeaker/headphone. Unlike outbound speech enhancement, which preserves the speaker’s voice, inbound speech enhancement has to preserve the voices of multiple speakers, all while canceling background noise.
Together, these technologies revolutionize online communication, delivering an unparalleled experience for users around the world.
Online communication is a diverse landscape, encompassing a wide array of scenarios, devices, and network conditions. As we strive to enhance the quality of audio in both outbound and inbound communication, we encounter several challenges that transcend these two realms.
Wide Range of Devices Support
Online communication takes place across an extensive spectrum of devices. The microphones in those devices range from built-in laptop microphones to high-end headphones, webcams with integrated microphones, and external microphone setups. Each of these microphones may have varying levels of quality and sensitivity. Ensuring that speech enhancement algorithms can adapt to and optimize audio quality across this device spectrum is a significant challenge. Moreover, the user experience should remain consistent regardless of the hardware in use.
Wide Range of Bandwidths Support
One critical aspect of audio quality is the signal bandwidth. Different devices and audio setups may capture and transmit audio at varying signal bandwidths. Some may capture a broad range of frequencies, while others may have limited bandwidth. Speech enhancement algorithms must be capable of processing audio signals across this spectrum efficiently. This includes preserving the essential components of the audio signal while adapting to the limitations or capabilities of the particular bandwidth, ensuring that audio quality remains as high as possible.
Strict CPU Limitations
Online communication is accessible to a broad audience, including individuals with low-end PCs or devices with limited processing power. Balancing the demand for advanced speech enhancement with these strict CPU limitations is a delicate task. Engineers must create algorithms that are both computationally efficient and capable of running smoothly on a range of hardware configurations.
Large Variety of Noises Support
Background noise in online communication can vary from simple constant noise, like the hum of an air conditioner, to complex and rapidly changing noises. Speech enhancement algorithms must be robust enough to identify and suppress a wide variety of noise sources effectively. This includes distinguishing between speech and non-speech sounds, as well as addressing challenges posed by non-stationary noises, such as music or babble noise.
As we have discussed, inbound speech enhancement can be a critical component of online communication, focusing on improving the quality of audio received by users. However, this task comes with its own set of intricate challenges that demand innovative solutions. Here, we delve into the unique challenges faced when enhancing incoming audio streams.
Multiple Speakers’ Overlapping Speech Support
One of the foremost challenges in inbound speech enhancement is dealing with multiple speakers whose voices overlap. In scenarios like group video calls or virtual meetings, participants may speak simultaneously, making it challenging to distinguish individual voices. Effective inbound speech enhancement algorithms must not only reduce background noise but also keep the overlapping speech untouched, ensuring that every participant’s voice is clear and discernible to the listener.
Diversity of Users’ Microphones
Online communication accommodates an extensive range of devices used by different speakers. Users may connect via mobile phones, car audio systems, landline phones, or a multitude of other devices. Each of these devices can have distinct characteristics, microphone quality, and signal processing capabilities. Ensuring that inbound speech enhancement works seamlessly across this diverse array of devices is a complex challenge that requires robust adaptability and optimization.
Wide Variety of Audio Codecs Support
Audio codecs are used to compress and transmit audio data efficiently over the internet. However, different conferencing applications and devices may employ various audio codecs, each with its own compression techniques and quality levels. Inbound speech enhancement must be codec-agnostic, capable of processing audio streams regardless of the codec in use, to ensure consistently high audio quality for users.
Software Processing of Various Conferencing Applications Support
Online communication occurs through a multitude of conferencing applications, each with its unique software processing and audio transmission methods. Optimally, inbound speech enhancement should be engineered to seamlessly integrate with any of these diverse applications while maintaining an uncompromised level of audio quality. This requirement is independent of any audio processing technique utilized in the application. These processes can span a wide spectrum from automatic gain control to proprietary noise cancellation solutions, each potentially introducing different levels and types of audio degradation.
Internet Issues and Lost Packets Support
Internet connectivity is prone to disruptions, leading to variable network conditions and packet loss. Inbound speech enhancement faces the challenge of coping with these issues gracefully. The algorithm must be capable of maintaining the audio quality in case of lost audio packets. The ideal solution would even be able to mitigate the damage caused by poor networks by advanced Packet Loss Concealment algorithms.
As mentioned in the Speech Enhancement blog post, creating a high-quality speech enhancement model requires deep learning methods. Moreover, in the case of inbound speech enhancement, we need to apply more diverse data augmentation, reflecting real-life inbound scenarios, to obtain high-quality training data for the deep learning model. Now, let's delve into some data augmentations specific to inbound scenarios.
Modeling of multiple speaker calls
In the quest for an effective solution to inbound speech enhancement, one crucial aspect is the modeling of multi-speaker and multi-noise scenarios. In an ideal approach, the system employs sophisticated audio augmentation techniques to generate audio mixes (see the previous article) that consist of multiple voice and noise sources. These sources are thoughtfully designed to simulate real-world scenarios, particularly those involving multiple speakers speaking concurrently, common in virtual meetings and webinars.
Through meticulous modeling, each audio source is imbued with distinct acoustic characteristics, capturing the essence of different environments and devices. These scenarios are carefully crafted to represent the challenges of online communication, where users encounter a dynamic soundscape. By training the system with these diverse multi-speaker and multi-noise mixes, it gains the capability to adeptly distinguish individual voices and suppress background noise.
Modeling of diverse acoustic conditions
Building inbound speech enhancement requires more than just understanding the mix of voices and noise; it also involves accurately representing the acoustic properties of different environments. In the suggested solution, this is achieved through Room Impulse Response (RIR) and Infinite Impulse Response (IIR) modeling.
RIR modeling involves applying filters to mimic the way sound reflects and propagates in a room environment. These filters capture the unique audio characteristics of different environments, from small meeting rooms to bustling cafes. Simultaneously, IIR filters are meticulously designed to match the specific characteristics of different microphones, replicating their distinct audio signatures. By applying these filters, the system ensures that audio enhancement remains realistic and adaptable across a wide range of settings, further elevating the inbound communication experience.
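In signal terms, this augmentation convolves each clean source with the simulated responses before mixing; for a single speech source s it can be written as

x_aug[n] = (s * h_room * h_mic)[n]

where * denotes convolution, h_room is the room impulse response, and h_mic is the filter modeling the microphone's frequency response. Noise sources are filtered the same way before being added to the training mix.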
Versatility Through Codec Adaptability
An essential aspect of an ideal solution for inbound speech enhancement is versatility in codec support. In online communication, various conferencing applications employ a range of audio codecs for data compression and transmission. These codecs can vary significantly in terms of compression efficiency and audio quality, from enterprise-specific VOIP codecs like G.711 and G.729 to high-end solutions such as OPUS and SILK.
To offer an optimal experience, the solution should be codec-agnostic, capable of seamlessly processing audio streams regardless of the codec in use. To meet this goal, raw audio signals need to be passed through various audio codecs as a codec augmentation of the training data.
Inbound speech enhancement refers to the software processing that occurs at the listener’s end in online communication. This task comes with several challenges, including handling multiple speakers in the same audio stream, adapting to diverse acoustic environments, and accommodating various devices and software preferences used by users. In this discussion, we explored a series of augmentations that can be integrated into a neural network training pipeline to address these challenges and offer an optimal solution.
Krisp licenses its SDKs to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.
Applying the Seven Testing Principles to AI Testing
With the constant growth of artificial intelligence, numerous algorithms and products are being introduced into the market on a daily basis, each with its unique functionality in various fields. Like software systems, AI systems also have their distinct lifecycle, and quality testing is a critical component of it. However, in contrast to software systems, where quality testing is focused on verifying the desired behavior generated by the logic, the primary objective in AI system quality testing is to ensure that the logic is in line with the expected behavior. Hence, the testing object in AI systems is the logic itself rather than the desired behavior. Also, unlike traditional software systems, AI systems necessitate careful consideration of tradeoffs, such as the delicate balance between overall accuracy and CPU usage.
Despite their differences, the seven principles of software testing outlined by ISTQB (International Software Testing Qualifications Board) can be easily applied to AI systems. This article will examine each principle in-depth and explore its correlation with the testing of AI systems. We will base this exposition on Krisp’s Voice Processing algorithms, which incorporate a family of pioneering technologies aimed at enhancing voice communication. Each of Krisp’s voice-related algorithms, based on machine learning paradigms, is amenable to the general testing principles we discuss below. However, for the sake of illustration and as our reference point, we will use Krisp’s Noise Cancellation technology (referred to as NC). This technology is based on a cutting-edge algorithm that operates on the device in real-time and is renowned for its ability to enhance the audio and video call experience by effectively eliminating disruptive background noises. The algorithm’s efficacy can be influenced by diverse factors, including usage conditions, environments, device specifications, and encountered scenarios. Rigorous testing of all these parameters becomes indispensable, accomplished by employing a diverse range of test datasets and conducting both subjective and objective evaluations. Therefore, it is crucial to test all these parameters, while keeping in mind the seven testing principles both,
This principle underscores that despite the effort invested in finding and fixing errors in software, it cannot guarantee that the product will be entirely defect-free. Even after comprehensive testing, defects can still persist in the system. Hence, it is crucial to continually test and refine the system to minimize the risk of defects that may cause harm to users or the organization.
This is particularly significant in the context of AI systems, as AI algorithms are in general not expected to achieve 100% accuracy. For instance, testing may reveal that an underlying AI-based algorithm, such as the NC algorithm, has issues when canceling out white noises. Even after addressing the problem for observed scenarios and improving accuracy, there is no guarantee that the issue will never resurface in other environments and be detected by users. Testing only serves to reduce the likelihood of such occurrences.
The second testing principle highlights that achieving complete test coverage by testing every possible combination of inputs, preconditions, and environmental factors is not feasible. This challenge is even more profound in AI systems since the number of possible combinations that the algorithm needs to cover is often infinite.
For instance, in the case of the NC algorithm, testing every microphone device, in every condition, with every existing noise type would be an endless endeavor. Therefore, it is essential to prioritize testing based on risk analysis and business objectives. For example, testing should focus on ensuring that the algorithm does not cause voice cutouts with the most popular office microphone devices in typical office conditions.
By prioritizing testing activities based on identified risks and critical functionalities, testing efforts can be optimized to ensure that the most important defects are detected and addressed first. This approach helps to balance the need for thorough testing with the practical realities of finite time, resources, and costs associated with testing.
This principle emphasizes the importance of conducting testing activities as early as possible in the development lifecycle, including AI products. Early testing is a cost-effective approach to identifying and resolving defects in the initial phases, reducing expenses and time, whereas detecting errors at the last minute can be the most expensive to rectify.
Ideally, testing should start during the algorithm’s research phase, way before it becomes an actual product. This can help to identify possible undesirable behavior and prevent it from propagating to the later stages. This is especially crucial since the AI algorithm training process can take weeks if not months. Detecting errors in the late stages can lead to significant release delays, impacting the project’s timeline and budget.
Testing the algorithm’s initial experimental version can help to identify stability concerns and reveal any limitations that require further refinement. Furthermore, early testing helps to validate whether the selected solution and research direction are appropriate for addressing the problem at hand. Additionally, analyzing and detecting patterns in the algorithm’s behavior assists in gaining insights into the research trajectory and potential issues that may arise.
In the context of the NC project, early testing can help to evaluate whether the algorithm works steadily when subjected to different scenarios, such as retaining considerable noise after a prolonged period of silence. Finding and addressing such patterns at an early stage ensures smoother and more efficient resolution.
The principle of defect clustering suggests that software defects and errors are not distributed randomly but instead tend to cluster or aggregate in certain areas or components of the system. This means that if multiple errors are found in a specific area of the software, it is highly likely that there are more defects lurking in that same region. This principle also applies to AI systems, where the discovery of logically undesired behavior is a red flag that future tests will likely uncover similar issues.
For example, suppose a noise cancellation algorithm struggles to cancel TV fuzz noise when presented alongside speech. In that case, further tests may reveal the same problem when the algorithm is faced with background noise from an air conditioner. Although both noises may seem alike, careful observation of their frequency-domain visual representation (spectrogram) clearly highlights the noticeable distinction.
TV Fuzz Noise
AC Noise
Credits: https://www.youtube.com/
Identifying high-risk areas that could potentially damage user experience is another way that the principle of defect clustering can be beneficial. During the testing process, these areas can be given top priority. Using the same noise cancellation algorithm as an example, the highest-risk area is the user’s speech quality, and voice-cutting issues are considered the most critical defects. If the algorithm exhibits voice-cutting behavior in one scenario, it is possible that it may also cause voice degradation in other scenarios.
The pesticide paradox is a well-known phenomenon in pest control, where applying the same pesticide to pests can lead to immunity and resistance. The same principle applies to software and AI testing, where the “pesticide” is testing, and the “pests” are software bugs. Running the same set of tests repeatedly may lead to the diminishing effectiveness of those tests as both AI and software systems adapt and become immune to them.
To avoid the pesticide paradox in software and AI testing, it’s essential to use a range of testing techniques and tools and to update tests regularly. When existing test cases no longer reveal new defects, it’s time to implement new cases or incorporate new techniques and tools.
Similarly, AI systems must be constantly monitored for the pesticide paradox, known as “overfitting.” Although practically an AI system can never be entirely free of bugs, it’s essential to update test datasets and cases, and possibly introduce new metrics or tools, once the system can easily handle existing tests.
Take the example of a noise cancellation algorithm that initially worked almost flawlessly on audios with low-intensity noise. To enhance the algorithm’s quality and discover new bugs, new datasets were created, featuring higher-intensity noises that reached speech loudness.
By avoiding the pesticide paradox in both software and AI testing, we can protect our systems and ensure they remain effective in their respective fields.
Testing methods must be tailored to the specific context in which they will be implemented. This means that testing is always context-dependent, with each software or AI system requiring unique testing approaches, techniques, and tools.
In the realm of AI, the context includes a range of factors, such as the system’s purpose, the characteristics of the user base, whether the system operates in real-time or not, and the technologies utilized.
Context is of paramount importance when it comes to developing an effective testing strategy, techniques, and tools for the NC algorithms. Part of this context involves understanding the needs of the users, which can assist in selecting appropriate testing tools such as microphone devices. As a result, an understanding of context enables the reproduction and testing of users’ real-life scenarios.
By considering context, it becomes possible to create better-designed tests that accurately represent actual use cases, leading to more reliable and accurate results. Ultimately, this approach can ensure that testing is more closely aligned with real-world situations, increasing the overall effectiveness and utility of the NC algorithm.
The notion that the absence of errors can be achieved through complete test coverage is a fallacy. In some cases, there may be unrealistic expectations that every possible test should be run and every potential case should be checked. However, the principles of testing remind us that achieving complete test coverage is practically impossible. Moreover, it is incorrect to assume that complete test coverage guarantees the success of the system: the system may function flawlessly and still be difficult to use, failing to meet users’ needs and requirements from the product perspective. The same holds for AI systems.
Consider, for instance, an NC algorithm of excellent quality that cancels out noise while maintaining speech clarity, making it virtually free of defects. If it is too complex and computationally heavy, however, it may still be useless as a product.
To sum up, AI systems have unique lifecycles and require a different approach to testing than traditional software systems. Nevertheless, the seven principles of software testing outlined by the ISTQB can be applied to AI systems. Applying these principles helps identify critical defects, prioritize testing efforts, and optimize testing activities so that the most important defects are detected and addressed first. Krisp’s algorithms have consistently shown that adhering to these principles leads to improved accuracy, higher quality, and increased reliability and safety of AI systems in various fields.
Krisp licenses its SDKs to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.
The post Applying the Seven Testing Principles to AI Testing appeared first on Krisp.
The post Can You Hear a Room? appeared first on Krisp.
Sound propagation has distinctive features associated with the environment where it happens. Human ears can often clearly distinguish whether a given sound recording was produced in a small room, a large room, or outdoors. One can even get a sense of the direction or distance of the sound source just by listening to a recording. These characteristics are determined by the objects around the listener or the recording microphone: the size and material of the room’s walls, the furniture, the people present, and so on. Every object has its own sound reflection, absorption, and diffraction properties, and together they define the way a sound propagates, reflects, attenuates, and reaches the listener.
In acoustic signal processing, one often needs a way to model the sound field in a room with certain characteristics, in order to reproduce a sound in that specific setting, so to speak. Of course, one could simply go to that room, reproduce the required sound and record it with a microphone. However, in many cases, this is inconvenient or even infeasible.
For example, suppose we want to build a Deep Neural Net (DNN)-based voice assistant in a device with a microphone that receives pre-defined voice commands and performs actions accordingly. We need to make our DNN model robust to various room conditions. To this end, we could arrange many rooms with various conditions, reproduce/record our commands in those rooms, and feed the obtained data to our model. Now, if we decide to add a new command, we would have to do all this work once again. Other examples are Virtual Reality (VR) applications or architectural planning of buildings where we need to model the acoustic environment in places that simply do not exist in reality.
In the case of our voice assistant, it would be beneficial to be able to encode and digitally record the acoustic properties of a room in some way so that we could take any sound recording and “embed” it in the room by using the room “encoding”. This would free us from physically accessing the room every time we need it. In the case of VR or architectural planning applications, the goal then would be to digitally generate a room’s encoding only based on its desired physical dimensions and the materials and objects contained in it.
Thus, we are looking for a way to capture the acoustic properties of a room in a digital record, so that we can reproduce any given audio recording as if it was played in that room. This would be a digital acoustic model of the room representing its geometry, materials, and other things that make us “hear a room” in a certain sense.
Room impulse response (RIR for short) is something that does capture room acoustics, to a large extent. A room with a given sound source and a receiver can be thought of as a black-box system. Upon receiving on its input a sound signal emitted by the source, the system transforms it and outputs whatever is received at the receiver. The transformation corresponds to the reflections, scattering, diffraction, attenuation, and other effects that the signal undergoes before reaching the receiver. Impulse response describes such systems under the assumption of time-invariance and linearity. In the case of RIR, time-invariance means that the room is in a steady state, i.e., the acoustic conditions do not change over time. For example, a room with people moving around, or a room where outside noise can be heard, is not time-invariant since the acoustic conditions change with time. Linearity means that if the input signal is a scaled superposition of two other signals, x and y, then the output signal is a similarly scaled superposition of the output signals corresponding to x and y individually. Linearity holds with sufficient fidelity in most practical acoustic environments (while time-invariance can be achieved in a controlled environment).
Let us take a digital approximation of a sound signal. It is a sequence of discrete samples, as shown in Fig. 1.
Fig. 1 The waveform of a sound signal.
Each sample is a positive or negative number that corresponds to the degree of instantaneous excitation of the sound source, e.g., a loudspeaker membrane, as measured at discrete time steps. It can be viewed as an extremely short sound, or an impulse. The signal can thus be approximately viewed as a sequence of scaled impulses. Now, given time-invariance and linearity of the system, some mathematics shows that the effect of a room-source-receiver system on an audio signal can be completely described by its effect on a single impulse, which is usually referred to as an impulse response. More concretely, impulse response is a function h(t) of time t > 0 (response to a unit impulse at time t = 0) such that for an input sound signal x(t), the system’s output is given by the convolution between the input and the impulse response. This is a mathematical operation that, informally speaking, produces a weighted sum of the delayed versions of the input signal where weights are defined by the impulse response. This reflects the intuitive fact that the received signal at time t is a combination of delayed and attenuated values of the original signal up to time t, corresponding to reflections from walls and other objects, as well as scattering, attenuation and other acoustic effects.
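In standard discrete-time signal-processing notation (not specific to any particular room), this relationship can be written as

y[n] \;=\; (h * x)[n] \;=\; \sum_{k \ge 0} h[k]\, x[n - k],

so each output sample is a weighted sum of the current and past input samples, with the weights given by the impulse response h.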
For example, in the recordings below one can hear a RIR captured with a clapping sound, an anechoic recording of singing, and their convolution.
RIR
Singing anechoic
Singing with RIR
It is often useful to consider sound signals in the frequency domain, as opposed to the time domain. It is known from Fourier analysis that every well-behaved periodic function can be expressed as a sum (infinite, in general) of scaled sinusoids. The sequence of the (complex) coefficients of sinusoids within the sum, the Fourier coefficients, provides another, yet equivalent representation of the function. In other words, a sound signal can be viewed as a superposition of sinusoidal sound waves or tones of different frequencies, and the Fourier coefficients show the contribution of each frequency in the signal. For finite sequences of practical interest, such as digital audio, such decompositions into periodic waves can be computed efficiently via the Fast Fourier Transform (FFT).
For non-stationary signals such as speech and music, it is more instructive to analyze them with the short-time Fourier transform (STFT). Here, we split the signal into short, equal-length segments and compute the Fourier transform of each segment. This shows how the frequency content of the signal evolves with time (see Fig. 2). That is, while the signal waveform and the Fourier transform give us only time or only frequency information about the signal (although each is recoverable from the other), the STFT provides something in between.
Fig. 2 Spectrogram of a speech signal.
A visual representation of an STFT, such as the one in Fig. 2, is called a spectrogram. The horizontal and vertical axes show time and frequency, respectively, while the color intensity represents the magnitude of the corresponding Fourier coefficient on a logarithmic scale (the brighter the color, the larger the magnitude of that frequency at the given time).
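For readers who want to reproduce such a picture, here is a small Python sketch using SciPy; the file name and the STFT window parameters are placeholder assumptions.

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import stft

sample_rate, signal = wavfile.read("speech.wav")   # placeholder file; assumes a mono recording
signal = signal.astype(np.float64)

# Short-time Fourier transform: ~32 ms Hann windows with 50% overlap at 16 kHz.
freqs, times, Z = stft(signal, fs=sample_rate, nperseg=512, noverlap=256)

# Magnitude on a logarithmic (dB) scale, as in Fig. 2.
plt.pcolormesh(times, freqs, 20 * np.log10(np.abs(Z) + 1e-10), shading="gouraud")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()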
In theory, the impulse response of a system can be measured by feeding it a unit impulse and recording whatever comes at the output with a microphone. Still, in practice, we cannot produce an instantaneous and powerful audio signal. Instead, one could record RIR approximately by using short impulsive sounds. One could use a clapping sound, a starter gun, a balloon popping sound, or the sound of an electric spark discharge.
Fig. 3 The spectrogram and waveform of a RIR produced by a clapping sound.
The results of such measurements (see, for example, Fig. 3) may not be sufficiently accurate for a particular application, due to the error introduced by the structure of the input signal. An ideal impulse, in some mathematical sense, has a flat spectrum, that is, it contains all frequencies with equal magnitude. The impulsive sounds above usually deviate significantly from this property. Measurements with such signals may also be poorly reproducible. Alternatively, a digitally created impulsive sound with desired characteristics could be played with a loudspeaker, but the power of the signal would still be limited by speaker characteristics. Among other limitations of measurements with impulsive sounds are: particular sensitivity to external noise (from outside the room), sensitivity to nonlinear effects of the recording microphone or emitting speaker, and the directionality of the sound source.
Fortunately, there are more robust methods of measuring room impulse response. The main idea behind these techniques is to play a transformed impulsive sound with a speaker, record the output, and apply an inverse transform to recover impulse response. The rationale is, since we cannot play an impulse as it is with sufficient power, we “spread” its power across time, so to speak, while maintaining the flat spectrum property over a useful range of frequencies. An example of such a “stretched impulse” is shown in Fig. 4.
Fig. 4 A “stretched” impulse.
Other variants of such signals are Maximum Length Sequences and Exponential Sine Sweeps. An advantage of measuring with such non-localized, reproducible test signals is that ambient noise and microphone nonlinearities can be effectively averaged out. Some technicalities still need to be dealt with: synchronizing the emitting and recording ends, ensuring that the test signal covers the whole length of the impulse response, and performing the deconvolution itself, that is, applying an inverse transform to recover the impulse response.
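As a concrete illustration of the sweep-based approach (Farina’s exponential sine sweep), here is a hedged Python sketch; the sweep parameters are arbitrary, and the “recorded” signal is only a stand-in for an actual playback-and-capture step in a real room.

import numpy as np
from scipy.signal import fftconvolve

fs = 48000                  # sample rate, Hz
T = 10.0                    # sweep duration, seconds
f1, f2 = 20.0, 20000.0      # start / end frequencies of the sweep, Hz

t = np.arange(int(T * fs)) / fs
R = np.log(f2 / f1)

# Exponential sine sweep: instantaneous frequency grows exponentially from f1 to f2.
sweep = np.sin(2 * np.pi * f1 * T / R * (np.exp(t * R / T) - 1))

# Inverse filter: time-reversed sweep with an amplitude envelope that compensates
# for the sweep spending more energy at low frequencies.
inverse = sweep[::-1] * np.exp(-t * R / T)

# In a real measurement, `recorded` is the microphone capture of the sweep played
# back in the room; here the dry sweep itself is used as a stand-in.
recorded = sweep

# Deconvolution: convolving the recording with the inverse filter recovers the RIR
# (up to scaling, centered around a known delay of len(sweep) samples).
rir = fftconvolve(recorded, inverse)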
The waveform in Fig. 5 shows another measured RIR. The initial spike at 0-3 ms corresponds to the direct sound that arrived at the microphone along a direct path. The smaller spikes that follow, starting about 3-5 ms after the first spike, clearly show several early specular reflections. After about 80 ms there are no distinctive specular reflections left, and what we see is the late reverberation, or the reverberant tail, of the RIR.
Fig. 5 A room impulse response. Time is shown in seconds.
While the spectrogram of a RIR may not look very informative beyond the remarks so far, there is some information one can extract from it. It shows, in particular, how the intensity of different frequencies decreases with time due to losses. For example, it is known that intensity loss due to air absorption (attenuation) is stronger for higher frequencies. At low frequencies, the spectrogram may exhibit distinct persistent frequency bands, room modes, that correspond to standing waves in the room. This effect can be seen below a certain frequency threshold depending on the room geometry, the Schroeder frequency, which for most rooms is 20 – 250 Hz. Those modes are visible due to the lower density of resonant frequencies of the room near the bottom of the spectrum, with wavelengths comparable to the room dimensions. At higher frequencies, the modes overlap more and more and are no longer distinctly visible.
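A commonly used rule of thumb (not derived in this post) estimates this threshold from the reverberation time RT60, in seconds, and the room volume V, in cubic meters:

f_{\mathrm{Schroeder}} \approx 2000\,\sqrt{RT_{60} / V}\ \text{Hz}.

For instance, a 50 m³ room with an RT60 of 0.5 s gives roughly 200 Hz, consistent with the range quoted above.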
RIR can also be used to estimate certain parameters associated with a room, the most well-known being the reverberation time, or RT60. When an active sound source in a room is abruptly stopped, it takes a longer or shorter time for the sound intensity to drop to a certain level, depending on the room’s geometry, materials, and other factors. In the case of RT60, the question is how long it takes for the sound energy density to decrease by 60 decibels (dB), that is, to a millionth of its initial value. As noted by Schroeder (see the references), the average signal energy at time t used for computing reverberation time is proportional to the tail energy of the RIR, that is, the total energy after time t. Thus, we can compute RT60 by plotting the tail energy level of the RIR on a dB scale (with respect to the total energy). For example, the plot corresponding to the RIR above is shown in Fig. 6:
Fig. 6 The RIR tail energy level curve.
In theory, the RIR tail energy decay should be exponential, that is, linear on a dB scale, but, as can be seen here, it drops irregularly starting at about -25 dB. This is due to RIR measurement limitations. In such cases, one restricts attention to the linear part, normally between -5 dB and -25 dB, fits a line to the decay curve on the logarithmic scale (by linear regression, for example), and obtains RT60 by extrapolating that line to a 60 dB drop.
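Assuming the RIR is available as a NumPy array, the whole procedure fits in a few lines of Python; the -5 dB and -25 dB thresholds follow the discussion above and can be adjusted.

import numpy as np

def estimate_rt60(rir, fs, top_db=-5.0, bottom_db=-25.0):
    """Estimate RT60 from an impulse response via backward (Schroeder) integration."""
    energy = np.asarray(rir, dtype=np.float64) ** 2
    # Tail energy after each time instant, normalized by the total energy.
    tail = np.cumsum(energy[::-1])[::-1]
    decay_db = 10 * np.log10(tail / tail[0] + 1e-30)

    # Fit a line only to the reliable part of the decay curve (-5 dB .. -25 dB).
    mask = (decay_db <= top_db) & (decay_db >= bottom_db)
    t = np.arange(len(energy)) / fs
    slope, intercept = np.polyfit(t[mask], decay_db[mask], 1)

    # Extrapolate the fitted line to a 60 dB drop.
    return -60.0 / slope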
As mentioned in the introduction, one often needs to compute a RIR for a room with given dimensions and material specifications without physically building the room. One way of achieving this would be by actually building a scaled model of the room. Then we could measure the RIR by using test signals with accordingly scaled frequencies, and rescale the recorded RIR frequencies. A more flexible and cheaper way is through computer simulations, by building a digital model of the room and modeling sound propagation. Sound propagation in a room (or other media) is described with differential equations called wave equations. However, the exact solution of these equations is out of reach in most practical settings, and one has to resort to approximate methods for simulations.
While there are many approaches for modeling sound propagation, the most common simulation algorithms are based on either geometrical simplification of sound propagation or element-based methods. Element-based methods, such as the Finite Element method, rely on numerical solution of wave equations over a discretized space. For this purpose, the room space is approximated with a discrete grid or a mesh of small volume elements. Accordingly, functions describing the sound field (such as the sound pressure or density) are defined down to the level of a single volume element. The advantage of these methods is that they are more faithful to the wave equations and hence more accurate. However, the computational complexity of element-based methods grows rapidly with frequency, as higher frequencies require a higher mesh resolution (smaller volume elements). For this reason, for wideband applications like speech, element-based methods are often used to model sound propagation only for low frequencies, say, up to 1 kHz.
Geometric methods, on the other hand, work in the time domain. They model sound propagation in terms of sound rays or particles with intensity decreasing with the squared path length from the source. As such, wave-specific interference between rays is abstracted away. Thus rays effectively become sound energy carriers, with the sound energy at a point being computed by the sum of the energies of rays passing through that point. Geometric methods give plausible results for not-too-low frequencies, e.g., above the Schroeder frequency. Below that, wave effects are more prominent (recall the remarks on room modes above), and geometric methods may be inaccurate.
The room geometry is usually modeled with polygons. Walls and other surfaces are assigned absorption coefficients that describe the fraction of incident sound energy they absorb; the rest is reflected back into the room, while the absorbed part is simply “lost” from the simulation perspective. One may also need to model air absorption, as well as sound scattering by rough surfaces whose features are not too small compared to the sound wavelengths.
Two well-known geometric methods are stochastic Ray Tracing and Image Source methods. In Ray Tracing, a sound source emits a (large) number of sound rays in random directions, also taking into account directivity of the source. Each ray has some starting energy. It travels with the speed of sound and reflects from the walls while losing energy with each reflection, according to the absorption coefficients of walls, as well as due to air absorption and other losses.
Fig. 7 Ray Tracing (only wall absorption shown).
The reflections are either specular (the incident and reflected angles are equal) or scattered, the latter usually modeled by a random reflection direction. The receiver registers the remaining energy, the time, and the angle of arrival of each ray that hits its surface. Time is tracked in discrete intervals, so one gets an energy histogram corresponding to the RIR, with a bucket for each time interval. To synthesize the temporal structure of the RIR, a random Poisson-distributed sequence of signed unit impulses can be generated and then scaled according to the energy histogram obtained from the simulation. For psychoacoustic reasons, one may want to treat different frequency bands separately; in that case, the scaling is applied to band-passed versions of the impulse sequence, and their sum is taken as the final RIR.
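The impulse-synthesis step can be sketched in Python as follows; the energy histogram here is a dummy exponential decay standing in for a real ray-tracing output, and the quadratic growth of impulse density is a crude assumption about how reflection density grows with time.

import numpy as np

rng = np.random.default_rng(0)

fs = 16000                        # sample rate of the synthesized RIR, Hz
bin_len = 0.001                   # histogram bin length, seconds
n_bins = 500                      # 0.5 s of reverberation

# Stand-in for the simulation output: an exponentially decaying energy histogram.
histogram = np.exp(-np.arange(n_bins) * bin_len / 0.1)

# Randomly placed signed unit impulses whose density grows with time,
# mirroring the growing density of reflections in a real room.
n_samples = int(n_bins * bin_len * fs)
t = np.arange(n_samples) / fs
density = np.minimum(1.0, 1e4 * t ** 2)
dirac = np.where(rng.random(n_samples) < density,
                 rng.choice([-1.0, 1.0], n_samples), 0.0)

# Scale the impulse sequence so its energy in each bin matches the histogram.
rir = np.zeros(n_samples)
spb = int(bin_len * fs)           # samples per bin
for b in range(n_bins):
    seg = slice(b * spb, (b + 1) * spb)
    seg_energy = np.sum(dirac[seg] ** 2)
    if seg_energy > 0:
        rir[seg] = dirac[seg] * np.sqrt(histogram[b] / seg_energy)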
The Image Source method models only specular reflections (no scattering). In this case, a reflected ray from a source towards a receiver can be replaced with rays coming from “mirror images” of the source with respect to the reflecting wall, as shown in Fig. 8.
Fig. 8 The Image Source method.
This way, instead of keeping track of reflections, we construct images of the source relative to each wall and consider straight rays from all sources (including the original one) to the receiver. These first-order images cover single reflections. For rays that reach the receiver after two reflections, we construct the images of the first-order images, call them second-order images, and so on, recursively. For each reflection, we can also incorporate material absorption losses, as well as air absorption. The final RIR is constructed by treating each ray as an impulse that is scaled according to absorption and distance-based energy losses, and delayed according to the path length (a distance-based phase shift for each frequency component). Before that, we need to filter out invalid image sources, those for which the image-receiver path does not intersect the corresponding reflection wall or is blocked by other walls.
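To make the idea concrete, here is a simplified Python sketch of the Image Source method for an empty rectangular (“shoebox”) room with a single, frequency-independent reflection coefficient shared by all walls; real implementations (such as Allen and Berkley’s) add fractional delays, per-band absorption, and, for non-convex geometries, the validity checks mentioned above, which an empty shoebox does not need.

import numpy as np
from itertools import product

def shoebox_rir(room, source, receiver, beta=0.9, fs=16000, max_order=8, c=343.0):
    """Simplified image-source RIR for an empty rectangular room.

    room, source, receiver -- length-3 sequences of coordinates in meters
    beta                   -- frequency-independent wall reflection coefficient
    """
    room = np.asarray(room, dtype=float)
    src = np.asarray(source, dtype=float)
    rcv = np.asarray(receiver, dtype=float)

    max_dist = (2 * max_order + 2) * float(np.max(room)) * np.sqrt(3)
    rir = np.zeros(int(fs * max_dist / c) + 1)

    shifts = range(-max_order, max_order + 1)
    for q in product((0, 1), repeat=3):          # even/odd mirroring along each axis
        for m in product(shifts, repeat=3):      # how many room lengths to shift
            q_arr, m_arr = np.array(q), np.array(m)
            image = (1 - 2 * q_arr) * src + 2 * m_arr * room
            n_reflections = np.sum(np.abs(m_arr - q_arr) + np.abs(m_arr))
            dist = max(np.linalg.norm(image - rcv), 1e-3)
            delay = int(round(fs * dist / c))    # integer delay; real code uses fractional delays
            if delay < len(rir):
                # Each image contributes an attenuated, delayed impulse.
                rir[delay] += beta ** n_reflections / (4 * np.pi * dist)
    return rir

For instance, shoebox_rir((5, 4, 3), (1, 1, 1.5), (3.5, 2, 1.5)) should yield an RIR whose first few spikes correspond to the direct path and the first-order images, qualitatively similar to the early part of the measured RIR in Fig. 5.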
While the Image Source method captures specular reflections, it does not model scattering, which is an important aspect of the late reverberant part of a RIR, nor does it model wave-based effects. More generally, each existing method has its advantages and shortcomings. Fortunately, the shortcomings of different approaches are often complementary, so it makes sense to use hybrid models that combine several of the methods described above. For modeling late reverberation, stochastic methods like Ray Tracing are more suitable, although they may be too imprecise for modeling the early specular reflections of a RIR. One could further rely on element-based methods like the Finite Element method for modeling the RIR below the Schroeder frequency, where wave-based effects are more prominent.
Room impulse response (RIR) plays a key role in modeling acoustic environments. When developing voice-related algorithms, be it for voice enhancement, automatic speech recognition, or something else, we at Krisp need to keep in mind that these algorithms must be robust to changes in acoustic settings. This is usually achieved by incorporating the acoustic properties of various room environments, as briefly discussed here, into the design of the algorithms. The result is a seamless experience for our users, largely independent of the room from which Krisp is being used: they don’t hear the room.
Krisp licenses its SDKs to embed directly into applications and devices. Learn more about Krisp’s SDKs and begin your evaluation today.
The post Can You Hear a Room? appeared first on Krisp.