Research

Aavaaz Expert-Led Validation:

Rigorous Human-in-the-Loop Validation to Ensure Clinical-Grade Accuracy & Precision

February 10, 2025

Introduction

Rigorous validation of accuracy, safety, and contextual appropriateness is essential to building clinical-grade multilingual speech technology for healthcare – a benchmark Aavaaz takes incredibly seriously, with its founder and CEO being a clinician herself. Rather than using human feedback only during development or early validation and then transitioning to AI-only operation (as is the case with most translation technology), Aavaaz intentionally and continually leverages this strong community across iterative phases of core model development, new feature testing, and future innovation planning. This is what sets Aavaaz apart as a leader in conversational and domain-specific multilingual voice technology.

To challenge the poor reputation of other machine translation (MT) solutions, which still struggle with domain-specific terminology and real-world conversational complexity in medical settings, Aavaaz not only developed robust technology with superior performance, but also implemented a rigorous human-in-the-loop (HITL) validation process to address hesitancy around safe and effective adoption in clinical settings. This validation framework has integrated an expert network of linguists and multilingual professionals in cyclical human panel reviews, simulated clinical testing, and real-world clinical use to ensure robust, high-fidelity, and safe translations in diverse medical scenarios.

Expert Insight Panel

To ensure a strong and trusted foundation for technical development, Aavaaz intentionally chose to collaborate with a range of expert voices from the very beginning, so that deep and detailed insights could guide development. This expert panel included linguists, certified medical interpreters, multilingual healthcare professionals, and native-speaking community members across the globe. They remained key voices throughout the phases listed below, each of which is elaborated in the sections that follow.

  1. Pre-Development Insight Collection
  2. Human-in-the-Loop Iterative Development
  3. Phase 1 Validation & Simulated Testing
  4. Phase 2 Validation & Clinical Testing
  5. Longitudinal Validation

Pre-Development Insight Collection

Prior to development, it was key to first understand prominent error patterns of currently available machine translation software in order to form an evidence-based approach to improved development. The expert panel was formed and first provided core insights on key error categories in currently available machine translation systems (mainly Google and GPT systems were assessed during this initial phase). The categories included medical or colloquial mistranslation, cultural misrepresentation, emotional unawareness, and contextual inaccuracy, among other core deficiencies stemming from the lack of clinical and conversational appropriateness considered when these technologies were developed. The error patterns were then analyzed by leveraging a broad range of perspectives – from core linguistic experts and advanced medically trained professionals to generalized voices of potential patients represented by native-speaking community members [Aavaaz Allies].

The following are examples of error categories that were collected and analyzed.

Medical Appropriateness & Accuracy:

  1. Inaccurate translations of both common & complex medical terminology across languages and clinical specialties, limiting use in healthcare settings
  2. Loss of contextual meaning in colloquial and culturally specific expressions, compromising the ability to accurately understand medical terminology and phrases
  3. Lack of natural-sounding translations, as would come from a true native speaker
  4. Errors in grammatical structure and sentence formulation that impact clarity
  5. Ambiguity in medical content expressed, which could compromise patient safety

General Conversational Accuracy:

  1. Inability to understand translations due to lack of conversational tone
  2. Complete mistranslation causing confusion and misunderstanding

These expert assessments guided the initial phases of analysis, insight gathering, and strategic planning, which then enabled focused, evidence-led technical development.

Human-in-the-Loop Iterative Development

A human-in-the-loop feedback cycle was a central and non-negotiable component of the technical development process. Medical professionals, linguists, and bilingual experts reviewed the system's output in real-time, focusing on translation errors, contextual misinterpretations, and tone inconsistencies. Annotated and verbal feedback was used to iteratively refine the system, ensuring that it met the nuanced requirements of clinical communication. This iteration cycle included the following components, in varying order depending on the development status of the language in question at the time.

  1. Blinded review of Aavaaz translation model outputs versus industry leading MT systems
    1. Reviewers were given blinded samples of each model’s translation output in a multiple-choice fashion to ensure unbiased results – they were asked to choose the best output based on the specific criteria being assessed at the time (medical accuracy, grammar, conversational nature, tone, etc.) as well as provide insight on their choice.
  2. Annotated feedback via Aavaaz testing interface
    1. Reviewers were given a set of sentences across medical scenarios and clinical specialties, which they were asked to mark as correct or incorrect in assessing both speech recognition and translation. If incorrect, reviewers tagged the specific incorrect words and provided corrected outputs for the Aavaaz AI team to analyze.
  3. Live, Verbal Feedback with AI & Engineering Team
    1. The expert reviewers who participated in a live review provided further in-depth insights in conversation with the AI and engineering teams. This included uncovering the core linguistic origins of errors, such as incorrect syllables, words, or grammatical structures. The live session also allowed real-time testing and validation to identify areas for further optimization.
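The annotated-feedback step above can be pictured as a simple data pipeline: each reviewer judgment becomes a record, and records are aggregated to prioritize retraining targets. The sketch below is purely illustrative – the record fields and aggregation are hypothetical, not Aavaaz's actual schema.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class AnnotationRecord:
    """One reviewer judgment on a single translated sentence (hypothetical schema)."""
    sentence_id: str
    reviewer_id: str
    correct: bool                                       # overall verdict
    tagged_words: list = field(default_factory=list)    # words flagged as wrong
    corrected_output: str = ""                          # reviewer-supplied fix

def summarize(records):
    """Aggregate reviewer feedback to surface the most-flagged terms."""
    total = len(records)
    errors = [r for r in records if not r.correct]
    word_counts = Counter(w for r in errors for w in r.tagged_words)
    return {
        "error_rate": len(errors) / total if total else 0.0,
        "most_flagged": word_counts.most_common(3),
    }

records = [
    AnnotationRecord("s1", "rev1", True),
    AnnotationRecord("s2", "rev1", False, ["dosage"], "Take one dose daily."),
    AnnotationRecord("s2", "rev2", False, ["dosage"], "Take one dose per day."),
    AnnotationRecord("s3", "rev2", True),
]
print(summarize(records))  # error_rate 0.5; "dosage" flagged twice
```

Aggregating across reviewers like this turns individual corrections into a ranked list of terms for the AI team to focus on in the next retraining cycle.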

Phase 1 Validation & Simulated Testing

After multiple iterations of model training, human feedback, and retraining demonstrated strong consistency of acceptable outputs, final model performance was measured using BLEU and METEOR scores – two industry-standard metrics for assessing machine translation accuracy. If the scores did not reach an acceptable level, the training process continued. If they did, Phase 1 validation and testing began with simulated scenarios.
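To make the metric concrete, here is a minimal sketch of sentence-level BLEU – clipped n-gram precision combined with a brevity penalty. This is the idea behind the score, not Aavaaz's implementation; production pipelines would typically use an established library such as sacrebleu, and METEOR additionally credits stems, synonyms, and word order.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions x brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_ngrams & ref_ngrams).values())   # clipped counts
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)        # smooth zero counts
    # Brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("take one tablet twice daily", "take one tablet twice daily"))  # 1.0
print(bleu("take tablet daily", "take one tablet twice daily"))            # near 0
```

A perfect match scores 1.0, while a fluent-looking omission (dropping "one" and "twice" from a dosing instruction) collapses the score – which is exactly why omission-type errors were treated as high-risk in the phases below.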

To capture a broad range of real-world situations and ensure generalizability of results, Phase 1 testing was conducted across a set of 6 common clinical conversation types and 10 medical specialties, as listed below:

Conversation Types:

  1. Initial Patient Visit
  2. Treatment Discussion
  3. Medical Team Rounds
  4. Medication Reconciliation
  5. Patient Education
  6. Discharge Instructions

Medical Specialties:

  1. Emergency Medicine
  2. Cardiology
  3. Oncology
  4. General Inpatient Medicine
  5. Ambulatory Care
  6. Allergy & Immunology
  7. Ophthalmology
  8. Endocrinology
  9. Physical Medicine
  10. Surgery

Expert panelists used Aavaaz technology across the 6 conversation types and 10 medical specialties by reading out sample scripts and bi-directional conversations that closely mimicked real-world discussions between patients and care teams. The feedback was then analyzed in the same fashion as in the Human-in-the-Loop Iterative Development phase.
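Crossing the two lists above yields the full simulated-testing matrix – 6 conversation types by 10 specialties gives 60 scenario combinations per language pair. A small sketch of enumerating that matrix (the dictionary shape is illustrative, not an Aavaaz artifact):

```python
from itertools import product

conversation_types = [
    "Initial Patient Visit", "Treatment Discussion", "Medical Team Rounds",
    "Medication Reconciliation", "Patient Education", "Discharge Instructions",
]
specialties = [
    "Emergency Medicine", "Cardiology", "Oncology", "General Inpatient Medicine",
    "Ambulatory Care", "Allergy & Immunology", "Ophthalmology", "Endocrinology",
    "Physical Medicine", "Surgery",
]

# Every (conversation type, specialty) pair becomes one simulated scenario
scenarios = [
    {"conversation": c, "specialty": s}
    for c, s in product(conversation_types, specialties)
]
print(len(scenarios))  # 60
```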

Phase 2 Validation & Clinical Testing

Once Phase 1 testing was completed and the clinical scenarios were passed without any errors that could cause patient harm, Phase 2 validation began. Notably, this phase of clinical testing only began after consistent non-clinical and simulated testing was achieved without high-risk errors such as complete mistranslation or omission.
An overview of this clinical validation phase is as follows:

  • Use by native-speaking bilingual medical professionals in “low risk” patient interactions
  • Further validation by native-speaking users across medical settings
  • Extension of use to non-native, non-bilingual medical professionals
  • Gradual adoption in “moderate” and “high” risk patient interactions once validation received from native and non-native medical users

During Phase 2, clinical testing began with “low risk” patient interactions. While it is difficult to term any interaction “low risk” given the inherent risk of miscommunication whenever there is a language barrier with a patient, this phase was carried out intentionally and carefully by choosing settings in which very basic medical terminology was used (such as patient intake, appointment reminders, and basic lab visits), and by starting in situations where the medical professional was themselves a native speaker of the language in question. For example, in one of the first urgent care and clinic settings tested, the nurses and care team members were native Spanish and Mandarin speakers – the native-speaking team used Aavaaz while in conversation with patients, so accuracy could be assessed and validated in real-time. After gaining depth and confidence across “low risk” use with native-speaking medical professionals, testing was extended to non-native users and to “moderate risk” interactions such as general inpatient or clinic visits, and then to “high risk” settings such as emergency rooms and the cardiology ICU.

Longitudinal Validation

While robust accuracy and validation were achieved prior to clinical roll-out, continued longitudinal validation is critical to ensuring the highest degree of precision. Aavaaz therefore continues to engage expert reviewers under strict validation measures throughout core model development, new feature testing, and future innovation planning – rather than limiting human feedback to development and early validation, as most AI-only companies do. This approach sets Aavaaz apart as a leader in conversational and domain-specific multilingual voice technology.

Conclusion

The Aavaaz validation process ensures clinical-grade accuracy, contextual understanding, and real-world applicability of multilingual speech-to-speech technology across a variety of clinical settings. This rigorous methodology underscores the critical role of human oversight in ambient voice solutions for healthcare, especially where language and translation are involved. It also highlights the seriousness with which Aavaaz applies iterative human expertise, simulated testing, and phased clinical validation at every stage of development to ensure safe, precise, and trusted technology.

This structured framework is the foundation of Aavaaz technology and stands to deliver clinical reliability and trust as the passion-driven mission continues: providing confidence and seamless experiences for healthcare providers, medical interpreters, and vulnerable non-English-speaking patient populations.