VOC is made up of white speakers from Sacramento and Humboldt County, California.

Figure: The word error rate (WER) for ASR systems developed by Apple, Amazon, Google, IBM, and Microsoft. The red dashed line is the average WER for black speakers and the blue dashed line is the average WER for white speakers.

“We use the APIs provided by each of the service providers which are paid services generally for commercial use,” Allison Koenecke, first author of the paper and a PhD student at Stanford University’s Institute for Computational and Mathematical Engineering, told The Register. “We do not use the firms’ consumer-facing voice assistants because they do not provide a straightforward way to obtain batch transcriptions of about 40 hours of audio files, and could require playing audio out loud – adding an additional source of error – as opposed to directly obtaining audio from a [...]”

Speech-to-text systems are broken into two parts: a language model trained on text and an acoustic model trained on sounds. Black speakers are more likely to use African American Vernacular English (AAVE), a style of English that has its own grammatical rules and vocabulary. The researchers believe the acoustic models in the ASR systems from all five companies were simply not trained on enough audio data from AAVE speakers to recognize what they were saying.
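
For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the machine transcript into the human reference, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. It is not the researchers' code: the function name `word_error_rate` is hypothetical, and the lowercasing and whitespace tokenization are simplifying assumptions that may differ from the text normalization used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Assumption: words are compared after lowercasing and splitting on
    whitespace; a real evaluation would likely normalize text more carefully.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion: reference word missing
                curr[j - 1] + 1,                  # insertion: extra hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example sentences, not data from the study.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, a WER computed this way can exceed 1.0 when the transcript contains many more words than the reference.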