Machine learning improves Arabic speech transcription capabilities


Thanks to advances in speech and natural language processing, there is hope that one day you will be able to ask your virtual assistant for the best recipe ingredients. Today it is already possible to ask your home device to play music or turn on with a voice command, a feature built into some devices.

If you speak Moroccan, Algerian, Egyptian, Sudanese, or any of the other dialects of Arabic, which vary widely from region to region and are in some cases mutually unintelligible, it's a different story. If your native tongue is Arabic, Finnish, Mongolian, Navajo, or any other language with highly complex morphology, you may feel left out.

This complexity piqued Ahmed Ali's curiosity about a solution. Ali is a principal engineer in the Arabic Language Technologies group at the Qatar Computing Research Institute (QCRI), part of Qatar Foundation's Hamad Bin Khalifa University, and the founder of ArabicSpeech, "a community that exists for the benefit of Arabic speech science and speech technologies."

Qatar Foundation headquarters

Ali became fascinated with the idea of talking to cars, appliances, and gadgets several years ago while at IBM. "Can we build a machine capable of understanding different dialects – an Egyptian pediatrician automating a prescription, a Syrian teacher helping children grasp the core parts of their lesson, or a Moroccan chef describing the best couscous recipe?" he asks. However, the algorithms that power these machines cannot yet sift through the roughly 30 varieties of Arabic, let alone make sense of them. Today, most speech recognition tools operate only in English and a handful of other languages.

The coronavirus pandemic has deepened an already growing reliance on voice technologies, with natural language processing helping people comply with stay-at-home guidelines and physical distancing measures. But while we have been using voice commands to make e-commerce purchases and manage our homes, the future holds even more applications.

Millions of people around the world use massive open online courses (MOOCs) for their open access and unlimited participation. Speech recognition is a key feature of MOOCs: it lets students search the spoken content of courses and enables translation via subtitles. Speech technology also allows lectures to be digitized so that spoken words appear as text in university classrooms.

Ahmed Ali, Hamad Bin Khalifa University

According to a recent article in Speech Technology, the voice and speech recognition market is expected to reach $26.8 billion by 2025 as millions of consumers and businesses around the world come to rely on voice bots not only to interact with their devices or cars, but also to improve customer service, drive health-care innovation, and improve accessibility and inclusion for those with hearing, speech, or motor impairments.

In a 2019 survey, Capgemini predicted that by 2022 more than two out of three consumers would opt for voice assistants instead of visits to stores or bank branches; a share that could justifiably climb, given the socially distanced home life and commerce that the pandemic has imposed on the world for more than a year and a half.

Nevertheless, these devices fail to reach large swaths of the globe. For those roughly 30 varieties of Arabic and their millions of speakers, that is a largely missed opportunity.

Arabic for machines

Voice bots that speak English or French are far from perfect. Yet teaching machines to understand Arabic is particularly tricky for several reasons. Three commonly recognized challenges stand out:

  1. Lack of diacritics. Arabic dialects are vernaculars, primarily spoken rather than written. Most of the available text is non-diacritized, meaning it lacks the accents, such as the acute (´) or grave (`), that indicate the sound values of letters. It is therefore difficult to determine where the vowels go (see the short sketch after this list).
  2. Lack of resources. There is a dearth of labeled data for the different Arabic dialects. Collectively, they also lack standardized orthographic rules that dictate how the language is written, including norms for spelling, hyphenation, word breaks, and emphasis. These resources are crucial for training computer models, and the fact that there are so few of them has hobbled the development of Arabic speech recognition.
  3. Morphological complexity. Arabic speakers engage in a lot of code-switching. For example, in areas colonized by the French – Morocco, Algeria, and Tunisia in North Africa – the dialects include many French loanwords. As a result, there is a large number of so-called out-of-vocabulary words, which speech recognition technologies cannot make sense of because these words are not Arabic.
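To make the first challenge concrete, the sketch below strips the Arabic short-vowel marks (tashkeel) from two different words and shows that they collapse onto the same undiacritized spelling. It is a minimal Python illustration; the example words are chosen purely for demonstration and are not from the article.

```python
# Minimal sketch: why missing diacritics create ambiguity.
# Stripping the Arabic short-vowel marks (tashkeel, U+064B-U+0652)
# collapses distinct words onto one undiacritized form.
import re

TASHKEEL = re.compile(r"[\u064B-\u0652]")  # fathatan ... sukun

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks, leaving only the base letters."""
    return TASHKEEL.sub("", text)

kataba = "كَتَبَ"   # kataba, "he wrote"
kutub = "كُتُب"     # kutub, "books"

# Both reduce to the bare letters كتب, so a system that sees only
# undiacritized text cannot tell which vowels (and which word) were meant.
print(strip_diacritics(kataba) == strip_diacritics(kutub))  # True
```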

"But the field is moving at lightning speed," Ali says, and it is a collaborative effort among many researchers to make it move even faster. Ali's Arabic Language Technologies lab is leading the ArabicSpeech project to bring together Arabic translations with the dialects native to each region. For example, Arabic dialects can be divided into four regional groups: North African, Egyptian, Gulf, and Levantine. However, since dialects do not respect borders, the divisions can go as fine-grained as one dialect per city; a native Egyptian speaker, for instance, can distinguish their Alexandrian dialect from that of a compatriot from Aswan, 1,000 kilometers away on the map.

Building a tech-savvy future for everyone

At this point, machines are almost as accurate as human transcribers, thanks in large part to advances in deep neural networks, a subfield of machine learning in artificial intelligence that relies on algorithms inspired by how the human brain works, biologically and functionally. Until recently, however, speech recognition was somewhat pieced together: the technology has a history of relying on separate modules for acoustic modeling, building pronunciation lexicons, and language modeling, each of which has to be trained separately. More recently, researchers have been training models that convert acoustic features directly into text transcriptions, potentially optimizing all parts for the final task.
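To illustrate the difference, the sketch below shows what a single end-to-end model trained with a CTC objective looks like: one network maps acoustic features straight to character probabilities, and one loss trains every component jointly. It is a minimal PyTorch illustration only; the architecture, feature sizes, and vocabulary are assumptions and do not describe QCRI's actual system.

```python
# Minimal sketch of an end-to-end speech recognizer trained with CTC loss.
# Illustrative only: feature sizes, vocabulary, and architecture are assumptions.
import torch
import torch.nn as nn

class TinySpeechRecognizer(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256, vocab_size: int = 40):
        super().__init__()
        # Recurrent encoder over log-mel frames; bidirectional to use context.
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # One linear layer maps each frame to character (plus CTC blank) scores.
        self.classifier = nn.Linear(2 * hidden, vocab_size)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, n_mels) -> logits: (batch, time, vocab_size)
        encoded, _ = self.encoder(features)
        return self.classifier(encoded)

model = TinySpeechRecognizer()
ctc_loss = nn.CTCLoss(blank=0)

# Dummy batch: 2 utterances of 200 log-mel frames, each paired with a
# 30-character target sequence (indices 1..39; index 0 is the CTC blank).
features = torch.randn(2, 200, 80)
targets = torch.randint(1, 40, (2, 30))

logits = model(features)
log_probs = logits.log_softmax(dim=-1).transpose(0, 1)  # CTC expects (time, batch, vocab)
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.full((2,), 200, dtype=torch.long),
                target_lengths=torch.full((2,), 30, dtype=torch.long))
loss.backward()  # a single objective trains encoder and classifier jointly
```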

Even with these developments, Ali still cannot give a voice command to most devices in his native Arabic. "It's 2021, and I still can't speak to many machines in my dialect," he remarks. "I mean, now I have a device that can understand my English, but multi-dialect Arabic speech recognition hasn't happened yet."

Making that happen is the focus of Ali's work, which culminated in the first transformer for Arabic speech recognition and its dialects, one that has achieved hitherto unmatched performance. Dubbed the QCRI Advanced Transcription System, the technology is currently being used by the broadcasters Al Jazeera, DW, and the BBC to transcribe online content.

There are a few reasons Ali and his team have succeeded at building these speech engines right now. Primarily, he says, "there is a need to have resources across all of the dialects. We need to build up the resources so we can then train the model." Advances in computer processing mean that computationally intensive machine learning now happens on graphics processing units, which can rapidly process and render complex graphics. As Ali says, "We have a great architecture, good modules, and we have data that represents reality."

Researchers from QCRI and Kanari AI recently built models that can achieve human parity on Arabic broadcast news. The system demonstrates the impact of subtitling Al Jazeera's daily reports on screen. While the English human error rate (HER) is about 5.6%, the research revealed that Arabic HER is significantly higher and can reach 10%, owing to the morphological complexity of the language and the lack of standard orthographic rules in dialectal Arabic. Thanks to recent advances in deep learning and end-to-end architectures, the Arabic speech recognition engine manages to outperform native speakers on broadcast news.
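Error rates like those quoted above are commonly computed as an edit distance between a transcript (whether produced by a person or a machine) and a reference, normalized by the reference length, i.e. word error rate. The sketch below is a minimal version of that metric; the example sentences are invented, and the exact scoring pipeline used in the study is not described here.

```python
# Minimal word error rate (WER) sketch: Levenshtein distance over words,
# normalized by the reference length. Example strings are made up.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

# One substitution out of five reference words -> 20% WER.
print(wer("the news bulletin starts now", "the news bulletin start now"))  # 0.2
```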

While speech recognition in Modern Standard Arabic appears to be well on track, researchers from QCRI and Kanari AI are busy testing the limits of dialectal processing and achieving impressive results. Since nobody speaks Modern Standard Arabic at home, attention to dialect is what we need to enable our voice assistants to understand us.

This content was written by Qatar Computing Research Institute, Hamad Bin Khalifa University, a member of Qatar Foundation. It was not written by MIT Technology Review's editorial staff.


