Machine learning improves Arabic speech transcription capabilities

Thanks to advancements in speech and natural language processing, there is hope that one day you may be able to ask your virtual assistant what the best salad ingredients are. Currently, it is possible to ask your home gadget to play music or to open on voice command, a feature already found in many devices.

If you speak Moroccan, Algerian, Egyptian, Sudanese, or any of the other dialects of the Arabic language, which vary immensely from region to region and are in some cases mutually unintelligible, it is a different story. If your native tongue is Arabic, Finnish, Mongolian, Navajo, or any other language with a high level of morphological complexity, you may feel left out.

These complex constructs intrigued Ahmed Ali and spurred him to look for a solution. He is a principal engineer at the Arabic Language Technologies group at the Qatar Computing Research Institute (QCRI), part of Qatar Foundation’s Hamad Bin Khalifa University, and the founder of ArabicSpeech, a “community that exists for the benefit of Arabic speech science and speech technologies.”

Ali became captivated by the idea of talking to cars, appliances, and gadgets many years ago while at IBM. “Can we build a machine capable of understanding different dialects: an Egyptian pediatrician to automate a prescription, a Syrian teacher to help children getting the core parts from their lesson, or a Moroccan chef describing the best couscous recipe?” he asks. However, the algorithms that power those machines cannot sift through the approximately 30 varieties of Arabic, let alone make sense of them. Today, most speech recognition tools function only in English and a handful of other languages.

The coronavirus pandemic has further fueled an already intensifying reliance on voice technologies, as natural language processing technologies have helped people comply with stay-at-home guidelines and physical distancing measures. However, while we have been using voice commands to aid in e-commerce purchases and manage our households, the future holds yet more applications.

Millions of people worldwide use massive open online courses (MOOCs) for their open access and unlimited participation. Speech recognition is one of the main features of MOOCs: students can search within specific areas of a course's spoken content and enable translations via subtitles. Speech technology also makes it possible to digitize lectures, displaying spoken words as text in university classrooms.

According to a recent article in Speech Technology magazine, the voice and speech recognition market is forecast to reach $26.8 billion by 2025, as millions of consumers and companies around the globe come to rely on voice bots not only to interact with their appliances or cars but also to improve customer service, drive health-care innovations, and improve accessibility and inclusivity for those with hearing, speech, or motor impediments.

In a 2019 survey, Capgemini forecast that by 2022, more than two out of three consumers would opt for voice assistants rather than visits to stores or bank branches, a share that could justifiably spike, given the home-based, physically distanced life and commerce that the pandemic has forced upon the world for more than a year and a half.

Nonetheless, these devices fail to deliver to vast swaths of the globe. For those 30 types of Arabic and the millions of people who speak them, that is a substantial missed opportunity.

Arabic for machines

English- or French-speaking voice bots are far from perfect. Yet teaching machines to understand Arabic is particularly tricky for several reasons. These are three commonly recognized challenges:

Lack of diacritics. Arabic dialects are vernacular, that is, primarily spoken. Most of the available text is nondiacritized, meaning it lacks accents such as the acute (´) or grave (`) that indicate the sound values of letters. Therefore, it is difficult to determine where the vowels go.
Lack of resources. There is a dearth of labeled data for the different Arabic dialects. Collectively, they lack standardized orthographic rules that dictate how to write a language, including norms of spelling, hyphenation, word breaks, and emphasis. These resources are crucial to train computer models, and the fact that there are too few of them has hobbled the development of Arabic speech recognition.
Morphological complexity. Arabic speakers engage in a lot of code switching. For example, in areas colonized by the French (the North African countries of Morocco, Algeria, and Tunisia), the dialects include many borrowed French words. Consequently, there is a high number of what are called out-of-vocabulary words, which speech recognition technologies cannot fathom because these words are not Arabic.
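To make two of these challenges concrete, here is a minimal Python sketch. It is an illustration only and not part of any QCRI system: the function names, the toy vocabulary, and the sample words are all assumptions chosen for the example. The first function shows how removing the short-vowel marks from a diacritized Arabic word leaves only the consonant skeleton found in most available text. The second computes the share of tokens that fall outside a known vocabulary, as borrowed French words would.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop nonspacing marks (Unicode category Mn).

    Arabic short-vowel signs such as fatha (U+064E) are in this category,
    so the result resembles the nondiacritized text found in most corpora.
    The string is deliberately not decomposed first, so letters such as
    alef with hamza stay intact.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def oov_rate(tokens, vocabulary) -> float:
    """Fraction of tokens that do not appear in the vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


# Hypothetical example: "kataba" written with three fatha marks.
diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_diacritics(diacritized)  # the three consonants only

# Hypothetical code-switched utterance scored against a toy Arabic vocabulary.
toy_vocabulary = {"\u0634\u0643\u0631\u0627"}  # one Arabic word
utterance = ["merci", "\u0634\u0643\u0631\u0627", "bonjour"]
rate = oov_rate(utterance, toy_vocabulary)  # two of three tokens are unknown
```

Going in the other direction, from bare consonants back to the vowels, is the hard part: the marks cannot be recovered by a simple rule, which is why their absence is a challenge for training.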

“But the field is moving at lightning speed,” Ali says. Many researchers are collaborating to make it move even faster. Ali’s Arabic Language Technologies lab is leading the ArabicSpeech project to bring together Arabic translations with the dialects that are native to each region. For example, Arabic dialects can be divided into four regional groups: North African, Egyptian, Gulf, and Levantine. However, given that dialects do not comply with boundaries, this can go as fine-grained as one dialect per city; for example, an Egyptian native speaker can distinguish an Alexandrian’s dialect from that of a fellow citizen from Aswan (about 1,000 kilometers away on the map).

Building a tech-savvy future for all

At this point, machines are about as accurate as human transcribers, thanks in great part to advances in deep neural networks, a subfield of machine learning in artificial intelligence that relies on algorithms inspired by how the human brain works, biologically and functionally. However, until recently, speech recognition has been a bit hacked together. The technology has a history of relying on different modules for acoustic modeling, building pronunciation lexicons, and language modeling, all of which need to be trained separately. More recently, researchers have been training models that convert acoustic features directly to text transcriptions, potentially optimizing all parts for the end task.

Even with these advancements, Ali still cannot give a voice command to most devices in his native Arabic. “It’s 2021, and I still cannot speak to many machines in my dialect,” he comments. “I mean, now I have a device that can understand my English, but machine recognition of multi-dialect Arabic speech hasn’t happened yet.”

Making this happen is the focus of Ali’s work, which has culminated in the first transformer for Arabic speech recognition and its dialects, one that has achieved hitherto unmatched performance. Dubbed QCRI Advanced Transcription System, the technology is currently being used by the broadcasters Al-Jazeera, DW, and BBC to transcribe online content.

There are a few reasons Ali and his team have been successful at building these speech engines right now. Primarily, he says, “There is a need to have resources across all of the dialects. We need to build up the resources to then be able to train the model.” Advances in computer processing mean that computationally intensive machine learning now happens on a graphics processing unit, which can rapidly process and display complex graphics. As Ali says, “We have a great architecture, good modules, and we have data that represents reality.”

Researchers from QCRI and Kanari AI recently built models that can achieve human parity in Arabic broadcast news. The system demonstrates the impact of subtitling Al-Jazeera daily reports. While the English human error rate (HER) is about 5.6%, the research revealed that the Arabic HER is significantly higher and can reach 10%, owing to morphological complexity in the language and the lack of standard orthographic rules in dialectal Arabic. Thanks to the recent advances in deep learning and end-to-end architecture, the Arabic speech recognition engine manages to outperform native speakers in broadcast news.
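The article does not say how these error rates were computed. A metric commonly used to score transcriptions, whether produced by a person or a machine, is word error rate: the number of word substitutions, insertions, and deletions needed to turn the transcript into the reference, divided by the number of words in the reference. The Python sketch below assumes that metric; the function name and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # reference word missing from transcript
                    current[j - 1] + 1,  # extra word in transcript
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref)


# Hypothetical example: one substituted word and one missing word
# against a four-word reference.
score = word_error_rate("a b c d", "a x c")
```

Under this metric, a language without standard spelling rules is penalized even when a transcript is arguably right, because two valid spellings of the same word count as a substitution. That is consistent with the higher human error rate reported for dialectal Arabic.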

While Modern Standard Arabic speech recognition seems to work well, researchers from QCRI and Kanari AI are engrossed in testing the boundaries of dialectal processing and achieving great results. Since no one speaks Modern Standard Arabic at home, attention to dialect is what we need to enable our voice assistants to understand us.

This content was written by Qatar Computing Research Institute, Hamad Bin Khalifa University, a member of Qatar Foundation. It was not written by MIT Technology Review’s editorial staff.