July 2, 2019
Written by: Nitsan Goldstein
Communication is key to our daily lives. Most people use spoken language to communicate their feelings, needs, and desires to others. This is why an injury or illness that impairs a person’s ability to speak can be particularly devastating. In conditions such as amyotrophic lateral sclerosis (ALS), people often know exactly what they want to say but have lost control of the muscles necessary to produce speech. Current options for these patients are limited and rely on spelling-based approaches, which are much slower than spoken language. Therefore, neuroscientists and engineers are working to create a computer that will decode neural signals and produce speech, allowing patients to bypass the dysfunctional muscles and communicate freely again.
How does this technology work?
In a recent publication in the journal Nature¹, a group at the University of California, San Francisco (UCSF) described a new way to convert brain activity to speech that greatly improved upon older designs. They adapted a powerful technique: the brain-computer interface (BCI; for more about BCI see a full NeuroKnow article about it here). In short, a BCI typically uses brain activity to control a robot that can be used, for example, as a prosthetic arm in patients who have had an amputation. The patients imagine moving their arm or move a cursor on a screen while activity in their motor cortex (the area of the brain responsible for moving your muscles) is monitored. The computer then learns which patterns of neural activity are associated with the patient attempting to make certain movements. The robotic arm is then programmed to respond accordingly. After intensive training, some patients are able to move the prosthetic arm in much the same way as a natural arm, just by using their brain! The group at UCSF reasoned: if a BCI can be used to recreate arm movements, why not recreate the movements in the mouth and throat that produce speech?
This idea was the basis for the recent advancement in speech BCI technology. The group recorded neural activity using electrocorticography (ECoG), in which electrodes sit directly on the surface of the brain, from five participants while they spoke several sentences. Then, using mathematical models that are built to learn complex relationships much like a brain, they taught a computer to associate the neural activity with the spoken sentences. Importantly, they also generated data describing the precise movements of the mouth and throat muscles that are necessary to produce the spoken sentences, based on earlier studies that used special sensors on these muscles to detect movements. This key piece of information helped the model learn the associations much more efficiently, and it is what distinguishes this study from previous attempts. So, with the neural activity recordings, muscle movement data, and the actual spoken sound, the model learned which neural activity patterns led to which muscle movements and spoken words. Once the model learned these relationships, it generated spoken language based solely on neural activity (Figure 1). The group compared the features of these sentences with the original spoken sentences and found them to be very similar. Incredibly, the sentences even included emphasis on certain words and diction similar to human speech, as this information is also encoded in tiny muscle movements.
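For readers who like to see ideas in code, the two-step logic described above (neural activity to muscle movements, then muscle movements to sound) can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the data are random numbers rather than real recordings, the array sizes are arbitrary, and ordinary least-squares regression stands in for the much more powerful brain-inspired models the researchers used. Names such as `fit_linear` and `decode` are hypothetical and do not come from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary toy sizes (assumptions, not values from the study).
n_samples, n_electrodes, n_articulators, n_acoustic = 500, 64, 12, 8

# Hidden mappings used only to fabricate synthetic data.
true_neural_to_move = rng.normal(size=(n_electrodes, n_articulators))
true_move_to_sound = rng.normal(size=(n_articulators, n_acoustic))

# Fake "recordings": neural activity, muscle movements, and sound features.
neural = rng.normal(size=(n_samples, n_electrodes))
movements = neural @ true_neural_to_move
movements += 0.1 * rng.normal(size=movements.shape)
sound = movements @ true_move_to_sound
sound += 0.1 * rng.normal(size=sound.shape)


def fit_linear(x, y):
    """Return least-squares weights w such that x @ w approximates y."""
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    return w


train, test = slice(0, 400), slice(400, 500)

# Stage 1 learns neural activity -> muscle movements.
w_stage1 = fit_linear(neural[train], movements[train])
# Stage 2 learns muscle movements -> sound features.
w_stage2 = fit_linear(movements[train], sound[train])


def decode(neural_activity):
    """Generate sound features from neural activity alone."""
    estimated_movements = neural_activity @ w_stage1
    return estimated_movements @ w_stage2


predicted = decode(neural[test])
correlation = np.corrcoef(predicted.ravel(), sound[test].ravel())[0, 1]
print(f"correlation between decoded and actual sound: {correlation:.3f}")
```

The point of the sketch is the structure, not the numbers: at decoding time only `neural` is needed, while the movement data serve as an intermediate target during training.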
A few questions remained, however. First, patients using this technology in the future may not be able to speak to train the model as the participants in this study did. Therefore, the model must be able to generate speech without audible language from the subject to train it. To test this ability, the researchers had subjects silently mime sentences instead of speaking them aloud. They found that, while the model did not perform as well as when trained on spoken language, it still generated speech that was very similar to a trial in which the subject did audibly speak the sentence. Second, the scientists needed to make sure the model could decode sentences that it had never encountered before. To test this, the group trained the model twice: once with all the sentences spoken by a subject and once with one sentence removed. They then compared the speech the two models produced when fed the neural activity from the sentence that had been removed from one model's training. They found, amazingly, that there was no difference between the two generated sentences. This means that the second model was able to reproduce speech from neural activity that was not used to train it.
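The "train twice, once with a sentence removed" test can also be sketched in Python. As before, this is a hedged toy example: the data are synthetic, the sizes and the choice of held-out sentence are arbitrary assumptions, and a simple least-squares model replaces the researchers' actual models. The helper `fit` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary toy sizes (assumptions, not values from the study).
n_sentences, n_timesteps, n_electrodes, n_acoustic = 20, 50, 16, 4

# Synthetic neural activity and matching speech features per sentence.
true_w = rng.normal(size=(n_electrodes, n_acoustic))
neural = rng.normal(size=(n_sentences, n_timesteps, n_electrodes))
speech = neural @ true_w
speech += 0.1 * rng.normal(size=speech.shape)


def fit(sentence_ids):
    """Fit least-squares weights using only the listed sentences."""
    x = neural[sentence_ids].reshape(-1, n_electrodes)
    y = speech[sentence_ids].reshape(-1, n_acoustic)
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    return w


held_out = 7  # arbitrary choice of sentence to remove
all_ids = np.arange(n_sentences)

w_full = fit(all_ids)                          # trained on every sentence
w_partial = fit(all_ids[all_ids != held_out])  # never saw the held-out one

# Feed both models the neural activity from the held-out sentence.
out_full = neural[held_out] @ w_full
out_partial = neural[held_out] @ w_partial

similarity = np.corrcoef(out_full.ravel(), out_partial.ravel())[0, 1]
print(f"similarity of the two generated outputs: {similarity:.3f}")
```

If the two outputs are nearly identical, the model that never saw the sentence has generalized rather than memorized, which is the logic of the test described above.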
Importantly, the group needed to test how close these reproduced sentences really were to actual human speech. To answer this question, they presented the computer-generated sentences to hundreds of experimental subjects and asked them to identify the words spoken. The words and sentences were understood about 70% of the time, a significant improvement over previous models (to listen for yourself, see this video posted by the research group).
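One common way to score a listening test like this is the word error rate: the number of word substitutions, insertions, and deletions needed to turn what the listener wrote into the true sentence, divided by the length of the true sentence. The short Python sketch below shows the idea. It is an illustration only: the function name `word_error_rate` and the example sentences are made up, and it is not claimed to be the exact scoring procedure used in the study.

```python
def word_error_rate(reference: str, heard: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = heard.lower().split()
    if not ref:
        raise ValueError("reference sentence must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j words the listener wrote down.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # listener missed a word
                curr[j - 1] + 1,     # listener added a word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one word out of five was misheard.
reference = "the quick brown fox jumps"
heard = "the quick brown dog jumps"
print(word_error_rate(reference, heard))
```

A lower word error rate means the synthesized speech was easier to understand; a perfect transcription scores 0.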
So, what are we waiting for? How far are we from actually using this technology to help people who could benefit from it?
How far is this technology from widespread use?
There are a few important points to consider before making this technology widely available. First, while 70% of computer-spoken words were understood, that rate is far lower than for natural human speech. In an interview², one of the authors was optimistic that current tools should allow scientists to refine the model and increase that number even further. There is clear room for improvement: the authors noted, for example, that the computer was better at decoding sounds like ‘sh’ and ‘z’ than sounds like ‘b’ and ‘p’. Second, this technology relies on electrocorticography (ECoG) electrodes that are placed directly on the surface of the brain, not on the scalp like the sensors used in traditional electroencephalography (EEG). This means that each patient would have to undergo brain surgery to use the device. Obviously, this risk must be considered and weighed against the benefits of restored speech. However, the recent work discussed here is certainly an exciting step toward making this technology readily available. The hope is that these techniques will be refined and improved upon to offer a safer and highly efficient way to give millions of people their voices back.
1. Anumanchipalli, G.K., Chartier, J., Chang, E.F. Speech synthesis from neural decoding of spoken sentences. Nature 568, 493–498 (2019).
2. Taylor, C. California scientists have found a way to translate thoughts into computer-generated speech. CNBC (2019). https://www.cnbc.com/2019/04/25/california-scientists-found-a-way-to-translate-thoughts-into-speech.html
Cover image and Figure 1 created using BioRender.