A group of collaborators at TTIC and the University of Chicago, led by Professor Karen Livescu (TTIC), Professor Greg Shakhnarovich (TTIC), and Professor Diane Brentari (University of Chicago, Department of Linguistics), is working on automatic translation of American Sign Language video to written English. The research combines problems from natural language processing, computer vision, linguistics, and machine learning. The seeds of the project were planted in 2010.
“At that time, we were focusing on collecting data ourselves here at TTIC, so we used studio data that was very controlled,” Professor Livescu said. “The current incarnation of the project really got going a few years ago when we started collecting larger-scale, real-world data online.”
The research area of sign language translation is still in its early stages, according to Professor Livescu, but it is an important step toward making today's technologies more accessible to the Deaf community. Today's AI technologies can generate captions for many spoken languages from audio, offer intelligent virtual assistant services such as Alexa or Siri, and provide written and spoken language translation services.
However, these technologies are generally not accessible to those who rely on sign language to communicate. According to Professor Shakhnarovich, this project has the potential for societal impact because better sign language processing would make information technology more accessible to Deaf people. According to the World Federation of the Deaf, there are about 70 million deaf people worldwide and over 200 different sign languages.
There is also a great deal of visual media produced by Deaf individuals in sign languages, such as news, video blogs, and discussions, that is not generally accessible to people who do not know sign language. Important discussions and discourse that go on in Deaf communities don't get as much attention outside those communities.
“The same way you can Google search for news articles, documents, and media for written and spoken languages, we will one day be able to search for and access media that is in sign language or produced by Deaf community members and have it automatically translated,” Professor Livescu said.
Several TTIC students have been involved in this project. Taehwan Kim, a TTIC student who graduated in 2016, worked on the studio data, and Bowen Shi (TTIC) pioneered the use of real-world (naturally occurring) video data on a larger scale. To learn more about Shi's work on this project, read his highlight article here. PhD candidate Marcelo Sandoval-Castañeda (TTIC) is also currently involved in the project.
“We’ve also had many students from the University of Chicago’s linguistics department involved in various ways, such as informing us about properties of sign language we should be paying attention to or annotating, as well as helping with data collection,” Professor Livescu said. “For example, Jonathan Keane (University of Chicago), a recent PhD graduate, did research on the linguistics of handshape, which informed our work. Aurora Martinez del Rio (University of Chicago) was instrumental in annotating and scaling up our data collection. Graduate student Saulé Tuganbaeva (University of Chicago) is also currently involved in analyzing our natural ASL data from a linguistic perspective.”
The group's biggest progress has been collecting large-scale, real-world (naturally occurring) data sets, first for fingerspelling and then for general American Sign Language (ASL). Fingerspelling is a component of sign language in which words are spelled out letter by letter; it makes up roughly 12–35% of ASL and is often used to communicate names, brands, organizations, or complex terms. Most recently, the collaboration has produced a data set of real-world ASL videos with English translations. All of the data sets have been released publicly: the fingerspelling data is known as Chicago Fingerspelling in the Wild (ChicagoFSWild), and the ASL translation data as OpenASL.
There are various obstacles that make sign language translation challenging.
Sign languages are an example of “low-resource languages,” languages for which far less data is available. Tools like Google Translate and speech recognition systems work well for the world's most widely spoken languages, but there are thousands of low-resource languages that today's technology does not serve nearly as well.
“With written languages, you can record every use of every word, but with [sign languages] there isn’t nearly as much data compared to written languages,” Professor Shakhnarovich said. “In fact, other low-resource languages besides sign languages have similar obstacles, so we hope that this project can impact research on other low-resource languages as well, by harnessing smaller amounts of data to build better-performing models.”
Sign languages also raise interesting challenges that spoken and written languages don't. They are visual languages that use three-dimensional space to encode meaning in ways spoken and written languages cannot, which means that understanding them requires not only speech and language processing technologies but also computer vision.
“You may think that computer vision researchers who have been working on tracking human poses in videos should be able to track sign language poses,” Professor Livescu said. “Unfortunately, computer vision models for pose tracking don’t work well for the motions that are involved in sign language, which tend to be very fast and involve co-articulation, and which are constantly moving as opposed to holding static poses. Every gesture you make will affect how the next gesture works, and you need to take into account that this is a language.”
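For readers who want a concrete picture of what off-the-shelf pose tracking looks like, here is a minimal sketch, assuming Python with the OpenCV and MediaPipe libraries (illustrative only, not the research group's actual pipeline). It extracts per-frame body and hand landmarks from a video file; on fast, co-articulated signing, the hand landmarks frequently come back empty, which is exactly the kind of failure mode Professor Livescu describes.

```python
# Illustrative sketch only (not the TTIC/UChicago group's pipeline):
# run an off-the-shelf pose tracker (MediaPipe Holistic) over a sign
# language video and collect per-frame body and hand landmarks.
import cv2
import mediapipe as mp

def extract_landmarks(video_path):
    """Return per-frame (pose, left hand, right hand) landmarks for a video."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame_bgr = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV reads frames as BGR.
            results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
            # On fast, blurred signing motions the hand landmarks are often
            # None, illustrating the limitation described above.
            frames.append((results.pose_landmarks,
                           results.left_hand_landmarks,
                           results.right_hand_landmarks))
    cap.release()
    return frames
```

Per-frame keypoints like these are only a starting point: as the quote above notes, each sign shapes the next, so a translation system also has to model signing as a language rather than as a sequence of static poses.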
“Our hope is that we are able to develop technology that works for a large spectrum of sign languages in the world,” Professor Shakhnarovich said.
Collaboration across discipline boundaries has been very important for this project.
“When you’re working in this area, you’re working between language, computer vision, linguistics, and machine learning, so it’s important to be open to perspectives from all of these disciplines,” Professor Livescu said.