Posted 3 years, 3 months ago by

Rosetta Project interview on ABC Canberra



Posted 3 years, 3 months ago by Laura Welcher

Social media promote the use of endangered languages

While globalization is usually considered a primary factor in language endangerment, global economies also provide access to inexpensive communication technologies like the internet and mobile devices - and these technologies are increasingly enlisted as tools to increase the use of endangered languages, as BBC News recently reported.

Many endangered language speech communities are gravitating towards Twitter, as well as social media services like Facebook, to promote language use and language learning. For children especially, the ability to use their heritage language with these ubiquitous social media sites provides an essential "coolness" factor, giving their languages relevance and an important new domain of use in the modern world.

Those who use smaller languages on public sites like blogs, or Twitter, are creating an additional resource that they are probably unaware of: the language that they craft and post helps build a text corpus for their language that can pave the way for better tools to enhance that language's use online.

Dr. Kevin Scannell, a computer scientist, mathematician and endangered language speaker, has created a multilingual web crawler called An Crúbadán (which literally means "crawler" in Irish). The crawler identifies and computes the probability of 3-character sequences, which provide a unique "fingerprint" for any given written language. A recent trawl netted quite a catch: over 1,000 different languages being used online.


According to Scannell, the identified ever-growing corpora provide a means "to develop basic resources that help people use their language online: keyboard input methods, spell checkers, online dictionaries, and so on." In his other research, he has developed crawlers that explicitly capture endangered language Tweets (Indigenous Tweets) as well as blogs (Indigenous Blogs) which he says "aim to strengthen languages through social media."
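The 3-character "fingerprint" idea can be illustrated with a minimal sketch. This is not Scannell's code: the function names, the cosine-similarity comparison, and the two tiny training sentences are all assumptions made for illustration, and a real crawler would train on far larger samples for each language.

```python
from collections import Counter
import math

def trigram_profile(text):
    """Relative frequency of every 3-character sequence in the text."""
    cleaned = " ".join(text.lower().split())
    counts = Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}

def cosine_similarity(p, q):
    """Similarity of two trigram profiles (1.0 means identical direction)."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def identify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(sample, profiles[lang]))

# Toy training data (hypothetical, far too small for real use).
TRAINING = {
    "English": "the house is near the river and the children are playing there",
    "Irish": "tá an teach in aice leis an abhainn agus tá na páistí ag súgradh",
}
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING.items()}

print(identify("the children are in the house", PROFILES))
print(identify("tá na páistí in aice leis an teach", PROFILES))
```

Even with two sentences of training text, sequences like "the" or "tá " are distinctive enough to separate the samples, which hints at why the approach scales to many languages once a real corpus is available.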


Posted 3 years, 4 months ago by Alex Mensing

Language: Speed vs Density

The September 02011 issue of the journal Language included an article entitled “A Cross-Language Perspective on Speech Information Rate,” by a team of linguists working with the University of Lyon and the French National Center for Scientific Research. Like many linguistic studies, this one investigates the parameters of human language and seeks to identify commonalities that hold true across languages. Given, however, that universal grammatical rules have proven more difficult to define than linguists might have hoped, this study was designed to test the universality of a different factor - time.

The authors hypothesize that “a trade-off is operating between a syllable-based average information density and the rate of transmission of syllables in human communication.” Basically, a language that is spoken quickly - in terms of syllables per second - uses more syllables than one that is spoken slowly in order to say the same thing.

To test their hypothesis they analyzed audio recordings of native speakers of seven different languages reading brief texts written in various styles. There were twenty texts, each composed in English and translated into the other languages. The authors compared the number of syllables that each language used in a given text, as well as the amount of time taken by speakers of different languages to actually say the entire texts. And indeed, they found that for the most part the languages whose texts used more syllables were spoken faster, and vice versa, resulting in equivalent rates of information output. These are two complementary strategies for encoding and transmitting ideas.

"One has to consider the [...] loose hypothesis that [the information rate of the language] varies within a range of values that guarantee efficient communication, fast enough to convey useful information and slow enough to limit the communication cost (in its articulatory, perceptual, and cognitive dimensions)."
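The trade-off can be made concrete with a little arithmetic. The sketch below is only an illustration: the syllable counts and durations are invented rather than taken from the study, and treating a whole text as one unit of information is a simplifying assumption, not the authors' actual normalization.

```python
def speech_measures(syllables, seconds):
    """Return (syllable rate, information density, information rate) for one reading.

    The text's total information is normalized to 1 unit, so density is
    1 / syllables and information rate is density * syllable rate.
    """
    syllable_rate = syllables / seconds   # syllables per second
    density = 1.0 / syllables             # information per syllable
    return syllable_rate, density, syllable_rate * density

# Hypothetical readings of the same text in two made-up languages.
readings = {
    "Language A (few syllables, spoken slowly)": (100, 20.0),
    "Language B (many syllables, spoken quickly)": (160, 20.0),
}

for name, (syllables, seconds) in readings.items():
    rate, density, info_rate = speech_measures(syllables, seconds)
    print(f"{name}: {rate:.1f} syll/s, density {density:.5f}, info rate {info_rate:.3f}")
```

Under this simplification the information rate reduces to one text per reading time, so two languages that take equally long to say the same thing have equal information rates, however differently they divide the work between syllable count and speed.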

The premise here is that each translation of each of the texts communicates all of the information communicated in each other translation, adding and subtracting nothing, simply encoding the information according to the rules of each language. It would be interesting to discuss this premise with respect to the notion of linguistic relativity, which argues that your native language actually influences the way you perceive reality. Or in terms of issues such as evidentials or honorifics, which can require that certain information - and therefore more syllables - be included in a statement which in another language might be superfluous. Further research might also be able to analyze across more language families. The seven languages used in this study were English, German, French, Spanish, Italian, Mandarin Chinese and Japanese.

The authors also noted that the syllables themselves in quickly spoken languages are on average less complex, in that they are composed of fewer sounds (e.g. ‘law’ vs. ‘claw’ - both are one syllable, but with different numbers of phonemes). As an initial investigation into the speed with which people communicate through speech, this is a fascinating study.

It seems that humans may be naturally and universally self-regulating when it comes to communicating through speech. There is a balance that cannot be disturbed: fast syllables are not allowed to carry too much meaning, and syllables with lots of information must be spoken slowly.



Posted 3 years, 4 months ago by Laura Welcher

PanLex joins Rosetta at Long Now

PanLex, the newest project under the umbrella of The Long Now Foundation, has an ambitious plan: to create a database of all the words of all of the world's languages. The plan is not merely to collect and store them, but to link them together so that any word in any language can be translated into a word with the same sense in any other language. Think of it as a multilingual translating dictionary on steroids.

You may wonder how this is different from some of the other popular translation tools out there. The more ambitious tools, such as Babelfish and Google Translate, try to translate sentences, while the more modest tools, such as Global Glossary, Glosbe, and Logos, limit their scope to individual words. PanLex belongs to the second, words-only, group, but is far more inclusive. While Google Translate covers 64 languages and Logos almost 200 languages, PanLex is edging close to 7,000 languages. With the knowledge stored in PanLex, translations can be produced extending beyond those found in any dictionary.

Here’s an example to give the basic idea of how it works. Say you want to translate the Swahili word ‘nyumba’ (house) into Kyrgyz (a Central Asian language with about 3 million speakers). You’re unlikely to find a Swahili–Kyrgyz dictionary; if you look up ‘nyumba’ in PanLex you’ll find that even among its half a billion direct (attested) translations there isn’t any from this Swahili word into Kyrgyz. So you ask PanLex for indirect translations. PanLex reveals translations of ‘nyumba’ that, in turn, have four different Kyrgyz translations. Three of these (‘башкы уяча’, ‘үй барак’, and ‘байт’) each have only one or two links to ‘nyumba’. But a fourth Kyrgyz word, ‘үй’, is linked to ‘nyumba’ by 45 different intermediary translations. You look them over and conclude that ‘үй’ is the most credible answer.

How confident can you be of your inferred translation—that Swahili ‘nyumba’ can be translated into Kyrgyz ‘үй’? After all, anyone who has played the game of “translation telephone” (where you start with Language A, translate into Language B, go from there to Language C and then translate back to Language A) will know this kind of circular translation can result in hilarious mismatches. But PanLex is designed to overcome “semantic drift” by allowing multiple intermediary languages. Paths from ‘nyumba’ to ‘үй’, for example, run through diverse languages from Azerbaijani to Vietnamese. Based on such multiple translation paths, translation engines can provide ranked “best fit” translations. As the database grows, especially in its coverage of “long tail” languages, possible translation paths will multiply, boosting reliability.
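The ranking idea can be sketched in a few lines. This is not PanLex's engine or API: the function names are hypothetical, and the handful of translation pairs below are toy data invented for illustration, not records from the PanLex database. The sketch simply counts how many distinct intermediary words link a source word to each candidate in the target language.

```python
from collections import defaultdict

def build_graph(pairs):
    """Build an undirected graph from attested (language, word) translation pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def indirect_translations(graph, source, target_language):
    """Rank target-language words by the number of distinct intermediaries."""
    support = defaultdict(set)
    for pivot in graph.get(source, ()):
        for candidate in graph.get(pivot, ()):
            if candidate[0] == target_language and candidate != source:
                support[candidate[1]].add(pivot)
    ranked = sorted(support.items(), key=lambda item: len(item[1]), reverse=True)
    return [(word, len(pivots)) for word, pivots in ranked]

# Toy attested translations (hypothetical; the real database holds hundreds of millions).
ATTESTED = [
    (("Swahili", "nyumba"), ("English", "house")),
    (("Swahili", "nyumba"), ("English", "home")),
    (("Swahili", "nyumba"), ("Vietnamese", "nhà")),
    (("Swahili", "nyumba"), ("Azerbaijani", "ev")),
    (("English", "house"), ("Kyrgyz", "үй")),
    (("Vietnamese", "nhà"), ("Kyrgyz", "үй")),
    (("Azerbaijani", "ev"), ("Kyrgyz", "үй")),
    (("English", "home"), ("Kyrgyz", "байт")),
]

graph = build_graph(ATTESTED)
for word, links in indirect_translations(graph, ("Swahili", "nyumba"), "Kyrgyz"):
    print(word, links)
```

Counting distinct intermediaries is the simplest possible score. A fuller engine might also weigh the reliability of each source dictionary or follow longer paths, but the principle is the same: the more independent routes agree on a word, the less likely it is a product of semantic drift.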

Translation of Swahili 'nyumba' into Kyrgyz

There are a couple of demonstrations that you can try with a browser. This will give you a sense of the magnitude of the data and the potential power of the database as a tool. One of these is TeraDict. If you enter a common English word like ‘house’ or ‘love’ you are likely to get translations into hundreds, or even thousands, of languages, and in some cases many translations per language. French, for example, has 25 translations for ‘house’ and 55 translations of ‘love’, including ‘zéro’ (hint: Think tennis!). Two similar interfaces allow you to explore the database in either Esperanto—InterVorto—or Turkish—TümSöz.

The second web tool, PanLem, is considerably more complicated and is used mostly by PanLex developers to enlarge and evaluate the database. But it’s publicly accessible. There is a step-by-step "cheat sheet" to help you climb the learning curve.

PanLex is an ongoing research project, with most of its growth yet to come, but the database already documents 17 million expressions and 500 million direct translations, from which billions of additional translations can be inferred.

PanLex is being built using data from about 3,600 bilingual and multilingual dictionaries, most in electronic form. The process of ingesting data into the database involves substantial curation and standardization by PanLex editors to ensure data quality. The next stage of collection will likely involve dictionaries that exist only in print form. It is hard to say how many are out there, but we expect it is on the order of tens of thousands. It is likely that most of these have not been scanned or digitized. Once they are, there will be a significant effort to improve the optical character recognition (OCR) for these materials—an effort which is likely to be highly informative to the development of OCR technology, since it will involve the human identification of many forms of many different scripts for languages around the world.

PanLex is working closely with the Rosetta Project. PanLex is a wonderful realization of the Rosetta Project’s original goal in building a massive, and massively parallel, lexical collection for all of the world’s languages.



Posted 3 years, 8 months ago by Austin Brown

Dr. Laura Welcher at Berkeley Language Center, November 9th

The Berkeley Language Center will be hosting a talk by Long Now’s Dr. Laura Welcher on November 9th. The talk is open to the public and starts at 3:00pm in Dwinelle Hall B-4.

The Rosetta Project at The Long Now Foundation is working to build an open public digital collection of all human language as well as an analog backup that can last for thousands of years–The Rosetta Disk. In the “long now,” the goal is long-term storage and access to information–on the scale that both supports and transcends individual human societies and civilizations. In the “here and now,” the project serves to support and amplify the importance of the world’s nearly 7,000 human languages, the vast majority of which are endangered and, if current trends continue, likely to go extinct in the next 100 years. I’ll present our current work on the Rosetta Project Collection and Disk as well as some new initiatives including the “Language Commons” where we are working to help build the multilingual Web.
There will be a reception afterwards; come say Hello.


Posted 3 years, 8 months ago by Austin Brown

Dr. Laura Welcher speaking at the Internationalization and Unicode Conference

With thousands of languages and writing systems used all over the world, making computers and the web widely accessible has taken a herculean effort, with much yet to be done.

One of the main tools used in the expansion of the web’s global reach is Unicode - a database of over 109,000 characters from 93 different writing systems and the standards for using and representing them.

Unicode is maintained by The Unicode Consortium, which sponsors a conference each year to share knowledge and discuss the future of Unicode.

This year the Internationalization and Unicode Conference will be held October 17th - 19th in Santa Clara, CA.

Long Now’s Dr. Laura Welcher will be delivering a keynote presentation on Tuesday, October 18th, on her work on The Rosetta Project, a publicly accessible digital library of human languages, and The Language Commons:

The Rosetta Project shares the Unicode vision of a world where people can use communication technology on their own terms - in their own language.

According to World Internet Statistics, over 80% of all web communication is in about ten languages, with over half in either English or Chinese. The remaining 20% represent "everyone else" including about 400 languages with speaker populations above 1 million, which collectively comprise about 95% of everyone on earth.

Because of essential technologies like Unicode, we are poised to see this breadth of human languages flourish online and on mobile devices, providing for these languages a critical new domain of language use in the modern world. I will present several efforts underway at The Rosetta Project including the "Language Commons" that rely on Unicode as an essential technology in building the multilingual Web.


Posted 3 years, 10 months ago by Laura Welcher

A Record-a-thon for Human Language

On July 30, 02011 The Rosetta Project partnered with to hold the first human language Record-a-thon at the Internet Archive. This is an event we developed to test the idea that with a few basic guidelines, anyone can use common video devices to help document human language.


The idea is that by creating a 5-10 minute unedited video, and providing basic information about it - essentially just saying what language you think it is in - and then uploading it to the Rosetta Project collection in the Internet Archive, you are helping build a corpus of valuable data for that language. You don't need to be a specialist, and by archiving it you create a resource that others can build on, for many different useful purposes - from language learning and teaching, to linguistic analysis, to building the tools that enable a language to be used with modern technology.

This introductory talk by Dr. Laura Welcher, made the morning of the event, describes the ideas behind the creation of the Record-a-thon:

In the course of a single day, in-person and remote participants together created about 85 videos in 34 different languages. There were speakers of all ages, native and non-native, some quite fluent while others were learners practicing their skills. All the videos they created are interesting to watch and are available here in the Rosetta Project video collection. They recorded conversations, told stories, histories, and jokes, recited poems, and sang lullabies. Here is a sampling:

  • Chihota speaking his native Shona. Shona is a language of Zimbabwe with about 11 million speakers. Chihota took home one of the Record-a-thon prizes, having made recordings of himself speaking Shona, Swahili, Sheng (an emergent Swahili-English mixed language) and Chilapalapa (a pidgin that emerged in the mines of South Africa). He also speaks fluent English and Russian. Chihota was unsure that we would consider all of these languages but we assured him we were interested in them all:


  • Arturo Avila speaking his native Mixteco Bajo from Oaxaca, Mexico. Mixtec languages comprise a cluster of about 50 related languages in Mexico, having anywhere from a few hundred to a few thousand speakers each. Mr. Avila was the lucky Record-a-thon raffle winner of an iPad 2 (participants were given raffle tickets for each recording they uploaded, and Mr. Avila uploaded a bunch!):


  • Anita Suter speaking her native language Swiss German, in the Ostschweizer dialect. Standard German is one of the official languages of Switzerland, along with French, Italian and Romansch. Swiss German, with approximately 6.5 million speakers, is the spoken variety of German used daily in Switzerland, and it has many dialects, many of which are mutually unintelligible. These dialects are used alongside Standard German, a spoken and written variety which is reserved for more official purposes, in a peaceful linguistic co-habitation known as 'diglossia':


  • Jordan Brown speaking Yiddish, a language he is studying. Several of the Record-a-thon participants made recordings in languages they are learning or studying. Mr. Brown, a linguistics student and Rosetta Project summer intern, made recordings in both Yiddish and the unrelated Sri Lankan language Sinhala. Here he reads from the Yiddish translation of "Winnie the Pooh" by A.A. Milne. Yiddish is a Germanic language with about 2 million first language speakers and 11 million second language speakers in Israel, Germany, and worldwide:


During the Record-a-thon there were also several Mightyverse Phrase Farm recording stations set up and running all day, where participants could record vocabulary lists, as well as the Universal Declaration of Human Rights. These video files are more complex, but as soon as the files are processed, we hope to make them available at the Internet Archive as well:


Other highlights of the day included a keynote speech by Dr. Elizabeth Lindsey. Dr. Lindsey is an Explorer at the National Geographic, and she inspired us with stories of her experiences on her current expedition to visit and document traditional knowledge-keepers around the world.


Thanks to all of our participants, to our sponsors The Internet Archive, The Levenger Foundation, and The Long Now Foundation, and to our team of dedicated Rosetta Project interns and volunteers, without all of whom this event would not have been possible.

We heart human languages - all of them!


Posted 3 years, 10 months ago by Summer Dougherty

New Audio Collection - Alan Lomax Parlametrics

The Rosetta Project's newest addition to its online database is a set of language recordings assembled by the famous ethnomusicologist Alan Lomax. This collection encompasses approximately 600 recordings of dozens of languages from around the world. The recordings were made primarily in the 1960s and 1970s by Alan Lomax and by linguists around the world to serve as raw material in Lomax's Parlametrics project, a "comparative study of conversational style." [1] Recordings include children singing in Puluwatese, family conversations in Telugu, and stories and songs in Woleaian.

Though Lomax made some of the recordings himself, notably many of the ones made in the USSR, Italy and England, the rest were made by linguists around the world who helped Lomax by sending him tapes of their own field recordings. As Lomax had requested, the recordings consist mostly of five-minute snippets of conversation in various languages, along with some storytelling, myths, and songs. Through a collaboration with the Association for Cultural Equity, the recordings were loaned to the Rosetta Project with the stipulation that the recordings be digitized. In 2005, Rosetta intern JD Ross Leahy digitized the vast majority of the recordings, approximately 270 reel-to-reel and cassette tapes, and the originals were sent to the Library of Congress for long-term archiving. In 2011, Rosetta intern Summer Dougherty transcribed notes and inventoried, organized, and prepared the digital material for upload, and in July 2011 the recordings were uploaded to the Internet Archive.

Ethnomusicologist and activist Alan Lomax is famous for his recordings of blues legends including Lead Belly, jazz musicians including Jelly Roll Morton and folk singers including Woody Guthrie. [2]

As a teenager, Lomax started helping his father, folklorist and musicologist John Lomax, collect folk songs. Lomax and his father partnered with the Library of Congress and by 1930, when Alan was 15, they had already contributed over 3,000 recordings to the library's collection. [3] Lomax's role as a microphone for underappreciated and marginalized folk singers brought folk music back to the attention of the public and spurred the folk revival in America, inspiring a new generation of artists, including Bob Dylan. Even British music was affected by Lomax: the Rolling Stones take their name from one of Muddy Waters' songs. [4] More recently, Lomax's recording of James Carter and other prisoners singing "Po' Lazarus" was used in the film "O Brother, Where Art Thou?". Other songs have been featured in the film “Gangs of New York” and on Moby’s album “Play”. [5]

Lomax felt that folk music is a vital expression of culture, and culture was very important to him. He believed in what he called "cultural equity", "the idea that the expressive traditions of all local and ethnic cultures should be equally valued as representative of the multiple forms of human adaptation on earth." [6] In fact, "his desire to document, preserve, recognize, and foster the distinctive voices of oral tradition led him to establish the Association for Cultural Equity (ACE), based in New York City and now directed by his daughter, Anna Lomax Wood." [7] "After 1960 he devoted himself to comparative research on world music and dance with collaborators from musicology, anthropology, dance, and linguistics." [8] These projects included his study of song, Cantometrics; of dance, Choreometrics; and of speech, Parlametrics.


[1] Parlametrics (Association for Cultural Equity)

[2], [6], [8] Alan Lomax (Association for Cultural Equity)

[3] Alan Lomax, Who Raised Voice Of Folk Music in U.S., Dies at 87 (New York Times)

[4], [5] The Man who Recorded the World (Folkradio)

[7] The American Folklife Center: Alan Lomax Collection (The Library of Congress)


Posted 3 years, 11 months ago by

RECORD-A-THON - This Saturday July 30!

Join us for the Record-a-thon this Saturday July 30 at the Internet Archive and help document and promote the languages used in your own community! We need your help to meet our goal of recording 50 languages in a single day! How many languages can you help us document? Bring yourself and your multilingual friends and be the stars of your own grassroots language documentation project!

Keynote Speaker: Dr. Elizabeth Lindsey, National Geographic


Updated Schedule of Events!

Plan to attend in-person or remotely?

RSVP here through EventBrite!

(Tickets are free - your RSVP will allow us to prepare for numbers to expect and what equipment is going to be present, whether you intend to come in person or if you’re participating remotely.)



Posted 4 years ago by Laura Welcher



Help us record 50 languages in a single day!

Save the date! Saturday July 30, 02011 from 9 am to 6 pm
The Internet Archive
at 300 Funston Avenue, San Francisco

Did you know...

There is something you can do to help document and promote the languages used in your own community! We need your help to meet our goal of recording 50 languages in a single day! How many languages can you help us document? Bring yourself and your multilingual friends and be the stars of your own grassroots language documentation project!

Professional linguists and videographers will be on site to document you and your friends speaking word lists, reading texts, and telling stories. You can also document your language using tools you probably have in your purse or back pocket — a mobile phone, digital camera, or laptop — just bring your device and our team will guide you through the documentation process.

How do your words and stories make a difference? An important part of language documentation is building a corpus — creating collections of vocabulary words, as well as conversations and stories that demonstrate language in use. From a corpus, linguists and speech technologists can build grammars, dictionaries, and tools that enable a language to be used online. The bigger the corpus, the better the tools!

The recordings you make during the event will be added to The Rosetta Project's open collection of all human language in The Internet Archive. And, you can compete for cool prizes, including an iPad 2 for the participant who records and uploads the most languages during the event!

Please RSVP below and let us know if you plan to attend, and what language or languages you are thinking of recording. Can't make it to the Record-a-thon? Join us online the day of the event for the virtual Record-a-thon, where you'll be able to interact with event staff, monitor event progress, listen live to lectures and talks, and submit your own recordings remotely.

We will be in touch soon with more information about the day's events, and how you can participate! For questions or more information please contact

