There are about 7000 human languages spoken in the world today [Ethnologue 15], only about 20-30 of which have a significant digital presence (English, Spanish, Mandarin, German, French and Japanese, for example) [Maxwell & Hughes 2006]. These and the next 270-280 most widely spoken languages account for over 90% of the world's speakers; the other 10% or so of the population speaks one of the remaining 6700 minority languages. The 300 Languages Project recognizes the extreme challenge of representing these minority languages and seeks to develop an extensible protocol and a set of scalable, low-cost (i.e., volunteer-based) methods and standards for language documentation by building a "seed corpus": a corpus that starts small but is designed to grow.
Once The 300 Languages Project is complete, not only will the seed for a universal corpus have been planted, but new types of multilingual research and technology development will be possible. To date, no public-domain multilingual parallel text/audio corpus of this size exists; research and development that depends on parallel text and audio content across many languages (such as speech recognition and automated translation) has thus been limited to the smaller sets of languages for which such corpora do exist.
The 300 Languages Project is collecting translations and recordings of three important texts: the Swadesh list, Genesis, and the Universal Declaration of Human Rights.
These texts were chosen primarily for (1) usefulness in linguistic research (for example, the Swadesh list contains words of special significance to historical and comparative linguists), and (2) breadth of existing translation (Genesis and the Universal Declaration of Human Rights are among the most widely translated documents in the world). Collecting the same content in many languages not only facilitates translation and other deductive processes (like the decoding of Egyptian hieroglyphs made possible by the ancient Rosetta Stone), but also provides a valuable resource for researchers in linguistics and language technology alike.
The 300 Languages Project is made possible through the support and sponsorship of Distinguished Career Professor and speech technology expert Dr. James K. Baker and is conducted in partnership with the ALLOW initiative of the Center for Innovations in Speech and Language at the Language Technologies Institute.
The Rosetta Disk