Posted 2 years, 11 months ago by Laura Welcher
While globalization is usually considered a primary factor in language endangerment, global economies also provide access to inexpensive communication technologies like the internet and mobile devices - and these technologies are increasingly enlisted as tools to increase the use of endangered languages, as recently reported by BBC News.
Many endangered language speech communities are gravitating towards Twitter, as well as social media services like Facebook, to promote language use and language learning. For children especially, the ability to use their heritage language with these ubiquitous social media sites provides an essential "coolness" factor, giving their languages relevance and an important new domain of use in the modern world.
Those who use smaller languages on public sites like blogs, or Twitter, are creating an additional resource that they are probably unaware of: the language that they craft and post helps build a text corpus for their language that can pave the way for better tools to enhance that language's use online.
Dr. Kevin Scannell, a computer scientist, mathematician and endangered language speaker, has created a multilingual web crawler called An Crúbadán (which literally means "crawler" in Irish). The crawler identifies 3-character sequences and computes their probabilities, which together provide a unique "fingerprint" for any given written language. A recent trawl netted quite a catch: over 1,000 different languages being used online.
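To make the "fingerprint" idea concrete, here is a minimal Python sketch of language identification from 3-character sequences. It is not An Crúbadán's actual code: the function names, the two tiny training sentences, and the choice of cosine similarity for comparing fingerprints are all assumptions made for illustration. A real system would train on far larger corpora.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Return the relative frequency of each overlapping 3-character sequence."""
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "  # pad so word boundaries at the edges count too
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def cosine_similarity(p, q):
    """Compare two profiles; 1.0 means identical direction, 0.0 means no overlap."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def identify(text, profiles):
    """Return the language whose profile is closest to the sample text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(sample, profiles[lang]))


# Tiny illustrative training samples (assumed, not from any real corpus).
TRAINING = {
    "English": "the house is on the hill and the children are playing in the garden",
    "Irish": "tá an teach ar an gcnoc agus tá na páistí ag súgradh sa ghairdín",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in TRAINING.items()}

print(identify("the children are in the house", PROFILES))
print(identify("tá na páistí sa teach", PROFILES))
```

Even with one sentence per language, sequences like "the" or "tá " are distinctive enough to separate the two. More text sharpens the fingerprint, which is why every public post in a small language helps.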
According to Scannell, the identified ever-growing corpora provide a means "to develop basic resources that help people use their language online: keyboard input methods, spell checkers, online dictionaries, and so on." In his other research, he has developed crawlers that explicitly capture endangered language Tweets (Indigenous Tweets) as well as blogs (Indigenous Blogs) which he says "aim to strengthen languages through social media."
Posted 2 years, 12 months ago by Alex Mensing
The September 02011 issue of the journal Language included an article entitled “A Cross-Language Perspective on Speech Information Rate,” by a team of linguists working with the University of Lyon and the French National Center for Scientific Research. Like many linguistic studies, this one investigates the parameters of human language and seeks to identify commonalities that hold true across languages. Given, however, that universal grammatical rules have proven more difficult to define than linguists might have hoped, this study was designed to test the universality of a different factor - time.
The authors hypothesize that “a trade-off is operating between a syllable-based average information density and the rate of transmission of syllables in human communication.” Basically, a language that is spoken quickly - in terms of syllables per second - uses more syllables than one that is spoken slowly in order to say the same thing.
To test their hypothesis they analyzed audio recordings of native speakers of seven different languages reading brief texts written in various styles. There were twenty texts, each composed in English and translated into the other languages. The authors compared the number of syllables that each language used in a given text, as well as the amount of time taken by speakers of different languages to actually say the entire texts. And indeed, they found that for the most part the languages whose texts used more syllables were spoken faster, and vice versa, resulting in equivalent rates of information output. These are two complementary strategies for encoding and transmitting ideas.
"One has to consider the [...] loose hypothesis that [the information rate of the language] varies within a range of values that guarantee efficient communication, fast enough to convey useful information and slow enough to limit the communication cost (in its articulatory, perceptual, and cognitive dimensions)."
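The trade-off can be sketched with a little arithmetic: information rate is information per syllable multiplied by syllables per second. The Python below is only an illustration. The function names are made up, and the syllable counts and speech rates are hypothetical numbers chosen to make the arithmetic clean, not figures from the study.

```python
def speaking_time(syllables, syllables_per_second):
    """Seconds needed to read a text aloud at a given syllable rate."""
    return syllables / syllables_per_second


def information_rate(syllables, syllables_per_second, information_units=1.0):
    """Information conveyed per second.

    The text's content is treated as a fixed amount (information_units),
    so density is that amount divided by the syllables used to express it.
    """
    density = information_units / syllables  # information per syllable
    return density * syllables_per_second


# Hypothetical languages reading the same text (values assumed for illustration):
# A spends more syllables but speaks faster; B is denser but slower.
lang_a = {"syllables": 100, "syllables_per_second": 8.0}
lang_b = {"syllables": 75, "syllables_per_second": 6.0}

for name, lang in (("A", lang_a), ("B", lang_b)):
    print(name, speaking_time(**lang), information_rate(**lang))
```

In this made-up case both languages take 12.5 seconds to say the same thing, so their information rates match even though their syllable counts and speeds differ.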
The premise here is that each translation of each of the texts communicates all of the information communicated in each other translation, adding and subtracting nothing, simply encoding the information according to the rules of each language. It would be interesting to discuss this premise with respect to the notion of linguistic relativity, which argues that your native language actually influences the way you perceive reality. It could also be examined in terms of issues such as evidentials or honorifics, which can require that certain information - and therefore more syllables - be included in a statement which in another language might be superfluous. Further research might also be able to analyze across more language families. The seven languages used in this study were English, German, French, Spanish, Italian, Mandarin Chinese and Japanese.
The authors also noted that the syllables themselves in quickly spoken languages are on average less complex, in that they are composed of fewer sounds (e.g. ‘law’ vs. ‘claw’ - both one syllable, but with different numbers of phonemes). As an initial investigation into the speed with which people communicate through speech, this is a fascinating study.
It seems that humans may be naturally and universally self-regulating when it comes to communicating through speech. There is a balance that cannot be disturbed: fast syllables are not allowed to carry too much meaning, and syllables with lots of information must be spoken slowly.
Posted 3 years ago by Laura Welcher
PanLex, the newest project under the umbrella of The Long Now Foundation, has an ambitious plan: to create a database of all the words of all of the world's languages. The plan is not merely to collect and store them, but to link them together so that any word in any language can be translated into a word with the same sense in any other language. Think of it as a multilingual translating dictionary on steroids.
You may wonder how this is different from some of the other popular translation tools out there. The more ambitious tools, such as Babelfish and Google Translate, try to translate sentences, while the more modest tools, such as Global Glossary, Glosbe, and Logos, limit their scope to individual words. PanLex belongs to the second, words-only, group, but is far more inclusive. While Google Translate covers 64 languages and Logos almost 200 languages, PanLex is edging close to 7,000 languages. With the knowledge stored in PanLex, translations can be produced extending beyond those found in any dictionary.
Here’s an example to give the basic idea of how it works. Say you want to translate the Swahili word ‘nyumba’ (house) into Kyrgyz (a Central Asian language with about 3 million speakers). You’re unlikely to find a Swahili–Kyrgyz dictionary; if you look up ‘nyumba’ in PanLex you’ll find that even among its half a billion direct (attested) translations there isn’t any from this Swahili word into Kyrgyz. So you ask PanLex for indirect translations. PanLex reveals translations of ‘nyumba’ that, in turn, have four different Kyrgyz translations. Three of these (‘башкы уяча’, ‘үй барак’, and ‘байт’) each have only one or two links to ‘nyumba’. But a fourth Kyrgyz word, ‘үй’, is linked to ‘nyumba’ by 45 different intermediary translations. You look them over and conclude that ‘үй’ is the most credible answer.
How confident can you be of your inferred translation—that Swahili ‘nyumba’ can be translated into Kyrgyz ‘үй’? After all, anyone who has played the game of “translation telephone” (where you start with Language A, translate into Language B, go from there to Language C and then translate back to Language A) will know this kind of circular translation can result in hilarious mismatches. But PanLex is designed to overcome “semantic drift” by allowing multiple intermediary languages. Paths from ‘nyumba’ to ‘үй’, for example, run through diverse languages from Azerbaijani to Vietnamese. Based on such multiple translation paths, translation engines can provide ranked “best fit” translations. As the database grows, especially in its coverage of “long tail” languages, possible translation paths will multiply, boosting reliability.
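The path-counting idea can be sketched in a few lines of Python. This is a toy illustration rather than PanLex's actual engine or data model: the function names are invented, and the handful of translation links below are made up to show the mechanics (the real database holds hundreds of millions of attested links).

```python
from collections import Counter, defaultdict

# Toy direct translations as ((language, word), (language, word)) pairs.
# These links are invented for illustration, not taken from PanLex.
DIRECT = [
    (("Swahili", "nyumba"), ("English", "house")),
    (("Swahili", "nyumba"), ("English", "home")),
    (("Swahili", "nyumba"), ("French", "maison")),
    (("Swahili", "nyumba"), ("Spanish", "casa")),
    (("English", "house"), ("Kyrgyz", "үй")),
    (("French", "maison"), ("Kyrgyz", "үй")),
    (("Spanish", "casa"), ("Kyrgyz", "үй")),
    (("English", "home"), ("Kyrgyz", "байт")),
]


def build_graph(pairs):
    """Treat each attested translation as an undirected link between expressions."""
    graph = defaultdict(set)
    for left, right in pairs:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def indirect_translations(graph, source, target_language):
    """Rank target-language words by how many intermediaries link them to source."""
    support = Counter()
    for intermediary in graph.get(source, ()):
        if intermediary[0] == target_language:
            continue  # that would be a direct translation, not an inferred one
        for candidate in graph.get(intermediary, ()):
            if candidate[0] == target_language:
                support[candidate[1]] += 1
    return support.most_common()


graph = build_graph(DIRECT)
for word, paths in indirect_translations(graph, ("Swahili", "nyumba"), "Kyrgyz"):
    print(word, paths)
```

A candidate reached through many independent intermediaries is less likely to be an artifact of one ambiguous word, which is the intuition behind preferring ‘үй’ in the example above.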
There are a couple of demonstrations that you can try with a browser. This will give you a sense of the magnitude of the data and the potential power of the database as a tool. One of these is TeraDict. If you enter a common English word like ‘house’ or ‘love’ you are likely to get translations into hundreds, or even thousands, of languages, and in some cases many translations per language. French, for example, has 25 translations for ‘house’ and 55 translations of ‘love’, including ‘zéro’ (hint: Think tennis!). Two similar interfaces allow you to explore the database in either Esperanto—InterVorto—or Turkish—TümSöz.
The second web tool, PanLem, is considerably more complicated and is used mostly by PanLex developers to enlarge and evaluate the database. But it’s publicly accessible. There is a step-by-step "cheat sheet" to help you climb the learning curve.
PanLex is an ongoing research project, with most of its growth yet to come, but the database already documents 17 million expressions and 500 million direct translations, from which billions of additional translations can be inferred.
PanLex is being built using data from about 3,600 bilingual and multilingual dictionaries, most in electronic form. The process of ingesting data into the database involves substantial curation and standardization by PanLex editors to ensure data quality. The next stage of collection will likely involve dictionaries that exist only in print form. It is hard to say how many are out there, but we expect it is on the order of tens of thousands. It is likely that most of these have not been scanned or digitized. Once they are, there will be a significant effort to improve the optical character recognition (OCR) for these materials—an effort which is likely to be highly informative to the development of OCR technology, since it will involve the human identification of many forms of many different scripts for languages around the world.
PanLex is working closely with the Rosetta Project. PanLex is a wonderful realization of the Rosetta Project’s original goal in building a massive, and massively parallel, lexical collection for all of the world’s languages.
Posted 3 years, 4 months ago by Austin Brown
The Rosetta Project at The Long Now Foundation is working to build an open public digital collection of all human language as well as an analog backup that can last for thousands of years–The Rosetta Disk. In the “long now,” the goal is long-term storage and access to information–on the scale that both supports and transcends individual human societies and civilizations. In the “here and now,” the project serves to support and amplify the importance of the world’s nearly 7,000 human languages, the vast majority of which are endangered and, if current trends continue, likely to go extinct in the next 100 years. I’ll present our current work on the Rosetta Project Collection and Disk as well as some new initiatives including the “Language Commons” where we are working to help build the multilingual Web. There will be a reception afterwards; come say Hello.
Posted 3 years, 4 months ago by Austin Brown
With thousands of languages and writing systems used all over the world, making computers and the web widely accessible has taken a herculean effort, with much yet to be done.
One of the main tools used in the expansion of the web’s global reach is Unicode - a database of over 109,000 characters from 93 different writing systems and the standards for using and representing them.
Unicode is maintained by The Unicode Consortium, which sponsors a conference each year to share knowledge and discuss the future of Unicode.
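As a small illustration of what the standard provides, the Python sketch below looks up the code point, official character name, and UTF-8 byte encoding for a character using Python's built-in `unicodedata` module. The helper name `describe` is made up for this example, and the Kyrgyz word is just a convenient sample.

```python
import unicodedata


def describe(ch):
    """Summarize how Unicode identifies and encodes a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",           # the character's unique number
        "name": unicodedata.name(ch, "<unnamed>"),  # its official Unicode name
        "utf8": ch.encode("utf-8"),                 # the bytes sent over the wire
    }


# Each letter of the Kyrgyz word for "house" has its own code point and name.
for ch in "үй":
    print(describe(ch))
```

Because every system that follows the standard agrees on these numbers and encodings, text typed in one language on one device can be displayed correctly on another.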
This year the Internationalization and Unicode Conference will be held October 17th - 19th in Santa Clara, CA.
Long Now’s Dr. Laura Welcher will be delivering a keynote presentation on Tuesday, October 18th, on her work on The Rosetta Project, a publicly accessible digital library of human languages, and The Language Commons:
The Rosetta Project shares the Unicode vision of a world where people can use communication technology on their own terms - in their own language. According to World Internet Statistics, over 80% of all web communication is in about ten languages, with over half in either English or Chinese. The remaining 20% represent "everyone else" including about 400 languages with speaker populations above 1 million, which collectively comprise about 95% of everyone on earth. Because of essential technologies like Unicode, we are poised to see this breadth of human languages flourish online and on mobile devices, providing for these languages a critical new domain of language use in the modern world. I will present several efforts underway at The Rosetta Project including the "Language Commons" that rely on Unicode as an essential technology in building the multilingual Web.
Posted 3 years, 6 months ago by Laura Welcher
On July 30, 02011 The Rosetta Project partnered with Mightyverse.com to hold the first human language Record-a-thon at the Internet Archive. This is an event we developed to test the idea that with a few basic guidelines, anyone can use common video devices to help document human language.
The idea is that by creating a 5-10 minute unedited video, and providing basic information about it - essentially just saying what language you think it is in - and then uploading it to the Rosetta Project collection in the Internet Archive, you are helping build a corpus of valuable data for that language. You don't need to be a specialist, and by archiving it you create a resource that others can build on, for many different useful purposes - from language learning and teaching, to linguistic analysis, to building the tools that enable a language to be used with modern technology.
This introductory talk by Dr. Laura Welcher, made the morning of the event, describes the ideas behind the creation of the Record-a-thon:
In the course of a single day, in-person and remote participants together created about 85 videos in 34 different languages. There were speakers of all ages, native and non-native, some quite fluent while others were learners practicing their skills. All the videos they created are interesting to watch and are available in the Rosetta Project video collection. They recorded conversations, told stories, histories, and jokes, recited poems, and sang lullabies.
During the Record-a-thon there were also several Mightyverse Phrase Farm recording stations set up and running all day, where participants could record vocabulary lists, as well as the Universal Declaration of Human Rights. These video files are more complex, but as soon as the files are processed, we hope to make them available at the Internet Archive as well.
Other highlights of the day included a keynote by Dr. Elizabeth Lindsey. Dr. Lindsey is an Explorer at National Geographic, and she inspired us with stories of her experiences on her current expedition to visit and document traditional knowledge-keepers around the world.
Thanks to all of our participants, and to our sponsors The Internet Archive, The Levenger Foundation and Levenger.com, The Long Now Foundation, and to our team of dedicated Rosetta Project Interns and volunteers, without all of whom this event would not have been possible.
We heart human languages - all of them!
Posted 3 years, 6 months ago by Summer Dougherty
The Rosetta Project's newest addition to its online database is a set of language recordings assembled by the famous ethnomusicologist Alan Lomax. This collection encompasses approximately 600 recordings of dozens of languages from around the world. The recordings were made primarily in the 1960s and 1970s by Alan Lomax and by linguists around the world to serve as raw material in Lomax's Parlametrics project, a "comparative study of conversational style." Recordings include children singing in Puluwatese, family conversations in Telugu, and stories and songs in Woleaian.
Though Lomax made some of the recordings himself, notably many of the ones made in the USSR, Italy and England, the rest were made by linguists around the world who helped Lomax by sending him tapes of their own field recordings. As Lomax had requested, the recordings consist mostly of five-minute snippets of conversation in various languages, along with some storytelling, myths, and songs. Through a collaboration with the Association for Cultural Equity, the recordings were loaned to the Rosetta Project with the stipulation that the recordings be digitized. In 2005, Rosetta intern JD Ross Leahy digitized the vast majority of the recordings, approximately 270 reel-to-reel and cassette tapes, and the originals were sent to the Library of Congress for long-term archiving. In 2011, Rosetta intern Summer Dougherty transcribed notes and inventoried, organized, and prepared the digital material for upload, and in July 2011 the recordings were uploaded to the Internet Archive.
Ethnomusicologist and activist Alan Lomax is famous for his recordings of blues legends including Lead Belly, jazz musicians including Jelly Roll Morton and folk singers including Woody Guthrie. 
As a teenager, Lomax started helping his father, folklorist and musicologist John Lomax, collect folk songs. Lomax and his father partnered with the Library of Congress and by 1930, when Alan was 15, they had already contributed over 3,000 recordings to the library's collection. Lomax's role as a microphone for underappreciated and marginalized folk singers brought folk music back to the attention of the public and spurred the folk revival in America, inspiring a new generation of artists, including Bob Dylan. Even British music was affected by Lomax: the Rolling Stones take their name from one of Muddy Waters' songs. More recently, Lomax's recording of James Carter and other prisoners singing "Po' Lazarus" was used in the film "O Brother, Where Art Thou?". Other recordings have been featured in the film "Gangs of New York" and on Moby's album "Play".
Lomax felt that folk music is a vital expression of culture, and culture was very important to him. He believed in what he called "cultural equity", "the idea that the expressive traditions of all local and ethnic cultures should be equally valued as representative of the multiple forms of human adaptation on earth." In fact, "his desire to document, preserve, recognize, and foster the distinctive voices of oral tradition led him to establish the Association for Cultural Equity (ACE), based in New York City and now directed by his daughter, Anna Lomax Wood." "After 1960 he devoted himself to comparative research on world music and dance with collaborators from musicology, anthropology, dance, and linguistics." These projects included his study of song, Cantometrics; of dance, Choreometrics; and of speech, Parlametrics.
Parlametrics (Association for Cultural Equity)
Alan Lomax (Association for Cultural Equity)
Alan Lomax, Who Raised Voice Of Folk Music in U.S., Dies at 87 (New York Times)
The Man who Recorded the World (Folkradio)
The American Folklife Center: Alan Lomax Collection (The Library of Congress)
Posted 3 years, 8 months ago by Laura Welcher
Did you know...
There is something you can do to help document and promote the languages used in your own community! We need your help to meet our goal of recording 50 languages in a single day! How many languages can you help us document? Bring yourself and your multilingual friends and be the stars of your own grassroots language documentation project!
Professional linguists and videographers will be on site to document you and your friends speaking word lists, reading texts, and telling stories. You can also document your language using tools you probably have in your purse or back pocket — a mobile phone, digital camera, or laptop — just bring your device and our team will guide you through the documentation process.
How do your words and stories make a difference? An important part of language documentation is building a corpus — creating collections of vocabulary words, as well as conversations and stories that demonstrate language in use. From a corpus, linguists and speech technologists can build grammars, dictionaries, and tools that enable a language to be used online. The bigger the corpus, the better the tools!
The recordings you make during the event will be added to The Rosetta Project's open collection of all human language in The Internet Archive. And, you can compete for cool prizes, including an iPad 2 for the participant who records and uploads the most languages during the event!
We will be in touch soon with more information about the day's events, and how you can participate! For questions or more information please contact email@example.com.