The Rosetta Blog > Most Recent Posts

Posted 1 month, 1 week ago by Adrienne Mamin

New computer program deciphers a long-extinct language

Egyptian Hieroglyphs on The Rosetta Stone were deciphered by scholars, but a new computer program written at MIT could potentially accomplish the same feat today:

“‘Traditionally, decipherment has been viewed as a sort of scholarly detective game, and computers weren't thought to be of much use,’ study co-author and MIT computer science professor Regina Barzilay said in an email” (quoted in this recent writeup in the National Geographic Daily News).

The language in this case is Ugaritic, written in cuneiform and last used in Syria more than three thousand years ago. Archaeologists discovered Ugaritic texts in 1928, but linguists didn’t finish deciphering them for another four years. The new computer program did it in a couple of hours.

Ugaritic cuneiform

While an exciting and significant first step, the program is not a silver bullet solution to language decipherment. Human beings figured out Ugaritic long before the computer program came along, and it remains to be seen how well the program works with a never-before-deciphered language. Furthermore, the program relied on comparisons between Ugaritic and a known and closely related language, Hebrew. There are some languages with no known close relatives, and in those cases, the computer program would be at a loss.

Of course, we can’t be certain exactly how the technology may progress in the future. But with the Rosetta Disk designed to last for thousands of years, and with hundreds of languages classified in the Ethnologue as nearly extinct, an automated decoder of language documentation seems likely to prove useful eventually. It’s nice to know we’ve made a promising start.

Posted 1 month, 1 week ago by Laine Stranahan

Building an Audio Collection for All the World's Languages

The Rosetta Project is pleased to announce the Parallel Speech Corpus Project, a year-long volunteer-based effort to collect parallel recordings in languages representing at least 95% of the world's speakers. The resulting corpus will include audio recordings of the same set of texts in hundreds of languages, each accompanied by a transcription. This will provide a platform for creating new educational and preservation-oriented tools, as well as technologies that may one day allow artificial systems to comprehend, translate, and generate these languages.

Huge text and speech corpora of varying degrees of structure already exist for many of the most widely spoken languages in the world---English is probably the most extensively documented, followed by other majority languages like Russian, Spanish, and Portuguese. Given some degree of access to these corpora (though many are not publicly accessible), research, education and preservation efforts in the ten languages which represent 50% of the world's speakers (Mandarin, Spanish, English, Hindi, Urdu, Arabic, Bengali, Portuguese, Russian and Japanese) can be relatively well-resourced.

But what about the other half of the world? The next 290 most widely spoken languages account for another 45% of the population, and the remaining 6,500 or so are spoken by only 5%--this latter group representing the "long tail" of human languages:

Long_Tail_of_Languages.jpg
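The arithmetic behind the long tail can be checked with a few lines of Python. This is only a rough sketch: the counts and percentages are the ones quoted above, while the tier labels and the function name are our own illustrative choices.

```python
# Figures quoted in the post; the tier labels are our own shorthand.
TIERS = [
    ("top 10 languages", 10, 50),
    ("next 290 languages", 290, 45),
    ("remaining ~6,500 languages", 6500, 5),
]

def cumulative_coverage(tiers):
    """Return (label, languages so far, percent of speakers so far) per tier."""
    rows = []
    total_langs = 0
    total_pct = 0
    for label, count, pct in tiers:
        total_langs += count
        total_pct += pct
        rows.append((label, total_langs, total_pct))
    return rows

for label, langs, pct in cumulative_coverage(TIERS):
    print(f"through the {label}: {langs} languages, {pct}% of speakers")
```

Read this way, the first 300 languages reach 95% of speakers, which is the coverage the Parallel Speech Corpus Project is aiming for; the remaining 5% is spread across roughly 6,500 languages.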

Equal documentation of all the world's languages is an enormous challenge, especially in light of the tremendous quantity and diversity represented by the long tail. The Parallel Speech Corpus Project will take a first step toward universal documentation of all human languages, with the goal of providing documentation of the top 300 and providing a model that can then be extended out to the long tail. Eventually, researchers, educators and engineers alike should have access to every living human language, creating new opportunities for expanding knowledge and technology alike and helping to preserve our threatened diversity.

This project is made possible through the support and sponsorship of speech technology expert James Baker and will be developed in partnership with his ALLOW initiative. We will be putting out a call for volunteers soon. In the meantime, please contact rosetta@longnow.org with questions or suggestions.

Posted 1 month, 1 week ago by Adrienne Mamin

Rosetta Spotlight: Yurok - a critically endangered language

Yurok_numerals.png

Description of Yurok numerals in the Rosetta archive

Yurok (YUR) is the language of the Yurok people of northwestern California. As with most indigenous American languages, European contact has led English to largely replace Yurok, so that as of 2009 it is near extinction. Yurok belongs to the Algic language family, most of whose other members (the Algonquian languages) are geographically distant from Yurok. Accordingly, Yurok is surrounded by languages unrelated to it, except for the only distantly related (and extinct) Wiyot.

Yurok has a set of glottalized consonants (sounds produced with the glottis closed, as if holding your breath) that contrast with their nonglottalized counterparts. The glottalized sounds are less common but are important in Yurok morphology, such as verb conjugations.

Some verbs must inflect (be conjugated) for person and number, others cannot, and many can go either way. For example, the word for eating must take different endings according to the subject: nepek’ for ‘I eat,’ nepe’m for ‘you (singular) eat,’ nep’ for ‘s/he eats,’ nepoh for ‘we eat,’ nepu’ for ‘you (plural) eat,’ and nepehl for ‘they eat.’ On the other hand, chek ‘sit’ always maintains the same form no matter who and how many are sitting. Finally, skewok ‘want’ can remain skewok for all subjects, or it can inflect as skewoksimek’ ‘I want,’ skewoksime’m ‘you (singular) want,’ skewoksi’m ‘s/he wants,’ etc., just as the verb ‘eat’ does.
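For readers who think in data structures, the three verb classes can be laid out as a small lookup table. This is a toy sketch, not a grammar of Yurok: the forms are copied from the paragraph above, while the class labels ("always", "never", "optional"), the person codes, and the function name are hypothetical conveniences.

```python
# Forms are copied from the post; class labels and names are hypothetical.
PERSONS = ("1sg", "2sg", "3sg", "1pl", "2pl", "3pl")

LEXICON = {
    "eat": {
        "inflects": "always",
        "forms": dict(zip(PERSONS, ("nepek’", "nepe’m", "nep’",
                                    "nepoh", "nepu’", "nepehl"))),
    },
    "sit": {"inflects": "never", "base": "chek"},
    "want": {
        "inflects": "optional",
        "base": "skewok",
        # Only the three forms given in the post are listed here.
        "forms": {"1sg": "skewoksimek’", "2sg": "skewoksime’m",
                  "3sg": "skewoksi’m"},
    },
}

def verb_form(gloss, person, inflect=True):
    """Look up the form of a verb for a given subject."""
    entry = LEXICON[gloss]
    if entry["inflects"] == "never":
        return entry["base"]
    if entry["inflects"] == "optional" and not inflect:
        return entry["base"]
    # Raises KeyError for forms the post does not list (e.g. plural 'want').
    return entry["forms"][person]

print(verb_form("eat", "1pl"))                  # inflected form for 'we eat'
print(verb_form("sit", "3pl"))                  # same form for every subject
print(verb_form("want", "1sg", inflect=False))  # uninflected option
```

The lookup deliberately fails for any form the post does not give, rather than guessing at the rest of the paradigm.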

Yurok has no distinct category of adjectives; the words that translate to adjectives or express adjective-like meanings behave like verbs in terms of word order and inflection. For example, there is a word for being big that inflects just as verbs do: peloyek’ ‘I am big,’ peloye’m ‘you are big,’ pelo’y ‘s/he is big,’ etc. Numerals are also a type of verb, and they have different forms according to the type or shape of thing being enumerated (for example, humans versus animals, or flat things versus tufted things).

Ways of writing Yurok have varied over time and remain not entirely settled. In the 1980s the Yurok Language Committee adopted UNIFON, designed (by an economist) as an English pronunciation key. However, UNIFON was impractical and therefore unpopular, and the Yurok Language Committee adopted an alternative system, which was later revised by linguists working on the language (as Leanne Hinton details in her unpublished 2010 article "Orthography Wars"). The Berkeley Yurok Language Project, a searchable collection of Yurok stories, words, and morphemes, lists entries in both the original alternative system and the revised system.

UNIFON alphabet

Posted 2 months, 3 weeks ago by Sarina Spector

Rosetta Spotlight: Lakota – an oral language enters the digital age

Lakota phoneme chart

Lakota [1], the language of the Lakota tribe of the Great Plains, is fading before its speakers' eyes. Although Lakota is one of the most robust Native American languages today, its speaker population has fallen far since its peak in pre-colonial times and continues to dwindle. This reflects the experience of many native tribes, and is largely a result of US government policies concerning these peoples. Lakota speakers (the Ethnologue puts their number around 6,300) are left in danger of losing not only their language but the vital cultural information it holds.

Lakota, like most of the world's languages, was not originally written, and much of the long tradition and history of the Lakota exists only orally in their stories and ceremonies. The Lakota people did, however, keep detailed historical records, as can be seen in the "Lakota Winter Counts," now archived online on the website of the Smithsonian National Anthropological Archives. These are pictographic calendars detailing important historical events in the lives of the Lakota.

An alphabetic writing system for the Lakota language in use for the past four decades has now been widely adopted by Lakota speakers. And, in a modern effort to revitalize the Lakota language, the Lakota Language Consortium has compiled textbooks from introductory to college level and an expansive online forum to assist children and adults in learning and thereby preserving the language.

They have also compiled a 20,000-word dictionary of Lakota, including wonderfully complex words like "woímnayankel," which expresses the humbled yet connected feeling one experiences when witnessing something particularly majestic in nature, such as the aurora borealis. Lakota words are often this complex, efficiently expressing ideas that would take a sentence or two in English. Efforts like those of the Lakota Language Consortium allow the Lakota language to not only survive but flourish, giving future generations the chance to embody and spread the culture of their ancestors.

The Rosetta Project's collection on the Internet Archive has records of the Lakota language in the form of three text excerpts: a description of where Lakota was historically spoken; a phonology, which uses a chart to characterize phonemes by linguistic traits; and an orthography, or explanation of the Lakota writing system.

[1] The Lakota were historically known as the Sioux, but this is an exonym from their Algonquian neighbors to the east, and the term is deprecated today.

Posted 3 months, 1 week ago by Laura Welcher

The Rosetta Project at Maker Faire 02010

Rosetta Project linguists and archivists traveled to Maker Faire this past weekend to demo the Rosetta Disk for a crowd of nearly 80,000 people. We brought the first and second prototypes of the Rosetta Disk, and set up a microscope with a camera to view Disk pages up close. We also had a "Digitization Station" where Maker Faire attendees could watch and participate in the collection of language documentation for the disk.

Would you like to help translate the subtitles of this video? You can do so at dotSUB.

Posted 3 months, 3 weeks ago by Sarina Spector

Rosetta Spotlight: Ormuri - a piece of Middle Eastern identity

Ormuri Description in the Rosetta Collection

"Language is identity," Darfur refugee Daowd I. Salih told the New York Times about a week ago. He was being interviewed for an article called "Listening to (and Saving) the World's Languages." As mentioned in this Rosetta Project blog post, the article discusses the amazing variety of spoken languages in New York City, and what residents are doing (or not doing) to preserve their native language.

One of the languages the article touches on is Ormuri, a language of multiple dialects spoken in small regions of Afghanistan and Pakistan. According to the Ethnologue, Ormuri has only about 1,050 speakers. The New York Times article reveals a plan to canvass New York City for speakers of Ormuri in order to learn more about the language and the cultural information it holds.

Languages with small speaker populations are quickly dying out, and the data they contain (whether it be linguistic, historical, or cultural) is important enough to merit a concerted effort at saving them. Ormuri is a perfect example, especially in the political and economic environment of our time (read: the complex tangle that is our current Middle Eastern relations). The Rosetta Project's database in the Internet Archive contains a detailed description of Ormuri, including a history of its speakers: where they came from, who their ancestors are, and how their language has co-evolved with those around it to become what it is today.

In my mind there is nothing that illustrates a culture's unity so much as its language. It allows people to build social relationships, conduct business transactions, and express to fellow humans everything they hold dear. What's more, as any good anthropologist knows, learning the language of a culture is one of the most important steps an outsider can take to gain the trust and respect of its people.

What does this have to do with an obscure Afghan language, or with Darfur refugees? Only this: if we intend to successfully navigate the conflicts of the modern global world, it is absolutely necessary to understand and relate to the people with whom we intend to work. The Middle East in particular, Afghanistan being an illustrative example, is culturally very foreign to the West; its people have lived for centuries in small, autonomous groups that hold to varied, often contradictory beliefs. The fact that so many of these groups have their own language, like Ormuri, is telling of their relative isolation, and gives clues to how they live their lives.

Rosetta's description of Ormuri tells the story of its speakers' interactions through Ormuri's morphology. By studying the languages Ormuri had contact with and how these influenced its words, we can begin to create a web of social and economic interaction that would show the connections and dissociations between groups in the area. For example, Ormuri has many morphological similarities to Pashto, a common language in the region of Waziristan where Ormuri is spoken. Ormuri pronouns are strikingly similar to their Pashto equivalents, and many scattered words share similarities, like "wife," "glitter," and "to sit down." Pashto has also phonetically influenced Ormuri, replacing some traditional Ormuri allophones with similar Pashto ones.

Ormuri has also sustained contact with Persian, which is evident in many morphological changes that mimic the latter: loss of gendered nouns, simplification of plural nouns, and reduction of irregular past participles. Analyzing this data led the author, Georg Morgenstierne, to doubt the previous belief that Ormuri speakers descend from Kurds, and provided evidence for further theoretical investigations.

The very existence of this kind of knowledge is what Rosetta is all about; by preserving minority languages and stressing their importance, we hope to contribute vital insights into the lives of their speakers, insights that can be put to good use in surprising places. After all, you never know who you'll meet on the New York City subway.

[A note of introduction: this is my first post as an intern with the Rosetta Project. I will be working with Rosetta for three months, building the collection in the Internet Archive and continuing to spotlight Rosetta material on this blog.]

Posted 4 months ago by Laura Welcher

"Diaspora Sourcing" the Documentation of Endangered Languages



The New York Times ran an article today about endangered languages spoken within the New York City immigrant population - by some estimates as many as 800 languages are represented:

"In addition to dozens of Native American languages, vulnerable foreign languages that researchers say are spoken in New York include Aramaic, Chaldic and Mandaic from the Semitic family; Bukhari (a Bukharian Jewish language, which has more speakers in Queens than in Uzbekistan or Tajikistan); Chamorro (from the Mariana Islands); Irish Gaelic; Kashubian (from Poland); indigenous Mexican languages; Pennsylvania Dutch; Rhaeto-Romanic (spoken in Switzerland); Romany (from the Balkans); and Yiddish."

The article designates New York City as "the most linguistically diverse city in the world." I don't know if that is in fact true, but it seems likely.

For the United States, the US Census compiles data on language use. Since 1980, the long form of the census has asked several questions on the subject: Does this person speak a language other than English at home? If so, what is this language (fill in the blank)? How well does this person speak English (very well, well, not well, not at all)? Since the long form is distributed to one in ten households, the smaller the group, the less accurate the count tends to be. Still, the numbers give some idea, and it is always interesting to see what languages get listed.

For 2008, the US Census compiled data on the languages spoken at home for cities with 100,000 or more people. The column that is especially interesting for endangered languages is the "other languages" category: that is, languages that are not English or Spanish, not other Indo-European, and not Asian or Pacific. New York City tops this list with 179,000 speakers of other languages (60,000 of whom are dominant in this language). Los Angeles is next with 43,000 speakers of other languages. San Francisco is #32 on the list with 5,700 speakers of other languages.

To make better use of the wealth of linguistic diversity in their own backyard, Daniel Kaufman and colleagues at The City University of New York have started the independent Endangered Language Alliance - "an urban initiative for endangered language research and conservation." Working with local speakers is, in fact, a time-honored tradition among linguistics graduate students and faculty who lack the time or resources to travel. But judging by the numbers of speakers of small languages in large cities, and the rapid loss of small languages around the world, this kind of program is just plain smart. More such programs in urban locales - or perhaps existing programs like StoryCorps - could use "diaspora sourcing" to make a big impact in the documentation and revitalization of endangered languages.

Posted 6 months ago by Laura Welcher

The Global Lives Project - World Premiere Opening Night

Last Friday evening, Long Now joined the Global Lives Project in celebrating their world premiere opening at San Francisco's Yerba Buena Center for the Arts. Through a huge volunteer effort, Global Lives has produced ten films - each 24 hours long - that visually capture the everyday life of ten people around the planet. And on Friday we could view them all, at the same time, in the same room. Ten huge screens hung from the ceiling of the Yerba Buena Forum, and around a thousand people throughout the evening ambled around and under them, listening as voices emerged -- Kai Lu, from Anren, China, speaking to his wife in a village dialect of Sichuan Yi; young Edith Kaphuka from Ngwale Village, Malawi, code-switching with her friends on the playground between Chichewa and Chiyao; James Bullock of San Francisco chatting up the tourists on his cable car in West Coast American English. Some screens showed people working, others playing, some eating, others sleeping -- a glimpse into one human day on planet earth.

Global Lives Opening - Installation in the Forum

A second ongoing installation in the YBCA Room for Big Ideas provides a more intimate viewing space, with ten partitioned rooms and LCD viewing screens. Each room is furnished with seating for one or two, with walls and floors embellished with fabrics, colors and textures evocative of the region of the film. Kiosks and wall graphics give a bit of background about the project, and the ten participants. And while the installation as a whole gives the sense of a finished, polished project, three computers set up prominently in the room tell a different - and quite wonderful - story.

Global Lives Project - Installation in YBCA Room for Big Ideas

This is not a finished project - in fact, it is very much a work in progress. One of the greatest ongoing efforts is one that anyone can help with - the subtitling of each film in as many languages as possible (through the crowdsourced subtitling site dotSUB). The first pass was getting all ten films subtitled in English for the opening night, and that effort is still only about 80% done. It is an enormous effort.

Jason Price, one of the producers of the Malawi shoot, tells the story of being nearly at wit's end trying to find anyone to help translate Edith Kaphuka's Chichewa into English -- until someone suggested he set up a Facebook Group, and then 2,500 mostly expatriate Chichewa speakers arrived ready to help (there are, of course, many speakers of Chichewa in Malawi, but the need to access streaming video to do the translations made that nearly impossible).

Through the steadfast effort of about 25 of these people, the full twenty-four hours of video has now not only been transcribed and translated, but put through about five stages of checking, rechecking and review to ensure its accuracy. And it is now the largest corpus of spoken transcribed Chichewa on the web. (What might this 'seed' corpus enable down the road? Chichewa online dictionaries? Spell checkers? Natural language processing? Search? This group of translators may, without realizing it, be forging the way for a real Chichewa language online presence.)

For Global Lives, this set of ten videos is just the beginning of a much larger library of human life experience. Not grand experiences, not Hollywood, not Bollywood -- in the words of David Harris, the project's director (responding to the umpteenth activist proposal, this one by yours truly) "we want boring!" Because what we see as the everyday, the mundane, the routine is in fact a picture of our own humanity - and for that each Global Lives shoot is worth a thousand Hollywood productions.

The Global Lives installation in the Room for Big Ideas will be open through June 20, 02010 at San Francisco's Yerba Buena Center for the Arts. The Long Now Foundation sponsored the world premiere installation in the YBCA Forum through a grant from The William and Flora Hewlett Foundation.

Posted 8 months, 3 weeks ago by Austin Brown

Mumble in the Jungle

Campbells Monkey

This week, the New York Times ran an article about a recent scientific discovery in the predator alert calls of Campbell's monkeys. Strikingly, they seem to have the ability to create complex calls out of multiple elements - a "morphological" (word building) process previously thought to only take place in human language.

Human languages do this all the time - for example the word 'walked' is built of two morphemes, one carrying the main verbal action 'walk' and the other marking past tense '-ed'. In the case of Campbell's monkeys, morphemes are often combined to indicate different types of threats. Previous observations of monkeys have shown that they sometimes use different types of calls for different types of predators, but what's unique about these calls is that some of them can be combined with other calls to change their meaning. So, instead of just having a "leopard!" call and an "eagle!" call as has been observed in vervet monkeys, Campbell's monkeys have a "leopard!" call that can be combined with a suffix that changes its meaning to indicate a less specific threat:

Crucially, “krak” calls were exclusively given after detecting a leopard, suggesting that it functioned as a leopard alarm call, whereas the “krak-oo” was given to almost any disturbance, suggesting it functioned as a general alert call. Similarly, “hok” calls were almost exclusively associated with the presence of a crowned eagle (either a real eagle attack or in response to another monkey's eagle alarm calls), while “hok-oo” calls were given to a range of disturbances within the canopy, including the presence of an eagle or a neighbouring group (whose presence could sometimes be inferred by the vocal behaviour of the females).

- Ouattara, Lemasson & Zuberbühler
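The stem-plus-suffix pattern in the quoted passage can be sketched in a few lines of Python. This is purely illustrative: the meanings paraphrase the quotation above, and the dictionary layout and function name are our own assumptions rather than anything taken from the study.

```python
# Meanings paraphrase the quoted passage; the layout is our own.
ALARMS = {"krak": "leopard alarm", "hok": "crowned eagle alarm"}
SUFFIXED = {"krak": "general alert", "hok": "disturbance in the canopy"}
SUFFIX = "-oo"

def interpret(call):
    """Split a call into stem and optional suffix, then look up its meaning."""
    if call.endswith(SUFFIX):
        stem, suffixed = call[:-len(SUFFIX)], True
    else:
        stem, suffixed = call, False
    if stem not in ALARMS:
        raise ValueError(f"unknown call: {call}")
    return SUFFIXED[stem] if suffixed else ALARMS[stem]

for call in ("krak", "krak-oo", "hok", "hok-oo"):
    print(call, "->", interpret(call))
```

The point of the sketch is that one suffix does the same job on two different stems, broadening a specific predator alarm into a less specific alert, which is what makes the process look morphological.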

Just as artificial intelligence researchers have been busy over the last several decades celebrating each previously-unique human capacity achieved by computers, biologists have been finding behaviors once thought to mark the uniqueness of humans in other animals. Neurobiologist and primatologist Robert Sapolsky recently gave a lecture at Stanford about the uniqueness of humans, which provides a great overview of what we share and don't share with other animals (as is currently understood).

Similarly, primatologist Frans de Waal has made a career of describing the political, cultural, emotional and moral lives of primates. His work has illustrated the evolutionary breadth and depth of many human characteristics previously thought to be recent behavioral innovations without precedent and unique to our species.

As artificial intelligence research looks forward to recreating human capabilities, it focuses our efforts to understand those capabilities. Similarly, in identifying in other animals capacities like syntax once thought to be unique to humans, we are afforded a clearer look back on the deep history and development of those capacities. Looked at this way, it actually did take millions of years to produce the works of Shakespeare.

Posted 9 months, 1 week ago by Laura Welcher

Human Language as a Secret Weapon

Navajo_Code_Talkers

Earlier this month, a small group of World War II Navajo Code Talkers – who are today in their eighties and nineties – marched as a group for the first time in the New York City Veterans Day Parade as a way to raise awareness in the US about their wartime contribution. The Code Talkers were Navajo speakers recruited by the U.S. military for sending coded verbal messages by radio in World War II – an effort legendary today as producing “the only unbroken code in modern military history.”

This caught my attention partly because Navajo is a threatened language – while there are 150,000 speakers at last count and several thousand monolinguals, the word on the wire is that Navajo is losing ground to English among the youngest in the Navajo community – and children are, after all, the ones who decide a language’s fate.

I also had this question in the back of my mind – could a human language be used in such a way today? Granted, we have sophisticated computer encryption that pretty much renders any human generated code obsolete. But say for a moment that we didn’t, or couldn’t use digital technology… do we simply know too much about what is possible in human language? And failing that, is there any language out there esoteric and isolated enough that it could be put to such use?

First, to clarify, there is nothing inherent about the Navajo language that made the code uncrackable – a quick perusal of the recent press turns up descriptors like “ancient language” and “complex grammar” which could apply to any human language. The phrase “near isolate” also doesn’t make sense because Navajo is a language with many linguistic relatives in the Athabaskan group throughout the Southwestern US, Canada, and Alaska.

What made the code uncrackable at the time was a combination of factors. The physical and social isolation of the Navajo speech community certainly contributed, as few non-Navajos spoke the language. Also, little was known linguistically about the language at the time, and linguistics outside of philology was itself a fledgling field of study. Most importantly, the code wasn’t just everyday Navajo, but a cipher based on Navajo with word-replacements like “tortoise” for tank or “iron fish” for submarine as well as Navajo substitutions for English military acronyms. A Navajo speaker was in fact captured and tortured for his knowledge at Bataan, but since he didn’t know the cipher, he was just as befuddled as everyone else.
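The word-replacement layer is easy to picture as a lookup table. Here is a minimal sketch in Python, assuming a two-entry codebook built from the examples above. It works on English glosses only, so it leaves out the Navajo layer entirely, and the function names are our own invention.

```python
# Toy codebook using only the two glosses mentioned in the post.
CODEBOOK = {"tank": "tortoise", "submarine": "iron fish"}

def encode(message, codebook=CODEBOOK):
    """Swap each military term for its code word, word by word."""
    return " ".join(codebook.get(word, word) for word in message.split())

def decode(message, codebook=CODEBOOK):
    """Reverse the substitution; only works if you hold the codebook."""
    reverse = {code: plain for plain, code in codebook.items()}
    # Replace longer code phrases first so multiword entries match whole.
    for phrase in sorted(reverse, key=len, reverse=True):
        message = message.replace(phrase, reverse[phrase])
    return message

secret = encode("one tank and one submarine sighted")
print(secret)
print(decode(secret))
```

Someone who understood every word of the encoded message but lacked the codebook would still hear only tortoises and iron fish, which is roughly the position the captured Navajo speaker was in.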

I wonder, though, whether a linguist today with a basic knowledge of the language, and/or access to basic tools like a grammar and dictionary, transported back to that time might have figured it out, given enough data and the context in which the messages were delivered. Relatively few cracked messages could yield the essential cryptographic key. Do all human languages have such basic description? Far from it. My best guess based on what we’ve been able to find for The Rosetta Project is maybe one half of all human languages? A third? Without this, the decryption task would have to encompass basic linguistic analysis as well.

So is it possible that a human language in this day and age could serve the purpose? Maybe, maybe not -- I welcome discussion. But if not – and here’s the real question on my mind – are we linguists done? Can we pack up our bags and go home? Although I think we understand something about human language – maybe a lot more than we did 70 years ago, it would be extreme hubris to say we really get all there is to human language at this point. I expect there are plenty of surprises in store even as far as grammatical structure is concerned – and at every level of structure. Many of the more interesting questions are likely to relate to how language is used in its cultural context -- like the Pirahã avoiding speaking about the remote past because it is inaccessible to eyewitness verification.

That many lifetimes could be spent puzzling it all out is one of the great joys of linguistic discovery. And to my way of thinking, the surprises about our human selves that lie in store are a primary reason to pursue language documentation as one of the great scientific and intellectual enterprises of our era.
