Save the Words is a nifty site by the Oxford English Dictionary where lexophiles can adopt an esoteric or obsolete word and revive its use. And get the t-shirt. It would be a fun educational tool in the context of an endangered language, where all of the words need saving.
Join Long Now's Rosetta Project on November 4 from 4 - 7 pm at UCLA's Hammer Museum where we team up with San Francisco-based CRITTER for an Enormous Microscopic Evening. We'll put a Rosetta Disk under the microscope, check out the fine (and finer) print, and maybe hunt for Easter eggs... More information on the evening's lineup from the Hammer Museum:
Enormous Microscopic Evening examines the museum from a microscopic perspective with CRITTER, a San Francisco-based salon dedicated to expanding the relationships between culture and the environment. The evening will focus on demonstrations and workshops about building and manipulating microscopes. Materials and samples taken from around the museum will be examined. Continuing the theme of microscopy, there will be micro performances (short concerts with tiny instruments) and other related events throughout the museum.
Jessie Little Doe Baird, a linguist who has worked for years on reviving the Wampanoag (Wôpanâak) Language, has just been awarded a 02010 MacArthur "Genius" Fellowship in honor of her work and research.
Baird, who is of Wampanoag heritage, studied at MIT under the indigenous language scholar Kenneth Hale. By immersing herself in the language, she has achieved fluency, effectively reviving in herself the spoken use of the long-silent language. Her research is focused on developing a dictionary of Wampanoag, which now includes nearly 10,000 words, as well as language teaching resources, through which she hopes to help usher the language into modern use in the Wampanoag community.
The 300 Languages Project is a special effort by The Rosetta Project to create a parallel text and audio corpus for the world's 300 most widely-spoken languages. We are seeking a limited set of volunteers to test its submission process and offer feedback to its coordinators before the project is globally launched in November. Native speakers of any language (including English) are encouraged to participate.
Swadesh list for the Puoc language in the International Phonetic Alphabet
In the 01950s, American linguist Morris Swadesh, as part of his overarching vision of a quantitative method for determining language relationships on a global and multimillennial scale, developed a set of one hundred words found to be unusually stable across time and language boundaries. Swadesh hypothesized that words like "fire," "moon," "mother" and "bone," common to human experience, were far less likely to change or be substituted with words borrowed from other dialects or languages. The 100-word "Swadesh list" (sometimes up to 207, depending on the variety of the list used) is now widely collected in linguistic field research, and functions as a kind of universal linguistic fossil. With careful study, these lists can reveal ancient language relationships and processes of linguistic change typically obscured by centuries-long processes of evolution and borrowing. As familiar examples, such processes transformed Chaucer's English into modern English and Latin into the modern Romance languages.
In 02004, The Rosetta Project undertook a National Science Foundation-funded project to increase both the size and utility of its long-term multilingual archive, and at that time added a large number of Swadesh lists to its collection. Lexical database archivists Tim Usher and Paul Whitehouse contributed original research (Tim Usher's 02002 Indo-Pacific database and Paul Whitehouse's 02002 Australian and New Guinea database were central among the additions) and also brought in outside resources, including Darrell Tryon's Comparative Austronesian Dictionary (01995), George Starostin's Dravidian database, and Ilya Peiros' Mon-Khmer database. In many of these cases, as with the Usher and Whitehouse collection, the 100-200 term Swadesh lists were a subset of a larger lexical data collection project. Although a Swadesh list is small compared with a resource like a dictionary, a large collection of the same material in many different languages is useful as a parallel dataset for cross-linguistic comparison.
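As a rough illustration of what a parallel dataset makes possible, the sketch below scores pairs of languages by the average normalized edit distance between their words for the same meanings. Everything in it is an assumption made for illustration: the four-item excerpt, the plain spellings (field lists are usually in phonetic transcription), and the function names. Comparative linguists rely on cognate judgments and regular sound correspondences rather than raw spelling similarity, so treat this as a toy, not as the method behind the archive.

```python
# Illustrative four-item excerpt in ordinary spelling.
# Real Swadesh lists have 100 to 207 items, usually in phonetic transcription.
SWADESH = {
    "English": {"fire": "fire", "moon": "moon", "mother": "mother", "bone": "bone"},
    "Spanish": {"fire": "fuego", "moon": "luna", "mother": "madre", "bone": "hueso"},
    "Italian": {"fire": "fuoco", "moon": "luna", "mother": "madre", "bone": "osso"},
}


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def list_distance(lang_a, lang_b, lists=SWADESH):
    """Mean normalized edit distance over the meanings both lists share."""
    a, b = lists[lang_a], lists[lang_b]
    shared = sorted(set(a) & set(b))
    if not shared:
        raise ValueError("the two lists share no meanings")
    total = sum(
        edit_distance(a[m], b[m]) / max(len(a[m]), len(b[m]))
        for m in shared
    )
    return total / len(shared)


if __name__ == "__main__":
    for pair in [("Spanish", "Italian"), ("English", "Spanish")]:
        print(pair, round(list_distance(*pair), 3))
```

On this tiny sample the Spanish and Italian lists come out much closer to each other than either does to English, which is the kind of signal that becomes meaningful only with full lists and proper phonetic data.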
This collection of Swadesh lists was included as a parallel data set among the documents micro-etched on the Rosetta Disk, a physical copy of The Rosetta Project's long-term linguistic archive created in 02008. And for a period of time, the lists were available on The Rosetta Project's website via an interactive tool which allowed visitors to view and compare lexical items in over a thousand languages and also contribute their own lexical data. But as the Rosetta Project site evolved and the structure of serving environments changed, this tool became technologically obsolete. While there was (and remains) no lack of storage space for the lists, there was a critical lack of what Long Now board member Kevin Kelly calls "movage."
"Movage," says Kelly, "means transferring the material to current platforms on a regular basis — that is, before the old platform completely dies, and it becomes hard to do. This movic rhythm of refreshing content should be as smooth as a respiratory cycle — in, out, in, out. Copy, move, copy, move." And it is movage, not storage, says Kelly, that is critical to keeping information alive: "The only way to archive digital information is to keep it moving." In other words, simply storing data isn't enough to ensure its longevity; it must be copied, moved, and made redundant. And not just once or twice — indefinitely. Kurt Bollacker, Long Now Foundation Digital Research Director, adds: "[b]ecause any single piece of digital media tends to have a relatively short lifetime, we will have to make copies far more often than has been historically required of analog media. Like species in nature, a copy of data that is more easily “reproduced” before it dies makes the data more likely to survive." 
Since the 02004 iteration of the Swadesh list program, The Rosetta Project has launched a comprehensive migration of all of its data to The Internet Archive, a free online digital library founded in 01996 with over 4 petabytes of storage. The Internet Archive exemplifies the paradigm shift in the field of information preservation from storage to movage: users of the site can upload any document they have permission to distribute to the site for free, where anyone with access to the internet can then download it to their own machine. Thousands of downloads are made every day from Internet Archive servers by users all over the world: early "movage" on a massive scale.
After a long process of unraveling and decoding the Swadesh list data, which had fallen victim to rapid changes in character encoding and database standards, The Rosetta Project has now moved the collection of 1,235 Swadesh lists into The Internet Archive. Recognizing the substantial merit and long-term advantages of the movage model and its successful early implementation by The Internet Archive, our goal is for the lists to have a long, useful, and redundant residence there.
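The post does not say which encodings were involved, so the following is only a sketch of one common kind of damage: text saved as UTF-8 and later read back as Latin-1, which turns every accented or phonetic character into two or more wrong ones. The function names `garble` and `repair` are hypothetical, and the actual recovery of the Swadesh data may have looked quite different.

```python
def garble(text):
    """Simulate the damage: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(text):
    """Reverse the misreading; return the text unchanged if it does not fit.

    Caveat: genuine Latin-1 text whose bytes happen to form valid UTF-8
    would be altered, so a real migration should review the results.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    original = "Wôpanâak"
    damaged = garble(original)
    print(damaged)          # the mis-decoded form
    print(repair(damaged))  # the recovered form
```

A repair like this only works while someone still knows which misreading happened, which is one more argument for moving data before the old platform is forgotten.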
The relocation of the Swadesh lists is also the first step of The Rosetta Project's latest undertaking, The 300 Languages Project. Source materials collected for The 300 Languages Project, whose aim is to address a need for highly-structured linguistic resources in the world's 300 most widely-spoken languages, will be stored at The Internet Archive with the rest of The Rosetta Project collection.
Was the 5-to-6-year period the Swadesh list data spent in the darkness unusual? According to Kelly, not at all: "We don’t know what the natural movage respiration cycle is for digital media yet since it is still very new," says Kelly, "but I suspect the cycle is much shorter than we think. I would guess it is 5 years. No matter what digital format you have your precious [data] stored on, you should expect to move it onto new media in five years — and five years after that forever!"
Egyptian Hieroglyphs on The Rosetta Stone were deciphered by scholars, but a new computer program written at MIT could potentially accomplish the same feat today:
“‘Traditionally, decipherment has been viewed as a sort of scholarly detective game, and computers weren't thought to be of much use,’ study co-author and MIT computer science professor Regina Barzilay said in an email.” (quoted in this recent writeup in the National Geographic Daily News).
The language in this case is Ugaritic, written in cuneiform and last used in Syria more than three thousand years ago. Archaeologists discovered Ugaritic texts in 1928, but linguists didn’t finish deciphering them for another four years. The new computer program did it in a couple of hours.
While an exciting and significant first step, the program is not a silver bullet solution to language decipherment. Human beings figured out Ugaritic long before the computer program came along, and it remains to be seen how well the program works with a never-before-deciphered language. Furthermore, the program relied on comparisons between Ugaritic and a known and closely related language, Hebrew. There are some languages with no known close relatives, and in those cases, the computer program would be at a loss.
Of course, we can’t be certain exactly how the technology may progress in the future. But with the Rosetta Disk designed to last for thousands of years, and with hundreds of languages classified in the Ethnologue as nearly extinct, an automated decoder of language documentation seems likely to prove useful eventually. It’s nice to know we’ve made a promising start.
The Rosetta Project is pleased to announce the Parallel Speech Corpus Project, a year-long volunteer-based effort to collect parallel recordings in languages representing at least 95% of the world's speakers. The resulting corpus will include audio recordings in hundreds of languages of the same set of texts, each accompanied by a transcription. This will provide a platform for creating new educational and preservation-oriented tools as well as technologies that may one day allow artificial systems to comprehend, translate, and generate these languages.
Huge text and speech corpora of varying degrees of structure already exist for many of the most widely spoken languages in the world. English is probably the most extensively documented, followed by other majority languages like Russian, Spanish, and Portuguese. Given some degree of access to these corpora (though many are not publicly accessible), research, education and preservation efforts in the ten languages which represent 50% of the world's speakers (Mandarin, Spanish, English, Hindi, Urdu, Arabic, Bengali, Portuguese, Russian and Japanese) can be relatively well-resourced.
But what about the other half of the world? The next 290 most widely spoken languages account for another 45% of the population, and the remaining 6,500 or so are spoken by only 5%. This latter group represents the "long tail" of human languages.
Equal documentation of all the world's languages is an enormous challenge, especially in light of the tremendous quantity and diversity represented by the long tail. The Parallel Speech Corpus Project will take a first step toward universal documentation of all human languages, with the goal of providing documentation of the top 300 and providing a model that can then be extended out to the long tail. Eventually, researchers, educators and engineers alike should have access to every living human language, creating new opportunities for expanding knowledge and technology alike and helping to preserve our threatened diversity.
This project is made possible through the support and sponsorship of speech technology expert James Baker and will be developed in partnership with his ALLOW initiative. We will be putting out a call for volunteers soon. In the meantime, please contact email@example.com with questions or suggestions.
Yurok (YUR) is the language of the Yurok people of northwestern California. As with most indigenous American languages, English has largely replaced Yurok since European contact, so that as of 2009 it is near extinction. Yurok belongs to the Algic language family, most of whose other members (the Algonquian languages) are geographically distant from Yurok. Accordingly, Yurok is surrounded by languages unrelated to it, except for the only distantly related (and extinct) Wiyot.
Yurok has a set of glottalized consonants (sounds produced with the glottis closed, as if holding your breath) that contrast with their nonglottalized counterparts. The glottalized sounds are less common but are important in Yurok morphology, such as verb conjugations.
Some verbs must inflect (be conjugated) for person and number, others cannot, and many can go either way. For example, the word for eating must take different endings according to the subject: nepek’ for ‘I eat,’ nepe’m for ‘you (singular) eat,’ nep’ for ‘s/he eats,’ nepoh for ‘we eat,’ nepu’ for ‘you (plural) eat,’ and nepehl for ‘they eat.’ On the other hand, chek ‘sit’ always maintains the same form no matter who and how many are sitting. Finally, skewok ‘want’ can remain skewok for all subjects, or it can inflect as skewoksimek’ ‘I want,’ skewoksime’m ‘you (singular) want,’ skewoksi’m ‘s/he wants,’ etc., just as the verb ‘eat’ does.
Yurok has no distinct category of adjectives; the words that translate to adjectives or express adjective-like meanings behave like verbs in terms of word order and inflection. For example, there is a word for being big that inflects just as verbs do: peloyek’ ‘I am big,’ peloye’m ‘you are big,’ pelo’y ‘s/he is big,’ etc. Numerals are also a type of verb, and they have different forms according to the type or shape of thing being enumerated (for example, humans versus animals, or flat things versus tufted things).
Ways of writing Yurok have varied over time and remain not entirely settled. In the 1980s the Yurok Language Committee adopted UNIFON, designed (by an economist) as an English pronunciation key. However, UNIFON was impractical and therefore unpopular, and the Yurok Language Committee adopted an alternative system, which was later revised by linguists working on the language (as Leanne Hinton details in her unpublished 2010 article "Orthography Wars"). The Berkeley Yurok Language Project, a searchable collection of Yurok stories, words, and morphemes, lists entries in both the original alternative system and the revised system.
Lakota, the language of the Lakota tribe of the Great Plains, is fading before its speakers' eyes. Although Lakota is one of the most robust Native American languages today, its speaker population has fallen far since its peak in pre-colonial times and continues to dwindle. This reflects the experience of many native tribes, and is largely a result of US government policies concerning these peoples. Lakota speakers (the Ethnologue puts their number around 6,300) are left in danger of losing not only their language but the vital cultural information it holds.
Lakota, like most of the world's languages, was not originally written, and much of the long tradition and history of the Lakota exists only orally in their stories and ceremonies. The Lakota people did, however, keep detailed historical records, as can be seen in the "Lakota Winter Counts," now archived online on the website of the Smithsonian National Anthropological Archives. These are pictographic calendars detailing important historical events in the lives of the Lakota.
An alphabetic writing system for the Lakota language in use for the past four decades has now been widely adopted by Lakota speakers. And, in a modern effort to revitalize the Lakota language, the Lakota Language Consortium has compiled textbooks from introductory to college level and an expansive online forum to assist children and adults in learning and thereby preserving the language.
They have also compiled a 20,000-word dictionary of Lakota, including wonderfully complex words like "woímnayankel," which expresses the humbled yet connected feeling one experiences when witnessing something particularly majestic in nature, such as the aurora borealis. Lakota words are often this complex, efficiently expressing ideas that would take a sentence or two in English. Efforts like those of the Lakota Language Consortium allow the Lakota language to not only survive but flourish, giving future generations the chance to embody and spread the culture of their ancestors.
The Rosetta Project's collection on the Internet Archive has records of the Lakota language in the form of three text excerpts: a description of where Lakota was historically spoken; a phonology, which uses a chart to characterize phonemes by linguistic traits; and an orthography, or explanation of the Lakota writing system.
The Lakota were historically known as the Sioux, but this is an exonym from their Algonquian neighbors to the east, and the term is deprecated today.
Rosetta Project linguists and archivists traveled to Maker Faire this past weekend to demo the Rosetta Disk for a crowd of nearly 80,000 people. We brought the first and second prototypes of the Rosetta Disk, and set up a microscope with a camera to view Disk pages up close. We also had a "Digitization Station" where Maker Faire attendees could watch and participate in the collection of language documentation for the disk.
Would you like to help translate the subtitles of this video? You can here at dotSUB.