Rosetta Collection


The Rosetta Disk and 1000 Language Archive

The Rosetta Project began as an experiment in microetching technology: the Long Now Foundation wanted to evaluate the potential of microetched nickel disks for the very long-term archiving of textual and image data. Based on tests conducted by Los Alamos National Labs, such a disk could arguably withstand a variety of extreme environmental conditions and remain legible under magnification, providing a record far into the future. Because the disk is an analog representation of page images (similar in concept to microfiche, but with a much higher page density), it also avoids the well-known problems of digital storage.

Looking for suitable material to put on the disk prototype, the Long Now Foundation decided to initiate a project to collect basic descriptive information for 1000 of the world’s approximately 7000 languages. Why language information? Since most natural human languages are products of millennia of human history, they make compelling subject matter to inspire long-term thinking. Languages are also repositories of cultural information, showcasing human intellectual sophistication and cultural diversity, and are therefore appropriate content for such an artifact.

As a means of organizing and disseminating the collected information, the Board decided to create a website, the seed of the current Rosetta Project Digital Archive. Initial collection efforts focused on assembling basic "descriptive components" for each language: general metalinguistic information, phonology, grammar (morphology and syntax), numbers, a Swadesh word list, a parallel text (Genesis Ch. 1-3), glossed vernacular texts, maps, and orthographic information. These content areas were selected as the minimum representation most likely to be useful for contemporary reference, research, and education, as well as a best guess as to what might be of relevance for future linguistic archaeology. For the purposes of the Project website, these descriptive components provided basic contextualizing information for each language, and served as a framework for more in-depth future collection.


The ALL Language Archive: A National Science Digital Library Collection

In 02004, the Rosetta Project: ALL Language Archive was awarded a $1,000,000 NSF National Science Digital Library grant to grow the breadth and depth of the Rosetta collection, as well as to elaborate the navigation, search, interoperability, networking, and collaborative tools for users of and contributors to the site. A primary product of this grant is the current website, Rosetta V2.0, a linguistics-specific CMS built in Plone that can organize and display resources on any language, language family, subgroup, or dialect.

During the period of this grant, the Rosetta Project more than doubled its collection size. Now serving over 70,000 pages of language documentation on over 2,500 languages, it is the largest descriptive linguistic resource on the Net. A primary focus of in-house collection was the scanning of print materials available in local university libraries, with the goal of digitizing (to the extent possible) the legacy "paper trail" of language documentation, generally consisting of descriptive grammars, dictionaries, and collections of texts.

Alongside document collection, the Rosetta Project is also compiling a large corpus of lexical data in the form of Swadesh lists. To date, these lists have come from the following sources: Darrell Tryon's Comparative Austronesian Dictionary (1995), Tim Usher's Indo-Pacific database (2002 version), Paul Whitehouse's Australian and New Guinea database (2002 version), George Starostin's Dravidian database, and Ilya Peiros' Mon-Khmer database. In many cases, as with the Usher and Whitehouse collections, the 100-200 term Swadesh lists are a small subset of a much larger collection of lexical data, which is the subject of a new project being developed with MPI EVA.

Besides the continuation of in-house collection (described above), a number of organizations participated in developing the content of the site. Formal collaborators on the grant are indicated with a *:

The Linguist List* developed the "People Search", enabling searches for linguists working on many of the languages in the collection. (This service is hosted at the Linguist List site.)

The Endangered Language Fund* contributed archived physical resources in its collection as a means of developing our techniques and procedures for the digitization of non-print media. A large collection that comes to us through the ELF is the Alan Lomax Global Language Archive. As a result of this collection, we developed the capacity to digitize reel-to-reel audio, and were able to elaborate our consent forms for new depositors.

SIL International contributed scans of the majority of the print documents in their Language and Culture archives, primarily consisting of wordlists, dictionaries, and grammatical documentation.


After eight years of work, The Rosetta Project completes the first edition Rosetta Disk. Five limited edition disks are distributed to Rosetta Disk sponsors.


A digital version of the Rosetta Disk is made available for online interactive browsing, and for sale as a DVD in the Long Now Museum Store.

The Rosetta Disk

Fifty to ninety percent of the world's languages are predicted to disappear in the next century, many with little or no significant documentation.