2. The Automated Similarity Judgment Program (Google „ASJP project“ for more)

2. The Automated Similarity Judgment Program(Google „ASJP project“ for more)

A bit of history • Jan. 2007: • Cecil Brown (US linguistic anthropologist) comes up with idea of comparing languages automatically and communicates this to • Eric Holman (US statistician) and me. Brown and Holman work on rules to identify cognates implemented in an „automated similarity judgement program“ (ASJP). • May 2007: • Cecil Brown is in Leipzig and explains to me what the two of them have come up with and I begin to take more active part, adding ideas. • Aug. 2007: • Viveka Velupillai (Giessen-based linguist) joins in. • A first paper is written up (largely by Brown and Holman) showing that the classifications of a number of families based on a 245 language sample conform pretty well with expert classification.

Sept. 2007: • Andre Müller (linguist, Leipzig) joins. • Pamela Brown (wife of Cecil Brown) joins. • Dik Bakker (linguist, Amsterdam & Lancaster) joins, and begins to do automatic data-mining, an implementation in Pascal, and to look at ways to identify loanwords. • Oct. 2007: • Hagen Jung (computer scientist, MPI, makes an online implementation). • I take over the „administration“ of the project. • A second paper is finished about stabilities of lexical items, defining a shorter Swadesh list, etc.

Nov. 2007: • Robert Mailhammer (linguist, BRD) joins. • Dec. 2007: • Anthony Grant (linguist, GB) joins. • Dmitry Egorov (linguist, Kazan) joins. • Levenshtein distances are implemented instead of old „matching rules“ identifying cognates. • Febr. 2008 • The two papers are accepted for publication without revision (in respectively Sprachtypologie und Universalienforschung and Folia Linguistica). • March 2008: • The database exceeds 1700 languages and the sample is now world-wide.

Languages currently sampled

Languages in World Atlas of Languages Structures Languages in the ASJP database

The immediate future • Conference presentations in 2008 by various project members on the following issues: • the classification of creoles (Grant et al.) • how to identify loanwords automatically (Bakker et al.) • glottochronology (Wichmann et al.)

Long-term future • A complete database • A monograph on general and methodological results • A classification of the world‘s languages by families

So what have webeen up to so far?

The database • Contents: 40-item lists • Encoding: a simplifying transcription

Towards a shorter Swadesh list Procedure: • Measure stabilities of items on the Swadesh list • Find the shortest list among the most stable items that gives adequate results

Measure stabilites • count proportions of matches for pairs of words with similar meanings among languages within genera • add corrections for chance agreement • weighted means

Check whether it actually makes sense to assume that items have inherent stabilites by • seeing whether the rankings obtained correlate across different areas (in this case New World vs. Old World is convenient)

Stability and borrowability

No correlation within the 100-item list

Potential explanations • Borrowability may be more variable for given lexical items across areas than stability and not be an inherent property of lexical items (similar to typological features). • Borrowability is not a significant contributor to stability, at least as the segment constituted by the Swadesh 100-item list is concerned. • There are still far too little data on borrowability to be conclusive (the sample for studying stability was constituted by 245 languages, whereas we had only 36 language at our disposal for the study of borrowability).

Selecting a shorter list Correlation between distances in the automated approach and other classifications as a function of list lengths Ethnologue (Goodman-Kruskal gamma ) WALS/Dryer (Pearson product-moment correlation)

Transcriptions • 7 vowel symbols • Nasalization indicated but not length, tone, stress • Some rare distinctions merged • „Composite“ sounds indicated by a modifier • Vx sequences where x = velar-to-glottal fricative, glottal stop or palatal approximant reduced to V

Example of transcription: Havasupai (Yuman)

Another transcription example: Abaza (Northwest Caucasian)

Results for classification Two methods of evaluation: Looking at statistical correlations with WALS or Ethnologue classification Comparing tree with „expert trees“/expert knowledge Be my experts, and let‘s look at some families 

Towards Levenshtein chronologies • The formula is similar to the formula for glottochronology: t = log(1- LD)/2log(r), t = time LD = normalized Levenshtein distance r = retention rate

Does Levenshtein chronology work at all? • The assumption of a (fairly) constant rate of change can be checked by looking at branch lengths for lexicostatistical trees. Let‘s see some examples:

Tai-Kadai

Uto-Aztecan

Mayan

How to measure the age of a language group • Take the age of the two most divergent languages? No, this would bias the result high. • Take the average age of all language pairs? No, this would bias the result low. • Make the ages part of the lexicostatistical tree and measure lengths from root (midpoint) to tips? No, this is only doable for a UPGMA tree, which is known to mess up relations.

The last approach is taken by Serva and Petroni (2008) Serva, Maurizio and Filippo Petroni. 2008. Indo-European languages by Levenshtein distances. Available at www.arXiv.org

A suggestion • Find the midpoint in the tree of the language group and take the average ages of all pairs whose members are on either side of the midpoint. When there are outliers (or maybe always) make further partitionings.

Examples of well-balanced trees KORDOFANIAN (subgroup of NIGER-CONGO)

NORTHWEST CAUCASIAN

Examples of somewhat unbalanced trees ALTAIC

BANTOID (subgroup of NIGER-CONGO)

Examples of trees with single outliers SINO-TIBETAN

MORU-MA’DI (SUBGROUP OF NILO-SAHARAN)

Strategy (for outliers especially, but maybe for all groups) 1. Cut the family up by the midpoint and measure its age 2. If age > some constant T (maybe 2000 years) iterate the prodedure 3. Don‘t partition a group further once it is below T in age 4. When all groups are below T (or constitute single languages) measure the age by the average age of all pairs that are not in the same group

Methods for validating Levenshtein chronologies • Calibrate the retention rate for languages with known histories (don‘t expect that this is going to be perfect!) • Correlate them with glottochronologies made by a consistent method

An example of both validation and calibration: stages of Mayan (Mexico/Guatemala)

r = 0.88 (Aberrant dot: Eastern Mayan; without Eastern Mayan r = 0.96) Example of Classic Maya writing, which shows proto- Ch‘olan to break up around 400-600 AD (1600-1400 AD) (The text goes: „Your heart is satisfied, you have set it in order“)

Another prospect for the iterative partitioning procedure • The first ever objective definition of a „family“ or „genus“ • Perhaps we should talk about > 6K (~phylum), 6K (~family), 5K, 4k, 3K (~genus), 2k, 1K (~language), and 1/2K (~dialect) entities? • 6K entities will probably be comprised of 97% non-controversial grouping and some 3% that will eventually not be controversial • > 6K entities will be beyond the reach and concern of ASJP (but hang on until Friday for more hope)

Consider joining ASJP

Consider joining ASJP African Security & Justice Programme? No. Associação Sindical dos Juízes Portugueses? No. Always Sure of Joy and Prosperity? Yes --And never out of hard work.

wichmann@eva.mpg.de

2. The Automated Similarity Judgment Program (Google „ASJP project“ for more)