The genetic composition of the native languages of the Americas is one of the major unsolved issues in present-day science. For both subcontinents together, about 175 phylogenetic lineages (language stocks, language families, linguistic isolates) can be distinguished. Historical comparative linguists working on New World languages tend to be skeptical about any attempts at reducing the number of lineages by proving that at least some of them may derive from a common source. Progress is hampered by the fact that few researchers can survey the field as a whole, due to its formidable complexity and the need for ready-to-use knowledge of as many languages and language groups as possible. At the other extreme of the spectrum of opinions, Greenberg (1987) posited the genetic unity of all New World languages except for two North American lineages (Na-Dene and Eskimo-Aleut). Greenberg’s argument for a Pan-American language family called Amerind has been rejected by a majority of linguists in the field, mainly on methodological grounds (Adelaar 1989, Campbell 1988, Matisoff 1990). Meanwhile, the amount of accessible data on Native American languages of all areas has increased enormously over the last two decades due to a wave of interest in language documentation and language analysis focusing on languages in danger of extinction. Since most Native American languages are threatened with extinction, our knowledge of these languages has benefited from this trend to a considerable extent. Nevertheless, the expectations for a breakthrough in our understanding of the Americas in terms of phylogenetic relations remain unchanged. A synthetic approach to the New World linguistic past is urgently needed.

The present project will contribute to such a synthetic approach by focusing on two key areas of high demographic weight, Mesoamerica (Mexico and Western Central America) [1] and the Middle Andean region (roughly the coast and highlands of Ecuador, Peru and Bolivia), in order to improve our understanding of the linguistic dimension of the history of the peopling of the Americas. The project seeks to identify pre-colonial linguistic connections between these two areas at a minimum of two levels of antiquity: (i) commonly inherited similarities related to the earliest migrations, and (ii) similarities due to secondary contact between the two areas after they had developed a relatively high population density. The project will benefit from the experience of the PI, who has a long history of research and scientific engagement as a leading specialist in both areas, and that of the Co-Investigator, who is a leading authority in the field of Mesoamerican languages.

It has been established since the late 16th century that the American native population had its origins in northeastern Asia (Acosta 1590). Siberian hunters and gatherers occupied Beringia, the area, now submerged, that united Siberia and western Alaska during the last Ice Age (before 13,000 B.P.). Subsequently, descendants of these first American Indians migrated in a southward direction until both American subcontinents were populated. Until recently, the relatively late human occupation of the Americas was attributed to the existence of an ice cap that covered most of the northern half of North America during the Ice Age (e.g. Lynch 1990). Because of this obstacle, the first settlers of Alaska would not have been able to continue their trek southward until an ice-free passage developed east of the Rocky Mountains around 11,500 B.P.

Nowadays, many researchers envisage the possibility of coastal migrations along the Pacific coast that may have begun at an earlier stage, unhindered by the ice cap (Fladmark 1979, Dillehay et al. 1992). The hypothesis of such a coastal migration route is of particular importance for the peopling of countries situated along the Pacific Ocean, in both North and South America, as the new settlers would have been dependent on marine resources (shellfish) and would not have been attracted to an inhospitable and still largely uninhabited interior. The possibility of such early migrations has recently been strengthened by large finds of prehistoric tools in Oregon, Texas and on the Channel Islands of California (Erlandson et al. 2011, Waters et al. 2011). Either way, Mesoamerica with its mild climate and its funnel-shaped geographical form became the logical goal for migrations out of North America and soon attained a relatively high level of population density.

The population that first reached South America had to find its way through another narrow and difficult passage, the jungle of eastern Panama.[2] From there on, migrations continued in two basic directions, along the Pacific coast west of the Andes and into the tropical lowlands east of the Andes. Alternatively, the Pacific coast of South America may have been settled by marine populations who used boats for relatively long coastal treks. The central section of the South American Pacific coast with its adjacent Andean region became another area of relatively high population density, especially after the first agricultural and urban development some 5,000 years ago (Mann 2005).

In spite of its relatively recent character, the native population of the New World exhibits an extreme genetic diversity in the linguistic domain, which is only equaled by the languages of New Guinea. True enough, a common origin for the native languages of South America and most of North America, including detailed sub-classifications, was claimed by Greenberg (1987), but his classification has not been accepted by a majority of the researchers working on these languages, due to its methodological insufficiency, poor handling of data, and general lack of arguments (Adelaar 1989, Campbell 1988, Matisoff 1990). The linguistic diversity of the Americas does not primarily reside in the number of its languages,[3] but in the number of linguistic lineages consisting of language stocks, language families and linguistic isolates (= languages without known genetic relatives). The number of linguistic lineages established for North America and Mesoamerica (including Central America) together oscillates around 75 (Campbell and Mithun 1979). Conservative counts by Loukotka (1968) and Kaufman (1990), respectively, distinguish 117 and 118 separate lineages for South America (including the Caribbean islands).[4] The two South American lists do not coincide completely, suggesting that the number of lineages to be distinguished for this area may actually be higher than 118. Furthermore, scores of undocumented, extinct languages and potential isolates have never been accounted for in any classification. On the other hand, recent advances in the genetic classification of the languages of eastern South America have made possible a reduction of the number of lineages recognized for that area. The Macro-Jê hypothesis (Rodrigues 1999), which brings together a number of language families and former linguistic isolates spoken in Brazil and Bolivia (Adelaar 2008, van der Voort and Ribeiro 2010), has become accepted.
Meanwhile, a more comprehensive genetic grouping involving the Cariban, Tupian and Macro-Jê languages (Rodrigues 2000, 2009) is also envisaged.

Speaking in typological rather than in genetic terms, the native South American languages are characterized by a division between the languages of the Andean region (including areas to the south and north of the Middle Andes region, which is one of the foci of this project), on the one hand, and the languages of the tropical lowlands of eastern South America, on the other. This Andean-Amazonian division, which is reflected by fundamental structural differences between the languages of each region, suggests that the Andean languages have their origin in populations that came forth from the early coastal migrations, rather than from the eastern lowlands, whereas the eastern lowland languages may go back to migrations that followed a route to the east of the Andes. However, extended periods of language contact between Andean populations and Amazonian populations have contributed to making the division between the two linguistic areas more fluid (for a relevant case study see Adelaar 2006).

Although phylogenetic links have been uncovered between most of the major linguistic families of eastern South America (see above), no such advances have been made with respect to the languages of the Middle Andes region. So far, they continue to defy all efforts at genetic classification, both internally (within the region) and externally (in connection with other regions). Hence, the project looks at connections for the Middle Andean languages in areas where the speakers of these languages originally came from, that is, the North American subcontinent, Mesoamerica in particular. The linguistic, archaeological and human population genetics perspectives on the migration history of the two areas will be assessed in a Post-doc project, thus providing a general context to the search for linguistic remnants assignable to the earliest period of settlement.

No less importantly, the project searches for linguistic evidence of (pre-Columbian) secondary contacts between Mesoamerica and the Middle Andes. Archaeological evidence has shown that such contacts must have existed, as highly complex metallurgical techniques (bronze based on copper-arsenic alloys) were transmitted from the northern Peruvian coast to northwestern Mexico during the second half of the first millennium of our era (Hosler and Stresser Péan 1992). Close similarities in the artistic traditions of coastal Ecuador and Western Mexico have attracted attention for decades (Willey 1971, Anawalt 1992). Trade relations involving luxury products, such as the shell Spondylus princeps, involved coastal populations from northern Peru to Mexico (Shimada 1999, Stothert 2001). It seems inconceivable that such highly specialized contacts would not have left any linguistic traces (Hopkins in press). Finding the linguistic correlates of these cultural movements, which over the centuries must have left human settlements at different locations of the contact area, is one of the objectives of this project.

Two partly documented linguistic isolates, Purépecha (or Tarascan) and Mochica, occupy key positions in the search for secondary linguistic relations between the Middle Andes and Mexico. Purépecha, spoken in the state of Michoacán (Western Mexico), was the language of a powerful kingdom, rival to the Aztecs, at the time of the Spanish conquest. The Purépecha language (Foster 1969, Chamoreau 2000) is not only unrelated to any other Mesoamerican language, but it is also typologically unusual for that area. Its agglutinative structure based on suffixes and its use of an accusative case marker (rare in the Americas) are reminiscent of the Andean languages Aymara and Quechua, as well as the Barbacoan languages of Ecuador and southern Colombia (Awa Pit, Guambiano, etc.).[5] Furthermore, connections with the Chibchan family of Central America have been proposed (Greenberg 1987). Finding the closest linguistic relatives of Purépecha is a challenge that this project will take up. It will be the subject of a Ph.D. project.

Mochica, partly documented by Carrera (1644) and Middendorf (1892), was spoken until around 1950 on the northern coast of Peru, by descendants of the northern Mochica and Lambayeque civilizations. Mochica is also a linguistic isolate and it is typologically divergent from the surrounding languages to an extreme degree. Many of its particularities (numeral classifiers, profuse use of passive constructions, strict separation between possessed and non-possessed nouns, enclitic tense and personal reference markers, CVC root structure matched with CVCVC types of instrumental derivation), as well as lexical similarities, are reminiscent of the Mayan languages in Mesoamerica (Stark 1968, Adelaar and Muysken 2004). Defining the genetic position of Mochica also constitutes a priority for this project, which will be addressed in a Ph.D. project.

Several authors have pointed at deep similarities between many of the languages located on the Pacific side of the Americas (Nichols 1992, Liedtke 1996), suggesting old coastal migratory movements and possibly secondary coastal contacts. To disentangle the layers of antiquity to which such coastal migrations and contacts correspond is a challenge for this project, for which criteria must be developed. By developing such criteria, the project will contribute to the analysis of multi-layered contact situations that have been built up during a considerable period of time. For a successful assessment of the linguistic contacts that existed along the Pacific coast extending from Mesoamerica to the Middle Andes, it is necessary to focus on all the languages that were spoken in this area in past and present. Since the Pacific coast suffered massive depopulation after the Spanish conquest, mainly as a result of epidemics (Denevan 1992), many of these languages, including a number of isolates, became extinct and are scarcely documented (e.g. languages of southern Baja California and Cuitlatec in Mexico, Xinca in Guatemala, Lenca in Honduras and El Salvador, Cueva in Panama, Esmeraldeño in Ecuador, Sechura, Tallán and Quingnam in northern Peru). Other linguistic isolates are still viable, but little is known about their genetic identity (Huave and Oaxaca Chontal in Mexico). In addition, some of the more widespread, interior-based languages such as Uto-Aztecan languages of northwest Mexico, the Oto-Manguean languages Mixtec, Tlapanec and Zapotec (Mexico), Mayan languages (Guatemala and Chiapas), Chibchan and Chocoan languages (Central America and Colombia), Barbacoan (Colombia, Ecuador) and Quechuan (Ecuador, Peru) also have coastal connections. A focused study of the common elements that can be found in these languages will be the subject of a Post-doc investigation in this project. 
Particular attention will be given to the existence of maritime terms which may have diffused among coastal populations.[6]

The Central American land bridge and the Andean part of Colombia are known among archaeologists as the Intermediate Area (see Constenla 1991 for its linguistic correlate). This intermediate region is geographically dominated by an important language family with a considerable internal time-depth: Chibchan. Whether Chibchan had its roots in South America and moved into Central America or the other way around remains to be established, although the latter possibility seems to have more adherents. Although the Chibchan migrations appear to be mostly land-bound, the possibility of coastal implications is evident, considering the vicinity of the Pacific Ocean throughout Central America. In addition, all the other linguistic lineages of the Intermediate Area (Chocoan, Misumalpan, Páez, etc.) will be taken into consideration, particularly from the perspective of their external bilateral relations. The Intermediate Area and its external relations will be the subject of a Ph.D. project.

The two main Andean language families, Aymaran and Quechuan, play an important role in the project. As has been amply discussed in the literature (e.g. Heggarty 2005, Cerrón-Palomino 2008), these two language groups are structurally and phonologically very similar and share a substantial percentage of common lexicon. Nevertheless, they are not considered to be genetically related.[7] The search for genetic relations involving these language families will require a separation of their lexicon, in which the common lexicon is either set apart or assigned to one of the two language groups according to criteria to be developed. Initial methodological steps for teasing apart the Aymaran and Quechuan lexicon were presented in Adelaar (1986). A Post-doc will carry on this task and prepare the Aymaran and Quechuan data for external comparison. The two language groups will then be subjected to bilateral comparison with a previously selected set of lineages. The selection of these lineages will be supported by the Automated Similarity Judgment Program for lexicostatistical analysis, or ASJP (see Methodology). Finding the external relations of Aymaran and Quechuan will be a test case for both the ASJP and the method of bilateral language comparison to be introduced below.

As indicated before, the existence of secondary contacts between Mesoamerica and the Middle Andes finds support in archaeological and (ethno)historical data, which point at pre-colonial trade movements and technological exchange between the two areas. A study of linguistic contacts and migrations must be well-informed about the findings of related disciplines in this domain. To assess the evidence of secondary migrations and contact based on research other than linguistic research will be the subject of a Ph.D. project.

Overview of Mesoamerica and the Middle Andes


In this project, the search for possible genetic links connecting languages of Mesoamerica and the Middle Andes is primarily based on qualitative criteria. Quantitative methods are only used as an orientation tool to establish research priorities where necessary. The PI is convinced on the basis of his own experience that the possibilities of applying qualitative techniques for the discovery of genetic relations between apparently unrelated Native American languages are far from exhausted. In 1999, the PI established a genetic relationship between two alleged linguistic isolates of western Amazonia that had never been associated or brought into comparison before, Harakmbut (southeastern Peru) and Katukina (state of Amazonas, western Brazil) (Adelaar 2000). Subsequently, application of the same qualitative method revealed a genetic link between an alleged isolate, the Chiquitano language of eastern Bolivia, and the Macro-Jê language stock of Brazil (Adelaar 2008). The Chiquitano-Macro-Jê relationship was brought forward by Greenberg (1987) but had never been accepted by leading authorities in the field (e.g. Rodrigues 1986). Yet another recently discovered genetic link is that of the Jabutian languages (state of Rondônia, western Brazil) and Macro-Jê (Ribeiro and van der Voort 2010).

The establishment of more genetic connections between New World languages is an urgent task that must not be left to coincidence. The process can be accelerated by a systematic search focused on pairs of pre-selected languages or, in the case of well-established language families, of (partly) reconstructed proto-languages. This method of bilateral language comparison is novel in that it has never been systematically applied to languages of the Americas. It may be a laborious procedure, but it is the best way to uncover unexpected links in a situation where so little is known about the pre-history of the languages at issue. The project intends to find hitherto unsuspected links between Mesoamerican and Andean languages by this procedure. To this end, at least a hundred selected language pairs (more if necessary) will be analyzed. Whenever adequate reconstructions of proto-languages are available, they will be used in the comparisons. If necessary, such reconstructions will be provided by the team.

Lexical comparisons between languages must not be based on raw lexical data as has often been the case in the analysis of lexicostatistical data taken from unanalyzed word-lists. In order to ensure a successful and objective comparison, the lexical material is analyzed and reduced to its etymological essence by first eliminating residual morphological material. Such morphological material may not only consist of productive inflectional affixes that are easy to recognize (for instance, possessive affixes with nouns, subject or agent markers with verbs), but also of remnants of morphological formations that are no longer productive. When grammatical descriptions do not provide the necessary information to identify such elements, the language data will be analyzed accordingly by the team members of the project. Thus, instead of measuring the loss and/or retention of lexical similarity, which is the main function of lexicostatistical analysis, this method seeks to actively recover inherited language material by internal reconstruction, trying to arrive at earlier stages of a language without falling into arbitrariness. Semantic shifts will be taken into account whenever they fit the cultural context of the languages at issue. For instance, the members of the semantic pairs head/hair, hand/leaf, dog/jaguar are often interchanged in languages of the tropical lowlands of South America. In a Mesoamerican or an Andean context, these shifts cannot be automatically expected to occur, but other types of frequent semantic shift may be established. The ‘pro-active’ procedure depicted here proved essential for the confirmation of the genetic relationship between Chiquitano and the Jê languages. It was subsequently confirmed by morphological and morphophonemic correspondences, which made this case of genetic relationship even stronger (Ribeiro 2010).
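The preparatory step described above, removing recognizable affixes before forms are compared, can be illustrated with a minimal sketch. It assumes that an analyst has already listed the affixes of a language; the function name `strip_affixes`, the `min_root` safeguard and the word forms are hypothetical placeholders invented for illustration, not data from any language in the project. Identifying fossilized, non-productive morphology requires internal reconstruction by a specialist and is not something this sketch attempts.

```python
def strip_affixes(form, prefixes=(), suffixes=(), min_root=2):
    """Remove at most one listed prefix and one listed suffix from a word form.

    The affix lists are supplied by the analyst (hypothetical here).
    min_root guards against stripping so much that no plausible root remains.
    """
    root = form
    # Try longer affixes first so that a short affix does not pre-empt a longer one.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if prefix and root.startswith(prefix) and len(root) - len(prefix) >= min_root:
            root = root[len(prefix):]
            break
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and root.endswith(suffix) and len(root) - len(suffix) >= min_root:
            root = root[:-len(suffix)]
            break
    return root


# Invented forms: "no-" stands for a possessive prefix, "-mi" for a nominal suffix.
print(strip_affixes("nokatami", prefixes=("no",), suffixes=("mi",)))  # kata
print(strip_affixes("noa", prefixes=("no",)))  # noa (too short to strip)
```

The point of the sketch is only that comparison operates on roots rather than on raw citation forms; the linguistic judgment lies in compiling the affix lists, not in the string operation.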

For the establishment of genetic links, initial discovery procedures will be based on basic lexicon unlikely to be borrowed even in intense contact situations. It is assumed that, whenever two languages are genetically related, regular lexical correspondences are more likely to show up in the basic vocabulary (body part terms, kinship terms, non-cultural terms referring to the natural environment, etc.) than in other parts of the lexicon of the languages concerned. If correspondences are more frequently found in non-basic, cultural lexicon, a borrowing scenario is more likely. These premises are of extraordinary importance for the two main Andean language families, Aymaran and Quechuan, which share more than 20% of their vocabulary at the family level. However, few similarities are found in the most basic vocabulary, leaving lexical diffusion and language contact as the most likely source for the similarities (Adelaar 1986, Heggarty 2005, Cerrón-Palomino 2008).
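The premise stated in this paragraph can be expressed as a simple comparison of match rates. The following is a minimal sketch, assuming that an analyst has already judged, item by item, whether two languages show a correspondence; the class `Comparison`, the function names and the sample judgments are hypothetical and serve only to illustrate the reasoning. In the project itself this assessment is qualitative, so the output labels are orientation aids and not conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    meaning: str     # gloss of the compared item
    is_basic: bool   # True for basic vocabulary, False for cultural lexicon
    matches: bool    # analyst's judgment that the two forms correspond


def match_rate(items, basic):
    """Share of matching items within basic (or non-basic) vocabulary."""
    subset = [c for c in items if c.is_basic == basic]
    if not subset:
        return None
    return sum(c.matches for c in subset) / len(subset)


def likely_scenario(items):
    """Apply the premise: more matches in basic lexicon suggests inheritance."""
    basic = match_rate(items, True)
    cultural = match_rate(items, False)
    if basic is None or cultural is None:
        return "insufficient data"
    if basic == 0 and cultural == 0:
        return "no evidence of a link"
    if basic > cultural:
        return "inheritance plausible"
    return "borrowing more likely"


# Invented judgments: few matches in basic vocabulary, many in cultural lexicon.
sample = [
    Comparison("hand", True, False), Comparison("water", True, False),
    Comparison("mother", True, True), Comparison("stone", True, False),
    Comparison("maize", False, True), Comparison("market", False, True),
    Comparison("loom", False, True), Comparison("priest", False, False),
]
print(likely_scenario(sample))  # borrowing more likely
```

A pattern like the invented one above, with correspondences concentrated in cultural lexicon, is the configuration that the paragraph associates with contact rather than common descent.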

A useful tool for the selection of basic lexicon is the list of 40 lexical items established by Holman, Wichmann and Brown at the Max Planck Institute for Evolutionary Anthropology in Leipzig, in the framework of the Automated Similarity Judgment Program (ASJP) for computerized lexicostatistical analysis (Holman et al. 2008). The ASJP basic vocabulary list makes it possible to recognize similarities in the basic vocabulary of language pairs at an early stage. This 40-item list has been shown to be more effective than a corresponding list of 100 items (basically adapted from the Swadesh 100-item list). It must be emphasized that the ASJP list will be used in this project mainly as a fact-finding procedure. Once a genetic link is discovered by means of ASJP or otherwise, it will be analyzed to the deepest possible extent using the procedure of bilateral language comparison in order to establish regular correspondences as required by the standard comparative method (cf. Campbell 1998).
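To make the idea of an automated first screening concrete, the following minimal sketch compares two short word lists meaning by meaning. It assumes normalized Levenshtein (edit) distance as the similarity measure; that choice is an assumption made for illustration, since the text above does not specify which metric is used. The function names and the word forms are hypothetical placeholders and are not taken from the ASJP list or from any real language.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer form (0 = identical)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def mean_list_distance(list_a, list_b):
    """Average normalized distance over the meanings attested in both lists."""
    shared = sorted(set(list_a) & set(list_b))
    if not shared:
        raise ValueError("the two word lists share no meanings")
    return sum(normalized_distance(list_a[m], list_b[m]) for m in shared) / len(shared)


# Invented forms for two hypothetical languages X and Y.
list_x = {"hand": "kata", "water": "mune", "stone": "tiro"}
list_y = {"hand": "kato", "water": "muni", "stone": "pela"}
print(mean_list_distance(list_x, list_y))  # 0.5
```

A low average distance would merely flag a language pair for closer inspection; it says nothing about whether the resemblance reflects inheritance, borrowing or chance, which is why the text treats such screening as a fact-finding step.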

The emphasis on finding inherited similarities does not mean that the project disregards similarities based on borrowing or diffusion. The identification of language contact situations and elements of lexical diffusion is just as important for the study of pre-modern migrations and trade contacts as the discovery of genetically based similarities (cf. Aikhenvald and Dixon 2001). It will be assumed that lexical similarities between unrelated languages must be attributed to borrowing unless compelling evidence of a common genetic origin is found. As a matter of course, any links once detected will always be followed up by an in-depth study of the correspondences, and these may turn out to be the result of contact when phonologically unpredictable or when the criterion that basic vocabulary must show more common items than non-basic vocabulary is not met.

Linguistic field research is an essential part of this project. If essential data can only be obtained by direct contact with speakers, field research by team members will be encouraged and funded by the project. Similarly, research in local archives geared at the recovery of unpublished data on extinct languages that are crucial to the project will also be facilitated.

A particular methodological challenge for the project will be to establish criteria for distinguishing between similarities assignable to the initial population movement into the Americas and migrations or contacts that occurred at a later stage. The team will give continuous attention to this question in regular meetings and by comparing intermediate results. It is to be expected that relatively recent contact phenomena will be more regular and more recognizable than old inherited similarities, but the project team will keep an open mind to any indication that may provide a better insight into this highly fascinating matter.


[1] The Mesoamerican area was defined as a cultural and archaeological area by Kirchhoff (1943). For its linguistic correlate see Campbell, Kaufman and Smith-Stark (1986).

[2] There are no indications that the Caribbean island region was used for migrations into South America at this early stage.

[3] About 900 languages are attested in the Americas, including recently extinct ones, but not counting an undefined number of poorly documented or undocumented languages that have disappeared since the European discovery and occupation.

[4] There is hardly any overlap between the lineages attested in the two areas. Only the Chibchan language family occupies areas in South America, as well as areas that can be assigned to Mesoamerica. Another case of overlap is the Garífuna language in Central America. It belongs to the Arawak language family, which is native to South America and the Caribbean islands.

[5] Swadesh’s (1967) attempt at establishing a relationship between Purépecha and Quechua failed because it was based on an unhappy choice of vocabulary correspondences. Liedtke’s (1996) comparison looks more promising in this respect.

[6] An interesting example is the Mochica term chomme ‘sea-lion’, which is found as thumi in 16th century coastal Quechua. The similarity with the Araucanian (Mapuche) form l.ame ‘sea-lion’ is suggestive. 

[7] This is a long-standing debate that has its origins in the 17th century. There are still adherents of a Quechuan-Aymaran genetic unity, but the leading view is that they are not demonstrably related.