openlexicon

Access to lexical databases

Open lexical databases

Below is a directory of open lexical databases. Click on the name of any database to access its README file, which gives more information and links to the datasets.

Usage

Most datasets are provided in the form of .tsv or .csv files (tab-separated values or comma-separated values). These are plain text files that can easily be imported into R or Python, or even opened with Excel. Check out our script examples.
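
As a minimal sketch of how the two formats differ in practice, the snippet below reads the same table once as tab-separated and once as comma-separated text, using only Python's standard csv module. The sample rows and the column names `word` and `freq` are invented for illustration and are not taken from any of the databases; the helper `read_table` is likewise hypothetical.

```python
import csv
import io

# Invented sample rows, for illustration only (not taken from any database).
TSV_SAMPLE = "word\tfreq\nchat\t12.5\nchien\t8.0\n"
CSV_SAMPLE = "word,freq\nchat,12.5\nchien,8.0\n"


def read_table(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Only the delimiter changes between the two formats.
tsv_rows = read_table(TSV_SAMPLE, "\t")
csv_rows = read_table(CSV_SAMPLE, ",")
```

Note that the csv module returns every value as a string, so numeric columns need an explicit conversion such as `float(row["freq"])`.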

In R or Python, you can directly download datasets from the links provided in the README file. For example:

in Python:

  import pandas as pd
  lex = pd.read_csv('http://www.lexique.org/databases/Lexique383/Lexique383.tsv', sep='\t')
  lex.head()

in R:

  library(readr)
  lex = read_tsv('http://www.lexique.org/databases/Lexique383/Lexique383.tsv')
  head(lex)

In R, however, we recommend using the R dataset fetcher, because:

it avoids having to specify the location of the dataset on the web;
it always points to the latest version of a dataset, if it has been updated;
it provides a caching mechanism: the dataset is downloaded only if necessary, otherwise a local copy is used;
it checks the sumfile of the dataset to make sure that you have the correct version.

For example, to download the table of Lexique383:

  require(tidyverse)
  require(rjson)
  source('https://raw.githubusercontent.com/chrplr/openlexicon/master/datasets-info/fetch_datasets.R')
  lexique383 <- get_lexique383()
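
For Python users, the caching and checksum ideas behind the R fetcher can be approximated with the standard library alone. The sketch below is not part of openlexicon: the function names `fetch_cached` and `md5_of` are hypothetical, and the use of MD5 is an assumption (the actual checksum format used by the R fetcher is not specified here).

```python
import hashlib
import urllib.request
from pathlib import Path


def md5_of(path):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_cached(url, cache_dir, expected_md5=None):
    """Download url into cache_dir unless a valid local copy already exists.

    Hypothetical helper: MD5 is an assumed checksum type, not necessarily
    the one used by the openlexicon R fetcher.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    local = cache_dir / url.rsplit("/", 1)[-1]
    # Reuse the cached copy when present and, if a checksum is given, intact.
    if local.exists() and (expected_md5 is None or md5_of(local) == expected_md5):
        return local
    urllib.request.urlretrieve(url, local)
    # Reject (and remove) a download that does not match the expected checksum.
    if expected_md5 is not None and md5_of(local) != expected_md5:
        local.unlink()
        raise ValueError(f"checksum mismatch for {local.name}")
    return local
```

For instance, `fetch_cached('http://www.lexique.org/databases/Lexique383/Lexique383.tsv', 'cache')` would download the file on the first call and reuse the local copy afterwards. Unlike the R fetcher, this sketch does not track newer versions of a dataset.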

Many of these databases can also be explored or queried online at http://www.lexique.org/shiny/openlexicon, thanks to the shiny apps from openlexicon.
Most databases have associated publications listed in their respective README files. They should be cited in any derivative work!

French

Base	Description
Lexique3	Lexique3 is a French lexical database that provides, for ~140,000 French words: orthographic and phonemic representations, associated lemmas, syllabification, grammatical category, gender and number, frequencies in a corpus of books and in a corpus of film subtitles, etc.
Anagrammes	Anagrammes lists more than 25,000 sets of French anagrams.
Voisins	Voisins lists the orthographic neighbors obtained by substituting one letter, for 130,000 French words.
French Lexicon Project	The French Lexicon Project (FLP) was inspired by the English Lexicon Project (Balota et al. 2007). It provides visual lexical decision times for about 39000 French words and as many pseudowords. The full data represent 1942000 reaction times from 975 participants.
Megalex	Megalex provides visual and auditory lexical decision times and accuracy rates for several thousand words: visual lexical decision data are available for 28466 French words and the same number of pseudowords, and auditory lexical decision data are available for 17876 French words and the same number of pseudowords.
Chronolex	Chronolex provides naming times, lexical decision times and progressive demasking scores for most monosyllabic monomorphemic French words (about 1500 items). Thirty-seven participants were tested in the naming task, 35 additional participants in the lexical decision task and 33 additional participants in the progressive demasking task.
Brulex	Brulex gives, for about 36,000 French words, the spelling, pronunciation, grammatical class, gender, number and frequency of use. It also contains other information useful for selecting experimental material (notably uniqueness point, lexical neighbor counts, phonological patterns, mean digram frequency).
Gougenheim100	Gougenheim100 gives, for 1,064 words, their frequency and their dispersion (the number of texts in which they appear). It is based on a corpus of spoken language built from a set of interviews with 275 people, so it is a corpus not only of spoken language but also of produced language. The original corpus comprises 163 texts, 312,135 words and 7,995 distinct lemmas.
Chacqfam	CHACQFAM is a database giving the estimated age of acquisition and the familiarity of 1,225 French words.
Frantext	Frantext provides the list of all orthographic types obtained after tokenization of the Frantext subcorpus used to compute the “books” frequencies of Lexique.
francais-GUTenberg	List of 336,531 French words obtained from the Français-GUTenberg ispell dictionary.
Morphalou	Wide-coverage lexicon of modern French, comprising 159,271 lemmas and 976,570 inflected forms.
Morpholex-fr	Lexical database for ~38k French words with morphological variables.
Fr- Familiary660	Familiarity ratings for 660 words, estimated by young adults and older adults.
SemantiQc	These databases provide conceptual familiarity and auditory and visual perceptual strength ratings for 3,596 French words, collected from 304 French-speaking adults from Quebec.

English (American and British)

Base	Description
SUBTLEX-US	SUBTLEXus (Brysbaert, New & Keuleers, 2012) provides two frequency measures based on American movie subtitles (51 million words in total): (a) the frequency per million words, called SUBTLEXWF (word form frequency), and (b) the percentage of films in which a word occurs, called SUBTLEXCD (contextual diversity).
British Lexicon Project	The British Lexicon Project (Keuleers et al., 2012) contains lexical decision data for over 28,000 monosyllabic and disyllabic English words.
English Lexicon Project	The English Lexicon Project provides a standardized behavioral and descriptive data set for 40,481 words and 40,481 nonwords. Data from 816 participants across six universities were collected in a lexical decision task (approximately 3400 responses per participant), and data from 444 participants were collected in a speeded naming task (approximately 2500 responses per participant).
Morpholex-en	Lexical database for ~70k English words with morphological variables.
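
The two kinds of measure described for SUBTLEX-US (frequency per million words and percentage of films containing a word) can be illustrated on a toy corpus. The following is only a sketch of the arithmetic, not the procedure used to build SUBTLEX-US; the function name `subtitle_measures` and the example tokens are invented for illustration.

```python
from collections import Counter


def subtitle_measures(films):
    """Compute toy frequency measures from a list of token lists (one per film).

    Returns, for each word, its frequency per million tokens and the
    percentage of films in which it occurs.
    """
    total_tokens = sum(len(tokens) for tokens in films)
    # Total number of occurrences of each word across all films.
    counts = Counter(tok for tokens in films for tok in tokens)
    # Number of films in which each word occurs at least once.
    film_counts = Counter(tok for tokens in films for tok in set(tokens))
    return {
        word: {
            "per_million": counts[word] / total_tokens * 1_000_000,
            "pct_films": 100 * film_counts[word] / len(films),
        }
        for word in counts
    }


# Invented three-film corpus, for illustration only.
toy_films = [["the", "cat"], ["the", "dog"], ["a", "the"]]
measures = subtitle_measures(toy_films)
```

Here "the" accounts for 3 of the 6 tokens and appears in all 3 films, whereas "cat" appears once, in a single film.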

Chinese

Base	Description
SUBTLEX-CH	SUBTLEX-CH (Cai & Brysbaert 2010) is a database of Chinese word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words).

Multilingual

Base	Description
WorldLex	Worldlex provides word frequencies estimated from web pages collected in 66 languages.
AoA-32lang	AoA-32lang presents a set of subjective Age of Acquisition (AoA) ratings for 299 words (158 nouns, 141 verbs) in 32 languages.

Similar lists or resources

Marc Brysbaert’s web site at http://crr.ugent.be/programs-data
Meiryum Al’s Best 25 Datasets for Natural Language Processing

Contributing

If you want to contribute, check out the OpenLexicon project.

Time-stamp: <2019-05-01 11:24:52 christophe@pallier.org>