September 2017

In the course of my day-to-day work on Kaggle’s public data platform, I’ve learned a lot about the ecosystem of language data on the web (or at least the portions of it that have been annotated in English). For example, I’ve noticed a weird disconnect between European and American data repositories that I’m pretty sure has its roots in historical and disciplinary divisions.

I’ve also found a lot of great resources, though! At some point, I started keeping notes on interesting data repositories and link aggregators. I finally got around to tidying up and annotating my list of resources, and I figured that it would be a useful thing to share with everyone. So, without further ado, here’s an (incomplete) list of some places to find language resources on the web:

META-SHARE
- URL: http://www.meta-share.org/
- META-SHARE has a lot of resources from the International Conference on Language Resources and Evaluation (LREC) on it.
Trolling
- URL: https://dataverse.no/dataverse/trolling
- This data collection mainly has datasets for replication of linguistics experiments.
Linguistic Data Consortium (LDC)
- URL: https://www.ldc.upenn.edu/
- The Linguistic Data Consortium is an international non-profit that offers archival hosting of datasets. The data they offer is high quality and usually not free (although they offer data grants for students).
Kaggle
- URL: https://www.kaggle.com/datasets?search=corpus
- Kaggle’s public data platform has a lot of language/NLP datasets available on it, many not in English. You can also do data analysis on Kaggle (with R or Python) without having to download anything or set up a local environment.
European Language Resources Association
- URL: http://catalog.elra.info/, http://universal.elra.info/
- Focus on European languages and language resources, but the universal catalog (second link) has a broader focus.
Zenodo
- URL: https://zenodo.org/
- Hosted by CERN, has datasets (including corpora) from a wide variety of disciplines.
Document the Now
- URL: http://www.docnow.io/catalog/
- Contains lists of tweet IDs surrounding certain events. You’ll need to use the “rehydrator” to get the actual tweets.
International Standard Language Resource Number
- URL: http://www.islrn.org/resources/identify_name/ (a list of unique ID numbers associated with language resources)
- Like a digital object identifier (DOI) for language resources. Not the best search (it only looks at the title), but if you have a specific phrase you’re looking for, it can be a good way to discover new resources.
Language & Culture Archives (SIL)
- URL: https://www.sil.org/resources/language-culture-archives
- Focus on ethnolinguistic minority communities, in many cases the only publicly available data for a given language.
Open Language Archives Community (OLAC)
- URL: http://www.language-archives.org/
- Includes a helpful metadata quality analysis for each onboarded dataset. (A higher score = more complete metadata)
Freesound
- URL: http://freesound.org/
- Freesound is a collaborative database of Creative Commons-licensed sounds. Note that some of the speech is synthetic. Helpful automatic annotation utilities can be found here: https://github.com/CrowdTruth/VU-Sound-Corpus/tree/v1.0
GitHub
- URL: https://github.com/search?q=corpus
- You can sometimes find interesting, high-quality language data on GitHub, but it’s not centralized and the quality varies widely.
Re3data.org
- URL: http://www.re3data.org/search?query=&subjects%5B%5D=104%20Linguistics
- A link aggregator. It has a lot of overlap with the other resources listed here but can be a good place to start looking.
Language Gold Mine
- URL: http://languagegoldmine.com/ (By Bodo Winter)
- Another collection of links, well-tagged by content type.

Know of a resource I forgot to include? Link it in the comments!
