In the course of my day-to-day work on Kaggle’s public data platform, I’ve learned a lot about the ecosystem of language data on the web (or at least the portions of it that have been annotated in English). For example, I’ve noticed a weird disconnect between European and American data repositories that I’m pretty sure has its roots in historical and disciplinary divisions.

I’ve also found a lot of great resources, though! At some point, I started keeping notes on interesting data repositories and link aggregators. I finally got around to tidying up and annotating my list of resources, and I figured that it would be a useful thing to share with everyone. So, without further ado, here’s an (incomplete) list of some places to find language resources on the web:
- META-SHARE
  - URL: http://www.meta-share.org/
  - META-SHARE has a lot of resources from The International Conference on Language Resources and Evaluation (LREC) on it.
- Trolling
- Linguistic Data Consortium (LDC)
  - URL: https://www.ldc.upenn.edu/
  - The Linguistic Data Consortium is an international non-profit that offers archival hosting of datasets. Their data is high quality and usually not free (although they offer data grants for students).
- Kaggle
  - URL: https://www.kaggle.com/datasets?search=corpus
  - Kaggle’s public data platform has a lot of language/NLP datasets available on it, many not in English. You can also do data analysis on Kaggle (with R or Python) without having to download anything or set up a local environment.
- European Language Resources Association
- Zenodo
  - URL: https://zenodo.org/
  - Hosted by CERN, Zenodo has datasets (including corpora) from a wide variety of disciplines.
- Document the Now
  - URL: http://www.docnow.io/catalog/
  - Contains lists of tweet IDs surrounding certain events. You’ll need to use the “rehydrator” to get the actual tweets.
- International Standard Language Resource Number
  - URL: http://www.islrn.org/resources/identify_name/ (a list of unique ID numbers associated with language resources)
  - Like a digital object identifier (DOI) for language resources. The search isn’t the best (it only looks at the title), but if you have a specific phrase you’re looking for, it can be a good way to discover new resources.
- Language & Culture Archives (SIL)
- Open Language Archives Community (OLAC)
- Freesound
- GitHub
  - URL: https://github.com/search?q=corpus
  - You can sometimes find interesting, high-quality language data on GitHub, but it’s not centralized and the quality varies widely.
- Re3data.org
- Language Gold Mine
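
The Kaggle and GitHub links in the list above are just search pages with "corpus" filled in as the query. If you'd rather search for something more specific, here's a minimal Python sketch that builds the same kind of URL for any term. The base URLs and the `search` / `q` parameter names are copied from the links above; the `search_url` helper and the `SEARCH_ENDPOINTS` table are hypothetical names made up for this illustration, and either site could change how it handles query strings.

```python
from urllib.parse import urlencode

# Base URLs and query parameter names, copied from the links in the list above.
# (Assumption: both sites keep accepting these query strings.)
SEARCH_ENDPOINTS = {
    "kaggle": ("https://www.kaggle.com/datasets", "search"),
    "github": ("https://github.com/search", "q"),
}


def search_url(site: str, term: str) -> str:
    """Build a search URL for one of the sites above (hypothetical helper)."""
    base, param = SEARCH_ENDPOINTS[site]
    # urlencode takes care of spaces and other special characters in the term.
    return f"{base}?{urlencode({param: term})}"


if __name__ == "__main__":
    for site in SEARCH_ENDPOINTS:
        print(search_url(site, "parallel corpus"))
```

Calling `search_url("kaggle", "corpus")` gives back the same Kaggle link that's in the list.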
Know of a resource I forgot to include? Link it in the comments!