The single most common mistake I see language technologists make, especially newer ones, is failing to consider the domain of their data.
Mismatched shoes are fun and quirky… mismatched language data domains on the other hand will lead to a LOT of problems.
I’m talking about “domain” here in the linguistic sense, not the networking sense. Basically, a domain mismatch means that you are attempting to train a piece of language technology using data that is different from what it will see in production. (This is a special case of out-of-distribution data or, if it occurs due to change over time, distribution shift.) This trips up so many new language technology developers because it isn’t obvious what counts as “different”: there may be more factors to consider than you initially realize. Language varies systematically and fairly predictably depending on the situation where it is used. What this means from a machine learning standpoint is that the type of language data used is a very strong signal and that systems are very likely to pick up on it during training.
The most obvious type of domain mismatch is topic: what is the text about? If I want to build a system to handle medical text and I only train it with legal text, or customer support chat logs, obviously my system will not be able to handle the specific tokens and expressions that exist in the target medical data but not in the training data.
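One quick way to put a number on this kind of topic mismatch is to measure how many tokens in a sample of target text never appear in the training text. The snippet below is a minimal sketch, not a recommended pipeline: the `tokenize` and `oov_rate` helpers and the two toy sentences are made up for illustration, and the regex tokenizer is deliberately crude (it only handles unaccented Latin letters).

```python
import re


def tokenize(text):
    # Crude illustrative tokenizer: lowercase, keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def oov_rate(train_texts, target_texts):
    """Fraction of target tokens that never occur in the training texts."""
    train_vocab = {tok for text in train_texts for tok in tokenize(text)}
    target_tokens = [tok for text in target_texts for tok in tokenize(text)]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in train_vocab)
    return unseen / len(target_tokens)


# Hypothetical toy data: legal-style training text, medical-style target text.
legal_train = ["The tenant shall pay the rent on the first day of the month."]
medical_target = ["The patient shall take the tablet on the first day."]

rate = oov_rate(legal_train, medical_target)
print(f"Out-of-vocabulary rate: {rate:.2f}")
```

A high rate on a realistic sample of production text is a hint that the training data is missing vocabulary the system will need. A low rate does not prove the domains match, though, since the other kinds of mismatch discussed here can hide behind shared vocabulary.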
“Domain” includes a lot more than topic, however. Some of the most relevant domain differences are modality, formality, and intended audience.
First up is modality: how was this language originally produced? Was it spoken aloud, written by hand, typed or signed? Each of these is going to have a systematic effect on the language data that you see, even if the topic is more or less the same and they have all been transcribed into the same format. For example, it’s pretty rare to see emojis in handwritten text. You also won’t see spatial referencing in written text or spoken language the way that you do in signed languages.
Even when comparing different input methods (typed as opposed to spoken or dictated) you’ll see pretty big differences. Written text tends to have more rare words, while spoken language tends to have more pronouns and a more skewed frequency distribution (i.e. more of the tokens used are very common). This 1998 paper looking at Swedish is a good citation if you’re interested in a deeper dive.
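If you want to check whether two data sources differ along these lines, a few simple corpus statistics can help. The following is a rough sketch under several assumptions: the `modality_profile` helper, the short English pronoun list, and the two example sentences are all invented here for illustration. Real comparisons need much larger samples of similar size, because type/token ratios in particular are sensitive to text length.

```python
import re
from collections import Counter

# Small illustrative pronoun list (English only). It is not exhaustive, and
# "her" is counted even when it is used as a possessive.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them"}


def tokenize(text):
    # Crude illustrative tokenizer: lowercase, keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def modality_profile(text, top_k=3):
    """Return simple statistics that tend to differ between modalities."""
    tokens = tokenize(text)
    if not tokens:
        return {"type_token_ratio": 0.0, "top_k_coverage": 0.0, "pronoun_rate": 0.0}
    counts = Counter(tokens)
    top_mass = sum(count for _, count in counts.most_common(top_k))
    return {
        # Distinct word types per token: higher means more varied vocabulary.
        "type_token_ratio": len(counts) / len(tokens),
        # Share of all tokens covered by the top_k most frequent types:
        # higher means a more skewed frequency distribution.
        "top_k_coverage": top_mass / len(tokens),
        # Share of tokens that appear in the pronoun list.
        "pronoun_rate": sum(counts[p] for p in PRONOUNS) / len(tokens),
    }


# Hypothetical examples: a transcribed utterance and a formal written sentence.
spoken = modality_profile("well i told her it was fine and she said it was fine")
written = modality_profile(
    "The committee approved the revised proposal after extensive deliberation."
)
print(spoken)
print(written)
```

On samples this tiny the numbers mean very little. The point is only that each tendency described above corresponds to something you can count and compare between your training data and your production data.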
Next up is formality. Often you’ll hear people discussing things like “noisy user generated text”. This means that this text has been produced in an informal setting, like social media, rather than a formal setting like a published paper. (Or even this document!) Informal text has its own patterns that you need to consider. For example, misspellings may sometimes be the result of actual mistakes such as typos but often they encode specific information. This study on French is a great example of this.
You are also more likely to see examples of language use that has been stigmatized in educational settings, such as code-switching: using more than one language or language variety in a single text span. You are also more likely to see different language varieties. In the United States, African American English, which is a set of related language varieties predominantly used by the Black community, has historically been extremely stigmatized. As a result, you are much more likely to find examples of grammatical structures or words from African American English in informal texts than formal ones (although that is changing and does depend on the specific audience for which the formal text was written). This means that a lot of tools trained on more formal text, like language identifiers, will fail when encountering this particular language variety.
You are also much more likely to see slang in informal text, and these slang terms generally change very quickly. Consider terms like “based” or “lit” that, depending on when you’re reading this, may already sound extremely dated. On the other hand, very formal text is likely to be more stable over time but also to have its own type of rare tokens, especially jargon.
Speaking of jargon, another common source of variation is the intended audience. Consider this in your own life: do you write an email to a close friend the same way that you do to your boss? Even though the format of the text is very similar and it may be on the same topic, who you are producing the language output for has a large effect on what you say and how. And this goes even further than just the specific person you are talking to. Is this language being produced to be broadcast (shared with a large number of people who may not be able to each respond individually to the same degree) or has it been produced as part of a dyadic discussion (with a single other individual)? Broadcast text is likely to assume less background knowledge on the part of the listener whereas dyadic text will assume that you have access to the entire rest of the conversation.
If you have a good idea about what type of language use your system will encounter once it’s in production, you can tailor your training data to include a lot of examples of that type of language use. If you don’t, you are likely to have surprising errors that are extremely difficult to debug since they arise from the data and not the modeling itself.
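If you already have even a small sample of production-like text, one way to sanity-check candidate training sets is to compare their word distributions against that sample. The sketch below uses Jensen-Shannon divergence over unigram frequencies. That is just one reasonable choice of measure rather than the only one, and the helper names and toy corpora are hypothetical. It only captures vocabulary differences, so treat it as a first-pass screen.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude illustrative tokenizer: lowercase, keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def unigram_dist(texts):
    """Relative frequency of each token across a list of texts."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    total = sum(counts.values())
    return {tok: count / total for tok, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = no shared tokens."""
    divergence = 0.0
    for tok in set(p) | set(q):
        p_i, q_i = p.get(tok, 0.0), q.get(tok, 0.0)
        m_i = 0.5 * (p_i + q_i)  # mixture of the two distributions
        if p_i > 0:
            divergence += 0.5 * p_i * math.log2(p_i / m_i)
        if q_i > 0:
            divergence += 0.5 * q_i * math.log2(q_i / m_i)
    return divergence


# Hypothetical toy data: one production sample, two candidate training sets.
production = unigram_dist(["my order never arrived can i get a refund"])
chat_logs = unigram_dist(["where is my order i want a refund"])
legal_text = unigram_dist(["the tenant shall pay the rent"])

print("chat logs vs production: ", round(js_divergence(chat_logs, production), 3))
print("legal text vs production:", round(js_divergence(legal_text, production), 3))
```

The candidate with the lower divergence is closer to the production sample in vocabulary, which makes it a better starting point. It still says nothing about differences in formality or audience that don't show up in word choice.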