Language models need to be taught basic facts, and teaching them those facts while also testing them on the same facts necessarily produces semantic duplication between train and test sets.
Consider the following example:
Train set question: "What formula could one use to calculate the area of a circle?"
Test set question: "What is the formula for the area of a circle?"
These questions are semantically equivalent, but this is not a form of contamination. They could have been created totally independently and still end up semantically equivalent. This would show up as contamination under semantic similarity methods, like the one proposed by LMSys (https://lmsys.org/blog/2023-11-14-llm-decontaminator/). However, this form of "contamination" is not only harmless but beneficial.
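To make the point concrete, here's a minimal sketch of how a similarity-based decontaminator would flag the pair above. The LMSys method uses an LLM judge; this sketch substitutes token-overlap (Jaccard) similarity as a crude stand-in, and the 0.4 threshold is an arbitrary choice for illustration.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for a learned semantic similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

train_q = "What formula could one use to calculate the area of a circle?"
test_q = "What is the formula for the area of a circle?"

# A similarity-based decontaminator would flag this pair as "contaminated",
# even though both questions could have been written independently.
sim = jaccard_similarity(train_q, test_q)
print(sim, sim > 0.4)
```

The pair gets flagged, yet removing it from the train set would only make the model worse at a fact the test legitimately checks.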
There's no way to teach a model basic factual questions without semantic duplicates like this. That isn't a problem - we want the model to learn the exact facts, so we should expect duplicates. Nor does it invalidate test metrics - knowing the exact facts is precisely what we are testing.
There are cases where it does matter though:
- In a multiple-choice test, the full set of answer options should not be semantic duplicates between the train and test sets. LMSys lists an example of this in their blog post, for MMLU
- In math data, a question pair should not be both semantically duplicated and share identical numeric values. Some semantic duplicates are expected - two questions solvable in the same way - but the numeric values should differ between them.
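The math-data rule above can be sketched as a two-part check: flag a pair only when it is both a near-duplicate and reuses the same numbers. This is a hedged illustration, not a production decontaminator - the token-overlap similarity, the regex number extractor, and the 0.5 threshold are all simplifying assumptions.

```python
import re

def extract_numbers(q: str) -> tuple:
    """Pull out numeric values; a simplistic regex-based extractor for this sketch."""
    return tuple(re.findall(r"\d+(?:\.\d+)?", q))

def token_sim(a: str, b: str) -> float:
    """Token-overlap similarity as a stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def is_math_contamination(train_q: str, test_q: str, sim_threshold: float = 0.5) -> bool:
    # Flag only when the questions are near-duplicates AND reuse identical numbers.
    return (token_sim(train_q, test_q) >= sim_threshold
            and extract_numbers(train_q) == extract_numbers(test_q))

# Same template, different numbers -> benign semantic duplicate, not flagged
print(is_math_contamination("What is the area of a circle with radius 3?",
                            "What is the area of a circle with radius 7?"))  # False
# Same template, same numbers -> likely contamination, flagged
print(is_math_contamination("What is the area of a circle with radius 3?",
                            "What is the area of a circle with radius 3?"))  # True
```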
~ Jade