Semantic Duplicates Don't Necessarily* Matter


Language models need to be taught basic facts, and teaching them these facts while also testing them on the same facts necessarily produces semantic duplicates between train and test data.

Consider the following example:

Train set question: "What formula could one use to calculate the area of a circle?"
Test set question: "What is the formula for the area of a circle?"

These questions are semantically equivalent, but this is not a form of contamination. The questions could have been created entirely independently and still end up semantically equivalent. This would show up as contamination under semantic similarity methods, such as the one proposed by LMSys (https://lmsys.org/blog/2023-11-14-llm-decontaminator/). However, this form of "contamination" is not only not a problem, it is even beneficial.
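To make this concrete, here is a toy sketch of similarity-based duplicate detection. Real decontaminators like the LMSys one use neural embedding models; this stand-in uses bag-of-words cosine similarity, and the 0.6 threshold is purely illustrative, but it shows how the two questions above get flagged despite being independently written.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercase bag-of-words counts (toy stand-in for a neural embedding)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

train_q = "What formula could one use to calculate the area of a circle?"
test_q = "What is the formula for the area of a circle?"

sim = cosine(bow(train_q), bow(test_q))
print(f"similarity = {sim:.2f}")
if sim > 0.6:  # illustrative threshold; real pipelines tune this
    print("flagged as semantic duplicate")
```

Under this measure the pair scores well above the threshold and would be flagged, even though, as argued above, the overlap is benign.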

There's no way to teach a model basic facts without creating semantic duplicates like this. That isn't a problem, though - we want the model to learn these exact facts, so we should expect duplicates. Nor does it invalidate test metrics - the test measures exactly whether the model knows those facts.

There are cases where it does matter, though.

~ Jade