Which Number Is Larger: 5.12 or 5.8?

We all saw this viral test of LLMs:

User: Which is larger, 9.11 or 9.8?
LLM: 9.11 is larger than 9.8.

And now this is making the rounds:

User: Which is larger, 5.12 or 5.8?
LLM: 5.8 is larger than 5.12.

While this appears to show improvement, it raises a question: is the 5.8 vs. 5.12 comparison being answered correctly for the right reason, or did the attention on the earlier 9.11 vs. 9.8 case result in additional training that handles this specific pattern?

It's easy to dismiss this as yet another instance of "LLMs are bad at math," but it's actually quite revealing about how LLMs work. They aren't doing math; they're predicting the next token (essentially the next word or part of a word) based on the context.

The result isn't a calculation; it's a prediction of which token is most likely to come next. And that prediction isn't based solely on the semantics of the words: it's based on the relationships between those tokens found in the training data.
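One way to peek at this is to look at what a tokenizer actually does to these numbers. Here's a minimal sketch using the tiktoken package and its cl100k_base encoding (used by several OpenAI models); the exact split varies by tokenizer, and the specific pieces shown are not assumed here. The point is simply that the model never sees "9.11" as a single numeric value, only as a short sequence of tokens.

```python
# Rough sketch: how a tokenizer splits these numbers (pip install tiktoken).
# The exact split depends on the tokenizer; the takeaway is that the model
# works with token sequences, not with numeric values.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models

for text in ["9.11", "9.8", "5.12", "5.8"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> token ids {token_ids} -> pieces {pieces}")
```

However the pieces fall, the "comparison" the model performs happens over learned relationships between those tokens, not over place values.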

9.11 has a specific meaning in a U.S. context: it's the date of a significant historical event, and it has two digits after the decimal point. 9.8 is typically just a number (with only one digit after the decimal point). Given that the training data likely contains more instances of 9.11 as a date than as a decimal number, you get competing influences on the output.
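By contrast, ordinary numeric comparison has none of this ambiguity. A quick check with Python's standard-library decimal module (just one illustrative way to do it) aligns place values, treating 9.8 as 9.80:

```python
# Plain decimal arithmetic compares values, not tokens:
# 9.8 is treated as 9.80, so it is unambiguously larger than 9.11.
from decimal import Decimal

print(Decimal("9.8") > Decimal("9.11"))   # True
print(Decimal("5.8") > Decimal("5.12"))   # True
```

Aligning place values is exactly the step that next-token prediction never performs.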

If this is all gibberish, just know: LLMs are fancy autocomplete. If you've ever had autocomplete on your phone produce a nonsensical sentence when you press the suggested word 20 times in a row, you've experienced a simpler version of the same limitation.