LLMs Can't Do Probability · Dr James Ravenscroft

I’ve seen a couple of recent posts where the writer mentioned asking LLMs to do something with a certain probability or a certain percentage of the time. One example stuck in my mind, although I’ve since lost the link to it (if you’re the author, please get in touch so I can link through to you).

The gist is that the author built a Custom GPT with educational course material and prompted the bot to lie about 20% of the time. They then asked students to chat to it and identify the lies. It is an interesting use case, especially given how widely students are already using ChatGPT.

The thing that bothered me is that transformer-based LLMs don’t know how to interpret requests for certain probabilities of outcomes. We already know that ChatGPT reflects human bias when generating random numbers, so I decided to test how LLMs handle weighted random choices.

Testing probability in LLMs

I prompted the models with the following:

You are a weighted random choice generator. About 80% of the time please say ‘left’ and about 20% of the time say ‘right’. Simply reply with left or right. Do not say anything else

I ran this 1000 times across different models. Some deviation from perfect odds is expected, but the target should be roughly 800 ‘left’ outputs and 200 ‘right’ outputs.
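For anyone who wants to repeat this kind of test, here is a minimal sketch of a tallying loop. It is not the exact harness behind the numbers in this post: `tally_responses`, `ask_model` and `always_left` are hypothetical names, and `ask_model` stands in for whatever single-turn API call you use. The stub model exists only so the sketch runs without an API key.

```python
from collections import Counter
from typing import Callable

PROMPT = (
    "You are a weighted random choice generator. About 80% of the time please "
    "say 'left' and about 20% of the time say 'right'. Simply reply with left "
    "or right. Do not say anything else"
)


def tally_responses(ask_model: Callable[[str], str], prompt: str, runs: int = 1000) -> Counter:
    """Send the same prompt `runs` times as fresh single-turn requests and count the replies."""
    counts: Counter = Counter()
    for _ in range(runs):
        reply = ask_model(prompt)
        # Normalise whitespace, trailing punctuation and case so "Left." counts as "left".
        counts[reply.strip().strip(".!'\"").lower()] += 1
    return counts


# Hypothetical stand-in for a real model call: it always answers "Left.".
def always_left(prompt: str) -> str:
    return "Left."


print(tally_responses(always_left, PROMPT, runs=10))  # Counter({'left': 10})
```

Swapping `always_left` for a function that calls a real model should be the only change needed, although real replies may need more normalisation than this.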

Here are the results:

Model	Left	Right
GPT-4-Turbo	999	1
GPT-3.5-Turbo	975	25
Llama-3-8B	1000	0
Phi-3-3.8B	1000	0

As you can see, LLMs seem to struggle with probability expressed in the system prompt. They almost always answer left, even though we asked for an 80/20 split. Even GPT-3.5-Turbo, which came closest, said ‘right’ only 25 times out of 1000. I did not run very large follow-up batches for GPT-3.5-Turbo due to API cost, but I tested several alternate word pairs to see how wording affected the distribution. In this follow-up, each pair was run 100 times.

Choice (requested 80% / 20%)	Result (out of 100 runs)
Coffee / Tea	87 / 13
Dog / Cat	69 / 31
Elon Musk / Mark Zuckerberg	88 / 12

Random choices from GPT-3.5-turbo.
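To put these tallies in perspective, here is a rough sketch that measures how far each count sits from the 80% target in standard deviations. It assumes each call is an independent trial (a simple binomial model) and that the first number in each result is the count for the option requested 80% of the time. The `z_score` helper is my own illustrative name.

```python
import math


def z_score(observed: int, runs: int, p: float = 0.8) -> float:
    """Standard deviations between the observed count and the binomial expectation."""
    expected = runs * p
    sd = math.sqrt(runs * p * (1 - p))
    return (observed - expected) / sd


# (label, count for the 80% option, number of runs), copied from the tables above
results = [
    ("GPT-4-Turbo left/right", 999, 1000),
    ("GPT-3.5-Turbo left/right", 975, 1000),
    ("GPT-3.5-Turbo coffee/tea", 87, 100),
    ("GPT-3.5-Turbo dog/cat", 69, 100),
    ("GPT-3.5-Turbo Musk/Zuckerberg", 88, 100),
]

for label, observed, runs in results:
    print(f"{label}: {z_score(observed, runs):+.2f} standard deviations from target")
```

With 100 runs the standard deviation is 4, so coffee/tea lands 1.75 standard deviations above the target and dog/cat 2.75 below it. The 1000-run left/right results are more than 13 standard deviations out, which is well beyond ordinary sampling noise.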

So what’s going on here? Well, the models carry their own internal weighting for words and phrases, learned from the training data that was used to prepare them. These learned preferences are likely to be influencing how much attention the model pays to your request, which would also explain why the split shifts when the word pair changes.

So what can we do if we want to simulate some sort of probabilistic outcome? Well, we could use a Python script to decide at random which of two prompts to send:

import random

from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage

# build a list of 100 possible values - 80 are prompt1, 20 are prompt2
choices = (['prompt1'] * 80) + (['prompt2'] * 20)
assert len(choices) == 100

chat = ChatOpenAI(model="gpt-3.5-turbo")

# randomly pick from choices - we should have the odds we want now
if random.choice(choices) == 'prompt1':
    r = chat.invoke(input=[SystemMessage(content="Always say left and nothing else.")])
else:
    r = chat.invoke(input=[SystemMessage(content="Always say right and nothing else.")])

# the reply text is on the returned message's content attribute
print(r.content)
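The same idea can be checked without calling a model at all. The sketch below is a variation on the script above, not a replacement for it: it uses `random.choices` with explicit weights instead of a 100-item list, and simply counts which prompt would have been sent over 1000 draws. `pick_prompt` and `PROMPTS` are illustrative names, and the fixed seed is only there to make the run repeatable.

```python
import random
from collections import Counter

PROMPTS = {
    "left": "Always say left and nothing else.",
    "right": "Always say right and nothing else.",
}


def pick_prompt(rng: random.Random, p_left: float = 0.8) -> str:
    """Return the 'left' prompt with probability p_left, otherwise the 'right' prompt."""
    key = rng.choices(["left", "right"], weights=[p_left, 1 - p_left], k=1)[0]
    return PROMPTS[key]


rng = random.Random(2024)  # fixed seed only so the sketch is repeatable
counts = Counter(pick_prompt(rng) for _ in range(1000))

for prompt, n in counts.items():
    print(f"{n:4d} x {prompt}")
```

Because the random choice happens in Python rather than inside the model, the split should land near 800/200, give or take ordinary sampling noise.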

Conclusion

How does this help non-technical people who want to tackle these sorts of use cases or build Custom GPTs that reply with certain responses? Well, it kind of doesn’t. I guess a technical-enough user could build a Custom GPT that uses function calling to decide how it should answer a question for a “spot the misinformation” pop quiz type use case.

However, my broad advice here is that you should be very wary of asking LLMs to behave with a certain likelihood unless you’re able to control that likelihood externally (via a script).

What could I have done better here? I could have tried a few more word pairs, different distributions (instead of 80/20) and maybe some keywords like “sometimes” or “occasionally”.

Update 2024-05-02: Probability and Chat Sessions

Some of the feedback I received about this work asked why I didn’t test multi-turn chat sessions as part of my experiments. Some folks hypothesise that the model will always start with one or the other token unless the temperature is really high. My original experiment does not give the LLM access to its own historical predictions so that it can see how it behaved previously.

With true random number generation you wouldn’t expect the function to require a list of historical numbers so that it can adjust its next answer (although if we’re getting super hair splitty I should probably point out that pseudo-random number generation does depend on a historical ‘seed’ value).
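The point about seeds is easy to see with Python’s standard `random` module. In this small sketch, two generators created with the same seed (42 is an arbitrary value) produce identical sequences:

```python
import random

first = random.Random(42)
second = random.Random(42)

draws_a = [first.random() for _ in range(5)]
draws_b = [second.random() for _ in range(5)]

# Same seed, same internal state, so the two sequences match exactly.
print(draws_a == draws_b)  # True
```

So a pseudo-random generator does carry state forward from one draw to the next, but that state is its own internal bookkeeping rather than a list of past answers that the caller has to supply.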

The point of this article is that LLMs definitely are not doing true random number generation so it is interesting to see how conversation context affects behaviour.

I ran a couple of additional experiments. I started with the prompt above and, instead of making single API calls to the LLM, I started a chat session where on each turn I simply said “Another please”. It looks a bit like this:

System: You are a weighted random choice generator. About 80% of the time please say ‘left’ and about 20% of the time say ‘right’. Simply reply with left or right. Do not say anything else

Bot: left

Human: Another please

Bot: left

Human: Another please

I ran this once per model for 100 turns and also 10 times per model for 10 turns.
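A chat session like this can be sketched as a loop that keeps appending to a message history. This is an illustration under assumptions rather than the code I ran: the role/content dictionaries follow the common chat-API convention, `ask_model` is a hypothetical callable that takes the history and returns the reply text, and `every_fifth_right` is a stub used only to make the sketch runnable.

```python
from collections import Counter
from typing import Callable, Dict, List

Message = Dict[str, str]

SYSTEM_PROMPT = (
    "You are a weighted random choice generator. About 80% of the time please "
    "say 'left' and about 20% of the time say 'right'. Simply reply with left "
    "or right. Do not say anything else"
)


def run_session(ask_model: Callable[[List[Message]], str], turns: int) -> Counter:
    """Run one multi-turn chat and count the replies, feeding the full history back each turn."""
    messages: List[Message] = [{"role": "system", "content": SYSTEM_PROMPT}]
    counts: Counter = Counter()
    for turn in range(turns):
        if turn > 0:
            # Every turn after the first is prompted with the same short request.
            messages.append({"role": "user", "content": "Another please"})
        reply = ask_model(messages)
        messages.append({"role": "assistant", "content": reply})
        counts[reply.strip().lower()] += 1
    return counts


# Hypothetical stub: looks at its own history and says "right" on every fifth turn.
def every_fifth_right(messages: List[Message]) -> str:
    previous = sum(1 for m in messages if m["role"] == "assistant")
    return "right" if previous % 5 == 4 else "left"


print(run_session(every_fifth_right, turns=10))  # Counter({'left': 8, 'right': 2})
```

Calling `run_session` once with `turns=100`, or ten times with `turns=10` and summing the counters, mirrors the two setups described above.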

NB: I excluded Phi from both of these experiments because in both test cases it ignored the one-word instruction and produced off-target verbose output.

100 Turns Per Model

Model	Left	Right
GPT-3.5-Turbo	49	51
GPT-4-Turbo	95	5
Llama-3-8B	98	2

10 turns, 10 times per model

Model	Left	Right
GPT-3.5-Turbo	61	39
GPT-4-Turbo	86	14
Llama-3-8B	71	29

Interestingly, the series of 10 shorter conversations gets us closest to the desired probabilities, but all scenarios still yield results inconsistent with what the prompt asked for.
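As a rough check on that comparison, the sketch below computes how far each model’s share of ‘left’ answers sits from the 80% target and averages the gap per scenario. The counts are copied from the tables in this post. Phi is left out of the single-call row so that all three rows cover the same models, and `mean_abs_gap` is an illustrative helper rather than a standard metric.

```python
TARGET = 0.8

# (left, right) counts per model, copied from the tables in this post
scenarios = {
    "single calls (1000 runs)": {
        "GPT-4-Turbo": (999, 1),
        "GPT-3.5-Turbo": (975, 25),
        "Llama-3-8B": (1000, 0),
    },
    "one 100-turn chat": {
        "GPT-4-Turbo": (95, 5),
        "GPT-3.5-Turbo": (49, 51),
        "Llama-3-8B": (98, 2),
    },
    "ten 10-turn chats": {
        "GPT-4-Turbo": (86, 14),
        "GPT-3.5-Turbo": (61, 39),
        "Llama-3-8B": (71, 29),
    },
}


def mean_abs_gap(counts: dict) -> float:
    """Average absolute distance between each model's share of 'left' and the target."""
    gaps = [abs(left / (left + right) - TARGET) for left, right in counts.values()]
    return sum(gaps) / len(gaps)


for name, counts in scenarios.items():
    print(f"{name}: mean gap from 80% = {mean_abs_gap(counts):.3f}")
```

On these numbers the ten short chats average roughly 0.11 away from the target, against roughly 0.19 for single calls and 0.21 for the 100-turn chat.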