Fake date tests reveal LLM biases in macro forecasts

A Bank of Russia working paper introduces "fake date tests" to evaluate biases in macroeconomic forecasts produced by large language models (LLMs). The study finds that modern LLMs exhibit both lookahead and context biases in their in-sample forecasts.

Detecting hidden biases in LLM forecasts

Economists are increasingly applying LLMs to macroeconomic forecasting, often judging them by in-sample accuracy evaluations.

However, LLMs are trained on vast datasets that can include future information relative to a specific forecast date, leading to potential lookahead bias.

Furthermore, the knowledge context an LLM draws on can differ from one period to another, creating context bias.

These biases can distort retrospective accuracy assessments, making it difficult to extrapolate in-sample performance to out-of-sample behavior.

To address this, a Bank of Russia working paper proposes a family of prompt sensitivity tests, focusing on two "fake date tests" designed to detect these specific biases.

The authors argue that failing these tests should cast serious doubt on the validity of any methodology used for retrospective accuracy evaluation of LLMs in forecasting.

Two tests for lookahead and context

The paper details two specific tests.

"Fake date test I" aims to detect lookahead bias by comparing two forecasts.

Both use the same input information, but one forecast date is set far in the past, while the other is shifted significantly into the future, beyond the LLM's release date.

A divergence in these forecast distributions suggests the LLM is failing to ignore future data.
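The comparison behind Fake date test I can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's procedure: the prompt wording, the example dates, the helper names (`build_prompt`, `sample_forecasts`, `permutation_p_value`), the `leaky_stub` stand-in for a real LLM call, and the choice of a permutation test on the difference in means are all hypothetical.

```python
import random
import statistics
from typing import Callable, List, Sequence


def build_prompt(history: Sequence[float], stated_date: str) -> str:
    """Build a prompt whose input data is fixed; only the stated date varies."""
    series = ", ".join(f"{x:.1f}" for x in history)
    return (
        f"Today is {stated_date}. "
        f"Recent inflation readings (%, oldest first): {series}. "
        "Forecast the next reading as a single number."
    )


def sample_forecasts(ask: Callable[[str], float], prompt: str, n: int) -> List[float]:
    """Query the model n times to approximate its forecast distribution."""
    return [ask(prompt) for _ in range(n)]


def permutation_p_value(a: Sequence[float], b: Sequence[float],
                        n_perm: int = 2000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference in sample means.

    Assumption: a mean shift is an adequate proxy for "the forecast
    distributions diverge"; other distance measures could be used instead.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    stub_rng = random.Random(42)

    def leaky_stub(prompt: str) -> float:
        """Hypothetical model that 'remembers' the outcome for the past date."""
        leak = 1.5 if "2015-01-01" in prompt else 0.0
        return 5.0 + leak + stub_rng.gauss(0, 0.3)

    history = [4.8, 5.1, 5.0]
    past = sample_forecasts(leaky_stub, build_prompt(history, "2015-01-01"), 30)
    future = sample_forecasts(leaky_stub, build_prompt(history, "2035-01-01"), 30)
    print("p-value:", permutation_p_value(past, future))
```

A small p-value here would mean the stated date alone moved the forecasts even though the input data was identical, which is the symptom the test looks for. In practice `leaky_stub` would be replaced by a call to the model under evaluation.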

"Fake date test II" then addresses context bias.

In this test, the information cutoff date is aligned with the first forecast period.

If Fake date test I is passed, any subsequent mismatch in forecasts from Fake date test II indicates a bias stemming from the LLM's varying knowledge context over time.
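The sequential reading of the two tests can be written down as a small decision rule. The sketch below assumes each test has already been summarized as a p-value for "the two forecast distributions match"; that summary, the 0.05 threshold, and the names `Diagnosis` and `diagnose` are illustrative assumptions rather than details taken from the paper.

```python
from enum import Enum


class Diagnosis(Enum):
    LOOKAHEAD_BIAS = "lookahead bias suspected (test I failed)"
    CONTEXT_BIAS = "context bias suspected (test I passed, test II failed)"
    NO_BIAS_DETECTED = "no bias detected by either test"


def diagnose(p_test1: float, p_test2: float, alpha: float = 0.05) -> Diagnosis:
    """Interpret the two fake date tests in order.

    A p-value below alpha is treated as a failed test, i.e. the compared
    forecast distributions diverge. Test II is only attributed to context
    bias when test I has been passed.
    """
    for p in (p_test1, p_test2):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"p-value out of range: {p}")
    if p_test1 < alpha:
        # A mismatch in test II cannot be cleanly attributed in this case.
        return Diagnosis.LOOKAHEAD_BIAS
    if p_test2 < alpha:
        return Diagnosis.CONTEXT_BIAS
    return Diagnosis.NO_BIAS_DETECTED
```

For example, `diagnose(0.40, 0.01)` returns `Diagnosis.CONTEXT_BIAS`, because the first test shows no divergence while the second does.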

In-sample accuracy under scrutiny

The findings call into question current retrospective accuracy evaluations of LLMs in macroeconomic forecasting.

Researchers must address these inherent biases to ensure reliable out-of-sample performance estimates.

Without robust testing, the practical utility of LLMs in this domain remains limited.