AI's Hidden Truths Revealed
New method extracts factual knowledge from language models without any human supervision
[all text in this article was generated by Claude-2, using scicliffs, a 3-hour hackathon project I did with Robb, to test my new Ruby LLM library]
A New Method for Finding Truth in Language Models Without Supervision
Researchers at UC Berkeley and Peking University have developed an intriguing new method for discovering latent knowledge inside large language models, without needing any human supervision or ground truth data. Their work, published in a recent paper, offers a promising path for eliciting truthful information from AI systems even when we can't directly evaluate whether their outputs are correct.
The Problem: Language Models Can Lie
Language models like GPT-3 are trained to generate human-like text by predicting the next word in a sequence. This objective pushes them to imitate patterns in their training data rather than to ensure their outputs are actually truthful. As a result, they often make plausible-sounding but incorrect statements.
The standard solution is to continue training the models to optimize for accuracy on some labeled dataset where we know the ground truth answers. However, as models are deployed in more complex real-world settings, directly supervising them in this way will become intractable. We would prefer methods that can extract truthful knowledge from language models in a completely unsupervised manner.
A Clever Trick: Leveraging Logical Consistency
The researchers' key insight is that while language models may lie, the underlying truth must satisfy logical consistency properties that few other features of their learned representations satisfy. For example, consider a question and its negation:
"Is the sky blue?"
"Is the sky not blue?"
One of these must be true, and the other false. The researchers realized they could exploit this fact, despite not knowing which statement is actually true, to find latent directions in the model's representation space that correspond to truth values.
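To make the constraint explicit (this is my paraphrase of the setup, not the paper's exact notation): if a probe assigns probability $p(x^{+})$ to a statement being true and $p(x^{-})$ to its negation being true, then consistency requires

$$p(x^{+}) + p(x^{-}) = 1, \qquad \text{i.e.} \qquad p(x^{+}) = 1 - p(x^{-}).$$

CCS looks for a direction in the model's representation space whose induced probabilities come close to satisfying this equation, while avoiding the trivial solution of assigning 0.5 to everything.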
Concretely, their method, called Contrast-Consistent Search (CCS), works as follows:
Construct pairs of statements that are negations of each other (like the above)
Get the neural representations of each statement inside the language model
Learn a lightweight neural network on top that maps the representations to probabilities of being true
Optimize this mapping to be both confident and logically consistent across negations
Amazingly, by leveraging consistency alone, this simple approach allows CCS to recover factual knowledge without any supervision!
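To make this concrete, here is a minimal PyTorch sketch of the idea. It is not the authors' released code: the probe class `CCSProbe`, the helper `ccs_loss`, the hidden size, and the random placeholder features are all illustrative assumptions; only the two loss terms (consistency between a statement and its negation, plus a confidence term that rules out the degenerate always-0.5 answer) follow the recipe described above.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Lightweight probe: maps a hidden state to a probability of being true."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h):
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def ccs_loss(p_pos, p_neg):
    # Consistency: a statement and its negation should get complementary probabilities.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.min(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# h_pos / h_neg: hidden states of each statement and its negation, taken from a
# frozen language model. Random placeholders here so the sketch runs end to end.
hidden_dim = 768                      # assumed; depends on the model
h_pos = torch.randn(128, hidden_dim)
h_neg = torch.randn(128, hidden_dim)

probe = CCSProbe(hidden_dim)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    optimizer.zero_grad()
    loss = ccs_loss(probe(h_pos), probe(h_neg))
    loss.backward()
    optimizer.step()

# At test time, average the two views of each pair into a single truth score:
# score = 0.5 * (probe(h_pos) + (1 - probe(h_neg)))
```

The paper also normalizes the hidden states before fitting the probe, among other practical details this sketch omits.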
Impressive Results on Diverse QA Tasks
The researchers thoroughly evaluated CCS on several language models across 10 question answering datasets spanning different domains. Some key findings:
CCS improves accuracy by 4% on average compared to strong zero-shot baselines
It cuts prompt sensitivity in half, giving much more robust predictions
When models are deliberately misled to output false answers, CCS accuracy remains high while zero-shot drops significantly
The method transfers well across completely different tasks and label spaces, suggesting it recovers task-agnostic knowledge
These impressive results suggest that unsupervised elicitation of knowledge from language models is tractable. While not perfect, CCS demonstrates that we can find latent truth in AI systems distinct from what they say, an exciting step toward models that are not just plausible but systematically truthful.
Image prompts (TODO):
Golden retriever sitting attentively on grass
Person pointing at blackboard of math equations
AI robot holding chalk in classroom
Errata:
Based on the paper, here are some potential typos, inaccuracies, dubious claims, and other issues:
On page 2, "make;" should be "make."
On page 2, "outputs errors" should be "outputs contain errors"
The claim on page 3 that "CCS can accurately recover knowledge from model representations" seems potentially dubious, as CCS is not actually recovering true knowledge but rather just finding directions that happen to correlate with the labels in the datasets.
On page 4, "a statement and its negation have opposite truth values" - this may not always be the case if the statement is vague or ill-defined.
On page 5, the authors claim that CCS "can still work well even when model outputs are unreliable." However, the misleading prefix they use may not consistently degrade model outputs, so more evidence is needed to support this claim.
On page 7, the authors claim "CCS can indeed still perform well" on a masked language model, but a single accuracy figure of 93.7% is hard to interpret without the corresponding baselines for that task. More analysis may be needed to determine whether this result is meaningful.
The statistical significance analysis on page 21 relies on some assumptions like independent samples that may not hold given the prompts are related. The analysis could be improved.
In general, some of the claims about CCS discovering "truth" seem overstated given this is just a method for binary classification on specific datasets. The language could be toned down or qualified further.
There are some minor typos throughout, like inconsistent formatting of dataset names.
The writing could be tightened up and made more precise in some areas.
So in summary, the main issues are potential overclaims regarding CCS recovering "truth", the need for more rigorous analysis in some areas, and some minor typos/writing refinements. But overall it seems like a solid paper with an interesting idea.