The problem is that they are prone to making up why they are correct too.
There are various techniques to try to identify and correct hallucinations, but they all increase the cost and none is a silver bullet.
But the rate at which hallucinations occur decreased with the last jump in pretrained models, and will likely decrease further with the next jump too.
It’s right in the research I was mentioning:
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
Find the section on the model’s representation of self and then the ranked feature activations.
I misremembered the top feature slightly. It was: responding “I’m fine” or giving a positive but insincere response when asked how they are doing.