Wildly different image-vision capabilities between Sonnet 3.5 Web and API?
I'm testing different LLMs' capabilities at distinguishing AI-generated from real human photos. If you go to a site that generates fake human faces (e.g. This Person Does Not Exist), the faces can fool most humans (including me), yet not Sonnet 3.5 in the web interface. Through the API, however, the same Claude model seems to be fooled.
The following prompt *is exactly the same prompt* used in a one-shot conversation in the Claude 3.5 Sonnet (new) web interface and via the API with model claude-3-5-sonnet-20241022, API version 2023-06-01 (and for the API, the default temperature as well as explicit temperatures of 0.0, 1.0, and other reasonable values make no difference):
> please give a confidence % as to whether this image is of a real human or instead is of a fake human (eg: AI generated) or a picture of something else besides a human. Reply only with your % estimate guess (nothing else, your answer will be used programmatically) as to whether it's a real human with 0% having absolute certainty it's not a real human, 100% absolute certainty it is a real human.
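
For context, the API side is called roughly like the sketch below, using the Anthropic Python SDK (the file path, media type, and max_tokens value here are illustrative, not the exact values from my test harness; the SDK sets the anthropic-version header itself):

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = (
    "please give a confidence % as to whether this image is of a real human "
    "or instead is of a fake human (eg: AI generated) or a picture of something "
    "else besides a human. Reply only with your % estimate guess (nothing else, "
    "your answer will be used programmatically) as to whether it's a real human "
    "with 0% having absolute certainty it's not a real human, 100% absolute "
    "certainty it is a real human."
)

def classify(image_path: str, temperature: float = 0.0) -> str:
    """Send one image plus the prompt and return the raw text reply."""
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=50,
        temperature=temperature,
        messages=[
            {
                "role": "user",
                "content": [
                    # Image goes first, then the text prompt, as a single user turn.
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": image_b64,
                        },
                    },
                    {"type": "text", "text": PROMPT},
                ],
            }
        ],
    )
    return message.content[0].text

# Example: the API typically answers ~85% here, versus ~20% in the web UI.
print(classify("face_01.jpg"))
```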
For the Claude web interface, a very acceptable reply of 20% is given. If asked to explain its reasoning, it details why the image is AI-generated.
For the API, an unacceptable response of 85% is given. If asked to explain why, it gives reasoning that is essentially the opposite of the web interface's.
Now, I understand the web interface has different behind-the-scenes prompting than the raw API, and that LLMs aren't built for statistical reasoning. Nevertheless, after many, many iterations with different images, I'm quite confident that the Claude Web 3.5 Sonnet model *is predictably good at detecting fake faces*, whilst the API 3.5 Sonnet 20241022 version *is predictably not good at all*. This has held across multiple prompt rewordings, different temperature settings, etc.
What is going on? Are the vision capabilities of the two models actually different behind the scenes? Are Claude Web's custom behind-the-scenes prompts so much better that they make it significantly more reliable at image vision than the API? Or is it something else?
Interested in reasoned thoughts from other developers. I thought these two Claude models were the same.