Pinned post
Anthropic details how it built its multi-agent Claude Research system, claiming significant improvements in internal evaluations over single-agent systems (Anthropic)
29 March 2023
Why exams intended for humans might not be good benchmarks for LLMs like GPT-4 - 2023-03-29 16:07:00Z
Summary: Training data contamination and other factors mean that success on human exams may not be a good measure of the abilities of LLMs like GPT-4.
Link: Why exams intended for humans might not be good benchmarks for LLMs like GPT-4