Will's Blog: Why exams intended for humans might not be good benchmarks for LLMs like GPT-4

Title: Why exams intended for humans might not be good benchmarks for LLMs like GPT-4
Summary: Because of training data contamination and other factors, the success of LLMs like GPT-4 on human exams might not be a good measure of their abilities.
Link: Why exams intended for humans might not be good benchmarks for LLMs like GPT-4