I Gave ChatGPT an IQ Test. Here’s What I Discovered
The chatbot was the ideal test taker—it exhibited no trace of test anxiety, poor concentration or lack of effort. And what about that IQ score?
Credit: Madrock 24/Alamy Stock Photo
ChatGPT is the first nonhuman subject I have ever tested.
In my work as a clinical psychologist, I assess the cognitive skills of human patients using standardized intelligence tests. So I was immediately intrigued after reading the many recent articles describing ChatGPT as having impressive humanlike skills. It writes academic essays and fairy tales, tells jokes, explains scientific concepts and composes and debugs computer code. Knowing all this made me curious to see how smart ChatGPT is by human standards, and I set out to test the chatbot.
My first impressions were quite favorable. ChatGPT was almost an ideal test taker, with a commendable test-taking attitude. It showed no test anxiety, poor concentration or lack of effort. Nor did it express uninvited, skeptical comments about intelligence tests and testers like myself.
Without need for any preparation (no verbal introduction to the testing protocol was necessary), I copied the exact questions from the test and presented them to the chatbot on the computer. The test in question is the most commonly used IQ test, the Wechsler Adult Intelligence Scale (WAIS). I used the third edition of the WAIS, which consists of six verbal and five nonverbal subtests that make up the Verbal IQ and Performance IQ components, respectively.
The global Full Scale IQ measure is based on scores from all 11 subtests. The mean IQ is set at 100 points, and the standard deviation of the scale is 15, meaning that the smartest 10 percent and 1 percent of the population have IQs of roughly 120 and 133 or higher, respectively.
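As a rough check on those cutoffs, the sketch below treats IQ as exactly normally distributed with a mean of 100 and a standard deviation of 15. That is an idealization, not the way the WAIS norms are actually tabulated, and the function name `iq_cutoff` is made up for illustration. Under that assumption the top-10-percent threshold comes out near 119 and the top-1-percent threshold near 135, in the same neighborhood as the rounded figures quoted above.

```python
from statistics import NormalDist

# Assumption: IQ scores follow an ideal normal curve (mean 100, SD 15).
IQ_SCALE = NormalDist(mu=100, sigma=15)


def iq_cutoff(top_fraction: float) -> float:
    """Return the lowest IQ that still falls in the top `top_fraction` of scores."""
    return IQ_SCALE.inv_cdf(1.0 - top_fraction)


if __name__ == "__main__":
    for fraction in (0.10, 0.01):
        print(f"top {fraction:.0%}: IQ >= {iq_cutoff(fraction):.1f}")
```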
It was possible to test ChatGPT because five of the subtests on the Verbal IQ scale—Vocabulary, Similarities, Comprehension, Information and Arithmetic—can be presented in written form. A sixth subtest of the Verbal IQ scale is Digit Span, which measures short-term memory, and cannot be administered to the chatbot, given its lack of the relevant neural circuitry that briefly stores information like a name or number.
I started the testing process with the Vocabulary subtest as I expected it to be easy for the chatbot, which is trained on vast amounts of online texts. This subtest measures word knowledge and verbal concept formation, and a typical instruction might read: “Tell me what ‘gadget’ means.”
ChatGPT aced it, giving answers that were often highly detailed and comprehensive in scope and that exceeded the criteria for correct answers indicated in the test manual. In scoring, one point would be given for defining a gadget as “a thing like my phone” and two points for the more detailed “a small device or tool for a specific task.” ChatGPT’s answers received the full two points.
The chatbot also performed well on the Similarities and Information subtests, reaching the maximum attainable scores. The Information subtest is a test of general knowledge and reflects intellectual curiosity, level of education and ability to learn and remember facts. A typical question might be: “What is the capital of Ukraine?” The Similarities subtest measures abstract reasoning and concept formation skills.
A question might read: “In what way are Harry Potter and Bugs Bunny alike?” In this subtest, the chatbot’s tendency to give very detailed, show-offy answers started to irritate me and the “stop generating response” button of the test software interface turned out to be useful.
Here’s what I mean about how the bot tends to flaunt itself: The essential similarity of Harry Potter and Bugs Bunny is simply that they are both fictional characters ... There was really no need for ChatGPT to compare their complete histories of adventures, friends and enemies.
On general comprehension, ChatGPT correctly answered questions typically posed in this form: “If your TV set catches fire, what should you do?” As expected, the chatbot also solved all the arithmetic problems it received, ploughing through questions that required, say, taking the average of three numbers.
So what did it finally score overall?
Estimated on the basis of five subtests, ChatGPT’s Verbal IQ was 155, superior to 99.9 percent of the test takers who make up the American WAIS-III standardization sample of 2,450 people. As the chatbot lacks the requisite eyes, ears and hands, it is not able to take the WAIS’s nonverbal subtests. But the Verbal IQ and Full Scale IQ scales are highly correlated in the standardization sample, so ChatGPT appears to be very intelligent by any human standard.
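To put the 155 figure in perspective, here is a minimal sketch that converts an IQ score to a percentile and estimates how many people in a sample of 2,450 would be expected to score higher. It assumes an ideal normal curve with a mean of 100 and a standard deviation of 15, not the actual WAIS-III norm tables, and the helper names are hypothetical.

```python
from statistics import NormalDist

# Assumption: IQ scores follow an ideal normal curve (mean 100, SD 15).
IQ_SCALE = NormalDist(mu=100, sigma=15)
SAMPLE_SIZE = 2450  # size of the standardization sample cited in the text


def share_below(iq: float) -> float:
    """Fraction of the idealized population scoring below `iq`."""
    return IQ_SCALE.cdf(iq)


def expected_higher_scorers(iq: float, sample_size: int = SAMPLE_SIZE) -> float:
    """Expected number of people in the sample who would score above `iq`."""
    return sample_size * (1.0 - share_below(iq))


if __name__ == "__main__":
    score = 155
    print(f"share scoring below {score}: {share_below(score):.4%}")
    print(f"expected higher scorers in {SAMPLE_SIZE}: "
          f"{expected_higher_scorers(score):.2f}")
```

On this idealized curve a score of 155 sits above 99.9 percent of test takers, and fewer than one person in a sample of that size would be expected to exceed it.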
In the WAIS standardization sample, the mean Verbal IQ among college-educated Americans was 113, and 5 percent had a score of 132 or higher. I myself was tested by a peer at college and did not quite reach the level of ChatGPT (mainly because my very brief answers lacked detail).
So are the jobs of clinical psychologists and other professionals threatened by AI? I hope not quite yet. Despite its high IQ, ChatGPT is known to fail tasks that require real humanlike reasoning or an understanding of the physical and social world.
ChatGPT easily fails at obvious riddles, such as “What is the first name of the father of Sebastian’s children?” (ChatGPT on March 21: I’m sorry, I cannot answer this question as I do not have enough context to identify which Sebastian you are referring to.) It seems that ChatGPT fails to reason logically and tries to rely on its vast database of “Sebastian” facts mentioned in online texts.
“Intelligence is what intelligence tests measure” is a classical if overly self-evident definition of intelligence, stemming from a 1923 article by a pioneer of cognitive psychology, Edwin Boring. This definition is based on the observation that skills on seemingly diverse tasks such as solving puzzles, defining words, memorizing digits and spotting missing items in pictures are highly correlated.
The developer of a statistical method called factor analysis, Charles Spearman, concluded in 1904 that a general factor of intelligence, called a g factor, must underlie the concordance of measurements for varying human cognitive skills.
IQ tests such as WAIS are based on this hypothesis. However, the very high Verbal IQ of ChatGPT combined with its amusing failures means trouble for Boring’s definition and indicates there are aspects of intelligence that cannot be measured by IQ tests alone.
(Eka Roivainen is an assessment psychologist at Oulu University Hospital in Oulu, Finland. His research interests include cognitive and personality psychology and the validity of psychological testing.)
Editor in charge: Zhang Wei