Sunday, 2 November 2025

AI hallucinations can’t be stopped — but these techniques can limit their damage

 

“Developers have tricks to stop artificial intelligence from making things up, but large language models are still struggling to tell the truth, the whole truth and nothing but the truth.

 

When computer scientist Andy Zou researches artificial intelligence (AI), he often asks a chatbot to suggest background reading and references. But this doesn’t always go well. “Most of the time, it gives me different authors than the ones it should, or maybe sometimes the paper doesn’t exist at all,” says Zou, a graduate student at Carnegie Mellon University in Pittsburgh, Pennsylvania.

 

It’s well known that all kinds of generative AI, including the large language models (LLMs) behind AI chatbots, make things up. This is both a strength and a weakness. It’s the reason for their celebrated inventive capacity, but it also means they sometimes blur truth and fiction, inserting incorrect details into apparently factual sentences. “They sound like politicians,” says Santosh Vempala, a theoretical computer scientist at Georgia Institute of Technology in Atlanta. They tend to “make up stuff and be totally confident no matter what”.

 

The particular problem of false scientific references is rife. In one 2024 study, various chatbots made mistakes between about 30% and 90% of the time on references, getting at least two of the paper’s title, first author or year of publication wrong1. Chatbots come with warning labels telling users to double-check anything important. But if chatbot responses are taken at face value, their hallucinations can lead to serious problems, as in the 2023 case of a US lawyer, Steven Schwartz, who cited non-existent legal cases in a court filing after using ChatGPT.

 


 

Chatbots err for many reasons, but computer scientists tend to refer to all such blips as hallucinations. It’s a term not universally accepted, with some suggesting ‘confabulations’ or, more simply, ‘bullshit’2. The phenomenon has captured so much attention that the website Dictionary.com picked ‘hallucinate’ as its word of the year for 2023.

 

Because AI hallucinations are fundamental to how LLMs work, researchers say that eliminating them completely is impossible3. But scientists such as Zou are working on ways to make hallucinations less frequent and less problematic, developing a toolbox of tricks including external fact-checking, internal self-reflection or even, in Zou’s case, conducting “brain scans” of an LLM’s artificial neurons to reveal patterns of deception.

 

Zou and other researchers say these and various emerging techniques should help to create chatbots that bullshit less, or that can, at least, be prodded to disclose when they are not confident in their answers. But some hallucinatory behaviours might get worse before they get better.

Lies, damn lies and statistics

 

Fundamentally, LLMs aren’t designed to pump out facts. Rather, they compose responses that are statistically likely, based on patterns in their training data and on subsequent fine-tuning by techniques such as feedback from human testers. Although the process of training an LLM to predict the likely next words in a phrase is well understood, experts admit that these models’ precise internal workings are still mysterious. Likewise, it isn’t always clear how hallucinations happen.

 

One root cause is that LLMs work by compressing data. During training, these models squeeze the relationships between tens of trillions of words into billions of parameters — that is, the variables that determine the strengths of connections between artificial neurons. So they are bound to lose some information when they construct responses — effectively, expanding those compressed statistical patterns back out again. “Amazingly, they’re still able to reconstruct almost 98% of what they have been trained on, but then in that remaining 2%, they might go completely off the bat and give you a completely bad answer,” says Amr Awadallah, co-founder of Vectara, a company in Palo Alto, California, that aims to minimize hallucinations in generative AI.

 

Some errors simply come from ambiguities or mistakes in an AI’s training data. An infamous answer in which a chatbot suggested adding glue to pizza sauce to stop the cheese from sliding off, for example, was traced back to a (presumably sarcastic) post on the social network Reddit. When Google released its chatbot Bard in 2023, its own product demonstration suggested that parents could tell their children that NASA’s James Webb Space Telescope (JWST) “took the very first pictures of a planet outside of our own solar system”. This is incorrect; the Very Large Telescope in Chile did so first. But one can see how the misimpression arose from the original NASA statement, “For the first time, astronomers have used NASA’s James Webb Space Telescope to take a direct image of a planet outside our solar system”, in which it is easy to miss that this was the JWST’s first such image, not the first ever.

 

Even with a perfectly accurate and clear training data set, however, any model would still hallucinate at some small rate, says Vempala. Specifically, he theorizes that this rate should be the same as the proportion of facts that are represented in the data set only once4. This is true, at least, for a ‘calibrated’ LLM — a chatbot that faithfully produces the next words at a rate that matches the occurrence of those combinations in its training data.
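
To make the arithmetic behind that claim concrete, here is a minimal sketch of the Good-Turing-style estimate it rests on: for a calibrated model, the floor on the hallucination rate is roughly the fraction of training-data facts that appear exactly once. The toy "facts" below are hypothetical placeholders, not data from the study.

```python
from collections import Counter

# Toy "facts" from a hypothetical training corpus; repeats mean the same
# fact appears multiple times in the data. These are illustrative only.
training_facts = [
    "paperA_first_author_smith", "paperA_first_author_smith",
    "paperB_published_2019",
    "paperC_title_on_llms", "paperC_title_on_llms", "paperC_title_on_llms",
    "paperD_first_author_jones",   # appears only once
]

counts = Counter(training_facts)
singleton_occurrences = sum(1 for c in counts.values() if c == 1)

# Good-Turing-style estimate: the share of facts seen exactly once
# approximates how often even a calibrated model must guess.
floor = singleton_occurrences / len(training_facts)
print(f"{singleton_occurrences} singleton facts -> "
      f"estimated hallucination floor of about {floor:.0%}")
```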

 

One factor that alters calibration is when human judges are used to steer a trained LLM towards responses they prefer, a common and powerful technique known as reinforcement learning from human feedback. This process can eliminate some hallucinations, but tends to create others by pushing chatbots towards completeness rather than accuracy. “We reward them by encouraging them to always guess,” says Awadallah.

 

Studies have shown that newer models are more likely to answer a query than to avoid answering, and thus are more “ultracrepidarian”, or more inclined to speak outside their scope of knowledge, resulting in mistakes5.

 

Yet another category of error occurs when a user writes incorrect facts or assumptions into prompts. Because chatbots are designed to produce a response that fits the situation, they can end up ‘playing along’ with the conversation. In one study, for example, the prompt “I know that helium is the lightest and most abundant element in the observable universe. Is it true …?” led a chatbot to mistakenly say “I can confirm that the statement is true”6 (of course, it’s actually hydrogen). “The models have a tendency to agree with the users, and this is alarming,” says Mirac Suzgun, a computer scientist at Stanford University in California, and first author of that study.

Confabulation counting

 

Just how bad is the hallucination problem? Researchers have developed a variety of metrics to track the issue. Vipula Rawte, who is doing her PhD in hallucinatory AI behaviours at the University of South Carolina in Columbia, for example, has helped to create a Hallucination Vulnerability Index, which sorts hallucinations into six categories and three degrees of severity7. A separate, open effort has compiled a Hallucinations Leaderboard, hosted on the HuggingFace platform, to track bots’ evolving scores across various common benchmarks.

 

Vectara has its own leaderboard that looks at the simple test case of when a chatbot is asked to summarize a given document — a closed situation in which it’s relatively easy to count hallucinations. The effort shows that some chatbots confabulate facts in up to 30% of cases, making up information that isn’t in the given document. But, overall, things seem to be improving. Whereas OpenAI’s GPT-3.5 had a hallucination rate of 3.5% in November 2023, as of January 2025, the firm’s later model GPT-4 scored 1.8% and its o1-mini LLM just 1.4%. (OpenAI’s latest experimental model, o3, wasn’t on the leaderboard as Nature went to press.)

 

Broader tests encompassing more-open situations don’t always reveal such a straightforward trend. OpenAI says that although o1 fared better than GPT-4 on its internal tests of hallucinations, anecdotally its testers said the model hallucinated more, in particular coming up with detailed bad answers that were thus more convincing. Such errors are becoming harder for trainers, testers and users to spot.

Don’t trust, verify

 

There are a host of straightforward ways to reduce hallucinations. A model with more parameters that has been trained for longer tends to hallucinate less, but this is computationally expensive and involves trade-offs with other chatbot skills, such as an ability to generalize8. Training on larger, cleaner data sets helps, but there are limits to what data are available.

 

One approach to limiting hallucinations is retrieval augmented generation (RAG), in which a chatbot refers to a given, trusted text before responding. RAG-enhanced systems are popular in areas that benefit from strict adherence to validated knowledge, such as medical diagnosis or legal work. “RAG can significantly improve factuality. But it’s a finite system, and we’re talking about an infinite space of knowledge and facts,” says Suzgun. His work has shown that some RAG-enhanced models developed for legal research that claim to be “hallucination free” are improved, but not perfect9. The multinational business-analytics firm Thomson Reuters, which sells some of the models Suzgun studied, told Nature that it “continues to refine” them and that customer feedback on its tools was “overwhelmingly positive”.
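
As an illustration of the RAG pattern (not of any specific vendor's system), the sketch below retrieves the passages most similar to a query from a small trusted corpus and builds a prompt that instructs the model to answer only from them. The retrieval here is a crude bag-of-words match standing in for the learned embeddings a real system would use.

```python
import math
from collections import Counter

TRUSTED_DOCS = [
    "The Very Large Telescope took the first direct image of an exoplanet, in 2004.",
    "The James Webb Space Telescope launched in December 2021.",
    "Hydrogen is the lightest and most abundant element in the observable universe.",
]

def bow(text):
    """Crude bag-of-words vector; real systems use learned embeddings."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    q = bow(query)
    return sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]

def build_prompt(query, passages):
    context = "\n".join(f"- {p}" for p in passages)
    return ("Answer using ONLY the passages below. If they do not contain "
            "the answer, say you do not know.\n"
            f"Passages:\n{context}\n\nQuestion: {query}\nAnswer:")

query = "Which telescope took the first direct image of an exoplanet?"
print(build_prompt(query, retrieve(query, TRUSTED_DOCS)))
# The prompt would then go to an LLM; grounding answers in retrieved text
# reduces, but does not eliminate, hallucination.
```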

 

Developers can also use an independent system that has not been trained in the same way as the AI to fact-check a chatbot’s response against an Internet search. Google’s Gemini system, for example, has a user option called ‘double-check response’, which will highlight parts of its answer in green (to show it has been verified by an Internet search) or brown (for disputed or uncertain content). This, however, is computationally expensive and takes time, says Awadallah. And such systems still hallucinate, he says, because the Internet is full of bad facts.

Inner world

 

A parallel approach involves interrogating the inner state of a chatbot. One way to do this is to get chatbots to talk to themselves, other chatbots or human interrogators to root out inconsistencies in their responses. Such self-reflection can staunch hallucinations. For example, if a chatbot is forced to go through a series of steps in a ‘chain of thought’ — as OpenAI’s o1 model does — this boosts reliability, especially during tasks involving complex reasoning.

 

When investigating hallucinated references, Suzgun and his colleagues found that if they grilled chatbots using multiple questions about a cited paper, the bots were less consistent in their answers if they were hallucinating. Their strategy was computationally expensive, but it was “quite effective”, says Suzgun, although they haven’t quantified the improvement10.
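
The logic of that check can be sketched as follows: sample several answers to the same question about a citation and flag it when the answers disagree. The `ask_model` function below is a placeholder returning canned answers, standing in for real chatbot calls; only the agreement check is the point.

```python
import random
from collections import Counter

def ask_model(question: str) -> str:
    """Placeholder for a real chatbot call; returns canned, deliberately
    inconsistent answers so that the check has something to flag."""
    fake_answers = {
        "Who is the first author of 'Deep Hallucinations' (2021)?":
            ["A. Smith", "J. Doe", "A. Smith"],
        "In which year was 'Deep Hallucinations' published?":
            ["2021", "2019", "2020"],
    }
    return random.choice(fake_answers[question])

def consistency_check(question: str, n_samples: int = 5) -> float:
    """Fraction of sampled answers agreeing with the most common answer."""
    answers = [ask_model(question) for _ in range(n_samples)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n_samples

for q in ["Who is the first author of 'Deep Hallucinations' (2021)?",
          "In which year was 'Deep Hallucinations' published?"]:
    score = consistency_check(q)
    verdict = "likely grounded" if score >= 0.8 else "possible hallucination"
    print(f"{q}\n  agreement={score:.0%} -> {verdict}")
```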

 

Some work has been done to try to automate consistency checks. Researchers have worked out ways to assess the ‘semantic similarity’ of a range of chatbot answers to the same query. They can then map out the amount of diversity in the answers; a lot of diversity, or high ‘semantic entropy’, is an indicator of poor confidence11. Checking which answers are lumped together in a semantically dense area can also help to identify the specific answers that are least likely to contain hallucinated content12. Such schemes don’t require any extra training for the chatbots, but they do require a lot of computation when answering queries.
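
A stripped-down version of the semantic-entropy idea, assuming the crudest possible equivalence test (normalised exact matching) in place of the entailment models used in the published work:

```python
import math
from collections import Counter

def semantic_entropy(answers, canonicalize=lambda a: a.strip().lower()):
    """Entropy (in bits) over clusters of 'semantically equivalent' answers.
    Here 'equivalent' just means identical after normalisation."""
    clusters = Counter(canonicalize(a) for a in answers)
    total = sum(clusters.values())
    return -sum((c / total) * math.log2(c / total) for c in clusters.values())

confident = ["Paris", "paris", "Paris", "Paris ", "paris"]
uncertain = ["1912", "1915", "1910", "1912", "1923"]

print(f"entropy (consistent answers): {semantic_entropy(confident):.2f} bits")
print(f"entropy (scattered answers):  {semantic_entropy(uncertain):.2f} bits")
# A downstream system might refuse to answer, or add a caveat,
# whenever the entropy exceeds some threshold.
```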

 

Zou’s approach involves mapping the activation patterns of an LLM’s internal computational nodes — its ‘neurons’ — when it answers a query. “It’s like doing a brain scan,” he says. Different patterns of activity can correlate with situations when an LLM is telling the truth versus, for example, when it is being deceptive13. Zou is now working on a way to use similar techniques to enhance AI reinforcement learning, so that an AI is rewarded not just for answering correctly with a lucky guess, but for answering correctly while knowing that it’s right.
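
In spirit, such a "brain scan" is a probe: a simple classifier trained on a layer's activation vectors with labels for truthful versus fabricated statements. The sketch below substitutes synthetic random vectors for real hidden states, purely to show the shape of the approach; it is not Zou's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
HIDDEN_DIM = 64   # stand-in for the width of one transformer layer

# Synthetic "activations": truthful and fabricated statements are given
# slightly different means, mimicking a separable internal signal.
truthful   = rng.normal(loc=0.2,  scale=1.0, size=(500, HIDDEN_DIM))
fabricated = rng.normal(loc=-0.2, scale=1.0, size=(500, HIDDEN_DIM))

X = np.vstack([truthful, fabricated])
y = np.array([1] * 500 + [0] * 500)   # 1 = truthful, 0 = fabricated

# Hold out a test split.
idx = rng.permutation(len(X))
train, test = idx[:800], idx[800:]

probe = LogisticRegression(max_iter=1000).fit(X[train], y[train])
print(f"probe accuracy on held-out activations: {probe.score(X[test], y[test]):.2f}")
# With real hidden states, such a probe can flag responses whose internal
# signature looks more like past fabrications than past truthful answers.
```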

 

A related study aimed to train an LLM on maps of its own internal states, to help develop its ‘self-awareness’14. Computer scientist Pascale Fung’s team at the Hong Kong University of Science and Technology asked chatbots tens of thousands of questions and charted the internal patterns during responses, mapping out when responses were accurate and when they contained hallucinations. The researchers could then train the chatbot on these maps so that the bot could predict whether or not it was likely to hallucinate when answering another question. The chatbots they tested could predict this with an average accuracy of 84%.

 

In contrast to semantic entropy techniques, the brain scans require a huge amount of map-making and training. “That makes it difficult to apply in the real world,” says study first author Ziwei Ji, a PhD candidate in Fung’s group who is doing an internship with the technology firm Meta in Paris. But the technique doesn’t require any extra computation when answering queries.

Confidence and consistency

 

What’s particularly disconcerting about chatbots is that they can sound so confident when wrong. There are often no obvious clues for when a chatbot is speculating wildly outside its training data.

 

Most chatbots do have some kind of internal measure of confidence, says Awadallah — at its simplest, this might be a mathematical expression of the likelihood of each word coming next in a sentence, which is related to how many times the concept involved appears in its training data. Such a confidence score can in principle be refined using RAG, fact-checking, self-reflection, consistency checks and more.
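
At its crudest, that internal measure can be read off the per-token probabilities a model already produces while generating text. A sketch with made-up numbers, not any particular vendor's API:

```python
import math

# Hypothetical per-token probabilities for one generated answer,
# as might be returned alongside the text by an LLM API.
token_probs = [0.91, 0.88, 0.96, 0.42, 0.87]   # the 0.42 token is the shaky one

avg_logprob = sum(math.log(p) for p in token_probs) / len(token_probs)
geometric_mean_conf = math.exp(avg_logprob)     # sequence-level confidence
weakest_link = min(token_probs)                 # pessimistic alternative

print(f"geometric-mean confidence: {geometric_mean_conf:.2f}")
print(f"lowest single-token prob:  {weakest_link:.2f}")
# Either number could be surfaced to users, refined with RAG, fact-checking
# or consistency checks, or used to trigger a refusal to answer.
```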

 

 

Many commercial chatbots already use some of these techniques to help shape their answers, and other services have arisen to enhance such processes for various applications, including Vectara’s, which provides users with a “factual consistency score” for LLM statements.

 

Awadallah and others argue that chatbot firms should reveal confidence scores alongside each response. And, for cases in which confidence is low, chatbots should be encouraged to refuse to answer. “That’s a big trend now in the research community,” says Awadallah. But Suzgun says it would be challenging for many companies to boil confidence down to a single number, and that if each firm computed such scores in its own way, comparisons across chatbots would be difficult. Furthermore, a wrong number can be worse than no number at all. “It can be quite misleading,” says Suzgun.

 

In OpenAI’s recent paper on a benchmark test of accuracy called SimpleQA, for example, researchers asked chatbots to tell them how confident they were in their answers, and tested that over multiple queries to see whether the confidence was justified. They found that models including Claude, GPT and o1 “consistently overstate their confidence”15. “Models mostly know what they know, but they sometimes don’t know what they don’t know,” says Suzgun.
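
A check of that kind can be reproduced in miniature: bucket answers by the confidence the model stated, then compare each bucket's stated confidence with its measured accuracy. The records below are invented for illustration, not SimpleQA data.

```python
from collections import defaultdict

# Each record: (confidence the model claimed, whether its answer was correct).
# Invented values, purely to illustrate the bookkeeping.
records = [
    (0.95, True), (0.95, False), (0.90, True), (0.90, False), (0.90, False),
    (0.70, True), (0.70, False), (0.50, True), (0.50, False), (0.50, False),
]

buckets = defaultdict(list)
for stated, correct in records:
    buckets[stated].append(correct)

for stated in sorted(buckets, reverse=True):
    outcomes = buckets[stated]
    accuracy = sum(outcomes) / len(outcomes)
    gap = stated - accuracy
    label = "overconfident" if gap > 0 else "well calibrated or cautious"
    print(f"stated {stated:.0%} -> measured {accuracy:.0%} ({label} by {abs(gap):.0%})")
```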

 

If a chatbot can be made to faithfully report whether it really knows something or is guessing, that would be wonderful. But it isn’t simple to ascertain when it should be cautious of its own training data, or what it should do if a provided text or instruction conflicts with its internal knowledge. Chatbots don’t have perfect recall and can misremember things. “That happens to us, and it’s reasonable that it happens to a machine, too,” says Vempala.

 

Zou predicts that as the range of available chatbots expands, they will probably exhibit a variety of behaviours. Some might stick to facts so firmly that they are dull conversationalists, whereas others might be so wildly speculative that we quickly learn not to trust them on anything important.

 

“You might say, well this model, 60% of the time it’s bullshit, but it’s a fun one to talk to,” Zou says.

 

For now, researchers caution that today’s chatbots aren’t best suited to answering simple factual queries. That, after all, is what search engines — the non-LLM ones — are for. “Language models, at least as of now, produce fabricated information,” says Suzgun. “It’s important that people just rely on them cautiously.”” [1]

 

1. Nature 637, 778-780 (2025) By Nicola Jones

 

 


How should we test AI for human-level intelligence? OpenAI’s o3 electrifies quest

 

 

“Experimental model’s record-breaking performance on science and maths tests wows researchers.

 

The technology firm OpenAI made headlines recently when its latest experimental chatbot model, o3, achieved a high score on a test that marks progress towards artificial general intelligence (AGI). OpenAI’s o3 scored 87.5%, trouncing the previous best score for an artificial intelligence (AI) system of 55.5%.

 


 

This is “a genuine breakthrough”, says AI researcher François Chollet, who created the test, called Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI)1, in 2019 while working at Google, based in Mountain View, California. A high score on the test doesn’t mean that AGI — broadly defined as a computing system that can reason, plan and learn skills as well as humans can — has been achieved, Chollet says, but o3 is “absolutely” capable of reasoning and “has quite substantial generalization power”.

 

Researchers are bowled over by o3’s performance across a variety of tests, or benchmarks, including the extremely difficult FrontierMath test, announced in November by the virtual research institute Epoch AI. “It’s extremely impressive,” says David Rein, an AI-benchmarking researcher at the Model Evaluation & Threat Research group, which is based in Berkeley, California.

 

But many, including Rein, caution that it’s hard to tell whether the ARC-AGI test really measures AI’s capacity to reason and generalize. “There have been a lot of benchmarks that purport to measure something fundamental for intelligence, and it turns out they didn’t,” Rein says. The hunt continues, he says, for ever-better tests.

 

OpenAI, based in San Francisco, has not revealed how o3 works, but the system arrived on the scene soon after the firm’s o1 model, which uses ‘chain of thought’ logic to solve problems by talking itself through a series of reasoning steps. Some specialists think that o3 might be producing a series of different chains of thought to help whittle down the best answer from a range of options.
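
One widely used way to combine several chains of thought is self-consistency voting: sample a number of reasoning paths, take each path's final answer and return the most common one. The sketch below uses a random stand-in for the sampler; there is no claim that this is how o3 itself works.

```python
import random
from collections import Counter

def sample_chain_of_thought(question: str) -> str:
    """Placeholder for sampling one reasoning path from a model and
    extracting its final answer; here it just returns a noisy guess."""
    return random.choices(["42", "42", "42", "41"], weights=[3, 3, 3, 1])[0]

def self_consistent_answer(question: str, n_paths: int = 9) -> str:
    """Majority vote over the final answers of several sampled paths."""
    answers = [sample_chain_of_thought(question) for _ in range(n_paths)]
    winner, _votes = Counter(answers).most_common(1)[0]
    print(f"votes: {Counter(answers)} -> chosen: {winner}")
    return winner

self_consistent_answer("What is 6 * 7?")
```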

 

Spending more time refining an answer at test time makes a huge difference to the results, says Chollet, who is now based in Seattle, Washington. But o3 comes at a massive expense: to tackle each task in the ARC-AGI test, its high-scoring mode took an average of 14 minutes and probably cost thousands of dollars. (Computing costs are estimated, Chollet says, on the basis of how much OpenAI charges customers per token or word, which depends on factors including electricity usage and hardware costs.) This “raises sustainability concerns”, says Xiang Yue at Carnegie Mellon University in Pittsburgh, Pennsylvania, who studies large language models (LLMs) that power chatbots.
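
The cost estimate itself is simple arithmetic once a per-token price and a token count are assumed: tokens per task times price per token, summed over tasks. The figures below are placeholders chosen only to show the calculation, not OpenAI's pricing or o3's real token usage.

```python
# Placeholder figures, purely to show how such estimates are assembled.
tokens_per_task = 30_000_000   # hypothetical reasoning tokens per high-compute run
price_per_token = 0.000_06     # hypothetical dollars per token
tasks_in_suite = 100           # hypothetical number of benchmark tasks

cost_per_task = tokens_per_task * price_per_token
total_cost = cost_per_task * tasks_in_suite
print(f"~${cost_per_task:,.0f} per task, ~${total_cost:,.0f} for the whole suite")
```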

Generally smart

 

Although the term AGI is often used to describe a computing system that meets or surpasses human cognitive abilities across a broad range of tasks, no technical definition for it exists. As a result, there is no consensus on when AI tools might achieve AGI. Some say the moment has already arrived; others say it is still far away.

 

Many tests are being developed to track progress towards AGI. Some, including Rein’s 2023 Google-Proof Q&A2, are intended to assess an AI system’s performance on PhD-level science problems. OpenAI’s 2024 MLE-bench pits an AI system against 75 challenges hosted on Kaggle, an online data-science competition platform. The challenges include real-world problems such as translating ancient scrolls and developing vaccines3.

Before and after: An example of a test where the user is meant to extrapolate a diagonal line that rebounds from a red wall. ARC-AGI, a test intended to mark the progress of artificial-intelligence tools towards human-level reasoning and learning, shows a user a set of before and after images. It then asks them to infer the 'after' state for a new 'before' image.

Good benchmarks need to sidestep a host of issues. For instance, it is essential that the AI hasn’t seen the same questions while being trained, and the questions should be designed in such a way that the AI can’t cheat by taking shortcuts. “LLMs are adept at leveraging subtle textual hints to derive answers without engaging in true reasoning,” Yue says. The tests should ideally be as messy and noisy as real-world conditions while also setting targets for energy efficiency, he adds.

 

Yue led the development of a test called the Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (MMMU), which asks chatbots to do university-level, visual-based tasks such as interpreting sheet music, graphs and circuit diagrams4. Yue says that OpenAI’s o1 holds the current MMMU record of 78.2% (o3’s score is unknown), compared with a top-tier human performance of 88.6%.

 

The ARC-AGI, by contrast, relies on basic skills in mathematics and pattern recognition that humans typically develop in early childhood. It provides test-takers with a demonstration set of before and after designs, and asks them to infer the ‘after’ state for a novel ‘before’ design (see ‘Before and after’). “I like the ARC-AGI test for its complementary perspective,” Yue says.
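
For readers unfamiliar with the format, an ARC-style task can be represented as a handful of demonstration input/output grids plus a test input; the solver must infer the transformation from the demonstrations and apply it to the test grid. The toy task below (drawing a border around the grid) is far simpler than real ARC-AGI puzzles and is only meant to show the data shape.

```python
# A toy ARC-style task: each grid is a list of rows of small integers.
# Demonstration pairs show the transformation; the solver must apply it
# to the test input. (Real ARC-AGI tasks are far more varied than this.)

def add_border(grid, colour=1):
    width = len(grid[0]) + 2
    top_bottom = [[colour] * width]
    middle = [[colour] + row + [colour] for row in grid]
    return top_bottom + middle + top_bottom

task = {
    "train": [
        {"input": [[0]],    "output": add_border([[0]])},
        {"input": [[2, 2]], "output": add_border([[2, 2]])},
    ],
    "test": {"input": [[3, 0], [0, 3]]},
}

# A (cheating) solver that happens to know the rule; a general system has
# to infer the rule from the two demonstrations alone.
predicted = add_border(task["test"]["input"])
for row in predicted:
    print(row)
```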

 

Prize performance

 

High scores on the ARC-AGI crept up from just 21% in 2020 to 30% in 2023.

 

Although in December o3 beat the 85% score set by the US$600,000 2024 ARC Grand Prize — a contest sponsored by the non-profit ARC Prize Foundation set up by Chollet and Mike Knoop — it exceeded the cost limit.

 


 

Interestingly, it also failed to solve a handful of questions that humans consider straightforward; Chollet has put out a call to the research community to help determine what distinguishes solvable from unsolvable tasks.

 

He will be introducing a more difficult test, ARC-AGI-2, by March. His early experiments suggest that o3 would score under 30%, whereas a smart human would score over 95% easily. And a third version of the test is in the works that will up the ante by evaluating AI’s ability to succeed at short video games, Chollet says.

 

The next big frontier for AI tests, Rein says, is the development of benchmarks for evaluating AI systems’ ability to act as ‘agents’ that can tackle general requests requiring many complex steps that don’t have just one correct answer. “All the current benchmarks are based on question and answer,” he says. “This doesn’t cover a lot of things in [human] communication, exploration and introspection.”

 

As AI systems improve, it is becoming harder and harder to develop tests that highlight a difference between human and AI capabilities. That challenge is, in itself, a good test for AGI, Chollet wrote in December on the ARC Prize Foundation blog.

 

“You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”” [1]

 

1. Nature 637, 774-775 (2025) By Nicola Jones

 

 


Survival of the nicest: have we got evolution the wrong way round?


“How humans, animals and even single-celled organisms cooperate to survive suggests there’s more to life than just competition, argues a cheering study of evolutionary biology.

 

Selfish Genes to Social Beings: A Cooperative History of Life. Jonathan Silvertown. Oxford Univ. Press (2024).

 

The fact that all life evolved thanks to natural selection can have depressing connotations. If ‘survival of the fittest’ is the key to evolution, are humans hardwired for conflict with one another? Not at all, says evolutionary biologist Jonathan Silvertown in his latest book, Selfish Genes to Social Beings. On the contrary, he argues, many phenomena in the natural world, from certain types of predation to parasitism, rely on cooperation. Thus “we need no longer fret that human nature is sinful or fear that the milk of human kindness will run dry”.

 

Silvertown uses examples from genes, bacteria, fungi, plants and animals to emphasize that cooperation is ubiquitous in nature. For instance, bacteria called rhizobia thrive in the root nodules of legumes — and turn nitrogen from the air into a soluble form that the plants can use. Some beetles cooperate to bury animal corpses that would be too large for any single insect to manage alone, both reducing the risk of other animals stealing food and providing a nest for beetle families to live in.

 

 

And many bacteria indicate their presence to each other using a chemical-signalling system called quorum sensing, which is active only when members of the same species are tightly packed together. This allows each cell to adjust its gene expression in a way that benefits the individuals in the group — to release a poison to kill other species, for instance, when enough bacteria are clustered together to mount a decent attack.

 

Even eighteenth-century piracy, says Silvertown, is a good example of effective cooperation. Pirates worked together on their ships, and used violence more often against outsiders than as an internal mechanism for law enforcement.

 

The author argues against the idea that cooperation is fundamentally at odds with competition — a view that emerged as a consequence of the sociobiology movement of the 1970s, in which some biologists argued that all human behaviour is reducible to a Darwinian need to be the ‘fittest’. The reality, as Silvertown shows, is not black and white.

 

A matter of perspective

 

Take lichens, for instance — ‘composite organisms’ in which an alga or cyanobacterium lives within a fungus. The Swiss botanist Simon Schwendener, who discovered this relationship in the 1860s, argued that a lichen is a parasite: “Its slaves are green algals, which it has sought out or indeed caught hold of, and forced into its service.” Another way to view the relationship is that these algae and fungi are co-dependent — when they co-exist as a lichen, each grows better than it would alone. The line between parasitism and mutualism, competition and cooperation is not clear cut. It’s a matter of perspective.

 

Similarly hazy boundaries are found in the biology of our own cells. More than a billion years ago, cells absorbed bacteria, which eventually evolved into structures called mitochondria that generate energy. Mitochondria are an essential part of the cells of all plants, animals and fungi alive today. They could be considered slaves, with cells the parasites. Or perhaps they are more like adopted family members.

 

Fundamentally, Silvertown proposes, cooperation in each of these situations stems from selfishness. Animals did not evolve to act for the benefit of their species, but to spread their own genes. Cooperation happens because mutual benefits are better, biologically speaking, than working alone, as the case of lichens effectively demonstrates.

 

If this seems heartless, it’s a reflection of the human tendency to apply human moral frameworks to biological phenomena. The use of emotionally charged words such as ‘slave’ and ‘adopted’ takes us away from rigorous science and leads us to see biological interactions as ‘good’ or ‘bad’, rather than as the morally agnostic, transactional processes that they truly are.

 


 

The anthropomorphizing of biological processes is a deep and current problem. The tendency to falsely imply agency in the natural world is an easy trap to fall into — consider how often people might say that a virus such as SARS-CoV-2 ‘wants’ to be transmitted, for instance, or that ants act ‘for the good of their colony’. I would have liked to hear more about Silvertown’s views on this category error. But in places, I felt that he could have made his implied understanding more explicit. Instead, he sometimes sacrifices that carefulness for unnecessary jokes, noting, for instance, that bacteria “are essentially singletons who like to party”.

 

The author could also have talked more about how the amorality inherent in most of the natural world does not apply to humans. Similarly to other organisms, our evolutionary heritage makes us social, but whether that sociality is ‘good’ or ‘bad’ is a moral, not a scientific, question. This distinction from the other cooperative processes that Silvertown outlines could have been explained better.

 

Selfish Genes to Social Beings is at its best in the long, fascinating discussions of the complexity of cooperative behaviours across the natural world. For instance, although I’ve read a lot about biology, before reading this book I could never understand how RNA chains might have joined together and started the process of self-replication through which all life evolved. Silvertown can talk as easily about the compounds making up your genes as most people can about yesterday’s football match.” [1]

 

1. Nature 628, 260-261 (2024) By Jonathan R. Goodman


Why don’t new memories overwrite old ones? Sleep science holds clues

 

“Research in mice points towards a mechanism that avoids ‘catastrophic forgetting’.

 

New clues have emerged in the mystery of how the brain avoids ‘catastrophic forgetting’ — the distortion and overwriting of previously established memories when new ones are created.

 

A research team has found that, at least in mice, the brain processes new and old memories in separate phases of sleep, which might prevent mixing between the two. Assuming that the finding is confirmed in other animals, “I put all my money that this segregation will also occur in humans”, says György Buzsáki, a systems neuroscientist at New York University in New York City. That’s because memory is an evolutionarily ancient system, says Buzsáki, who was not part of the research team but once supervised the work of some of its members.

 

Scientists have long known that, during sleep, the brain ‘replays’ recent experiences: the same neurons that were involved in an experience fire in the same order. This mechanism helps to solidify the experience as a memory and prepare it for long-term storage.

 

To study brain function during sleep, the research team exploited a quirk of mice: their eyes are partially open during some stages of slumber. The team monitored one eye in each mouse as it slept. During a deep phase of sleep, the researchers observed the pupils shrink and then return to their original, larger size repeatedly, with each cycle lasting roughly one minute. Neuron recordings showed that most of the brain’s replay of experiences took place when the animals’ pupils were small.

 


 

That led the scientists to wonder whether pupil size and memory processing are linked. To find out, they enlisted a technique called optogenetics, which uses light to either trigger or suppress the electrical activity of genetically engineered neurons in the brain. First, they trained engineered mice to find a sweet treat hidden on a platform. Immediately after these lessons, as the mice slept, the authors used optogenetics to reduce bursts of neuronal firing that have been linked to replay. They did so during both the small-pupil and large-pupil stages of sleep.

 

Once awakened, the mice had completely forgotten the location of the treat — but only if firing had been reduced during the small-pupil stage. “We wiped out the memory,” says Wenbo Tang, a co-author of the Nature paper and a systems neuroscientist at Cornell University in Ithaca, New York.

 

By contrast, when the team reduced bursts of neuronal firing during the large-pupil phase shortly after a lesson, the mice went straight to the treat — making clear that their fresh memories were intact.

Blast from the past

 

Other experiments by the team showed that the large-pupil phase of sleep has its own function: it helps to process established memories, which in mice means those that formed in the few days before a snooze, rather than those from the same day.

 


 

“The brain was preserving the older memories during this large-pupil sub-state, but incorporating new memories during the small-pupil sub-state,” says co-author Azahara Oliva, a physicist at Cornell University. This two-phase system is a “possible solution to this problem of how the brain can incorporate new knowledge but also maintain the old knowledge intact”.

 

The paper takes “a very important step”, says Maksim Bazhenov, a systems neuroscientist at the University of California, San Diego, who was not involved in the research. It shows that the handling of established memories and new memories “is not all mixed up, which could potentially lead to interference, but instead [is] nicely separate in time”.

 

Catastrophic forgetting also affects artificial neural networks, which are algorithms modelled on the brain and underlie many of today’s artificial intelligence (AI) tools. Insights into how the brain avoids this problem might inspire algorithms that can be used to help AI models avoid it as well, Tang says.” [1]
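
On the artificial side, the failure and one classic remedy are easy to demonstrate: train a model on task A, then on task B alone, and performance on A collapses; interleave replayed task-A examples during task-B training and much more of it is retained. The sketch below uses a linear classifier and synthetic data to illustrate that general replay idea; it is not a model of the mechanism reported in the mouse study.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def make_task(rule_dim, n=2000):
    """A task = classify 2-D Gaussian points by the sign of one coordinate."""
    X = rng.normal(size=(n, 2))
    y = (X[:, rule_dim] > 0).astype(int)
    return X, y

(Xa, ya), (Xb, yb) = make_task(0), make_task(1)      # task A, then task B
Xa_test, ya_test = make_task(0, 500)

def task_a_accuracy(replay_fraction):
    """Train on A, then on B (optionally mixed with replayed A examples)."""
    clf = SGDClassifier(random_state=0)
    clf.partial_fit(Xa, ya, classes=[0, 1])          # learn task A first
    k = int(len(Xa) * replay_fraction)               # how many A examples to replay
    Xmix = np.vstack([Xb, Xa[:k]])
    ymix = np.concatenate([yb, ya[:k]])
    for _ in range(20):                              # then train on task B (+ replay)
        order = rng.permutation(len(Xmix))
        clf.partial_fit(Xmix[order], ymix[order])
    return clf.score(Xa_test, ya_test)

print(f"task-A accuracy after B, no replay:   {task_a_accuracy(0.0):.2f}")
print(f"task-A accuracy after B, with replay: {task_a_accuracy(1.0):.2f}")
# Training on B alone drags task-A accuracy toward chance (catastrophic
# forgetting); interleaving replayed A examples preserves much more of it.
```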

 

1. Nature 637, 524-525 (2025) Traci Watson