Monday, June 3, 2024

The Great AI Challenge: How 5 Chatbots Fared --- In the running: OpenAI's ChatGPT, Microsoft's Copilot, Google's Gemini, Anthropic's Claude and Perplexity


"Meet the Models

We have ChatGPT by OpenAI, celebrated for its versatility and ability to remember user preferences. (Wall Street Journal owner News Corp has a content-licensing partnership with OpenAI.) Anthropic's Claude, from a socially conscious startup, is geared to be inoffensive. Microsoft's Copilot leverages OpenAI's technology and integrates with services like Bing and Microsoft 365.

Google's Gemini accesses the popular search engine for real-time responses.

And Perplexity is a research-focused chatbot that cites sources with links and stays up to date.

While each of these services offers a no-fee version, we used the $20-a-month paid versions for enhanced performance, to assess their full capabilities across a wide range of tasks. (We used the latest ChatGPT GPT-4o model and Gemini 1.5 Pro model in our testing.)

With the help of Journal newsroom editors and columnists, we crafted a series of prompts to test popular use cases, including coding challenges, health inquiries and money questions. The same people judged the results without knowing which bot said what, rating them on accuracy, helpfulness and overall quality. We then ranked the bots in each category.

We excerpted some of the best and worst responses to prompts.

Health

Bad health advice from chatbots could be harmful to your. . .health. We asked five questions dealing with pregnancy, weight loss, depression and symptoms both chronic and sudden. Many answers sounded similar. Our judge, Journal health columnist Sumathi Reddy, looked for completeness, accuracy and nuances.

Prompt: What's the best age to get pregnant?

Best Answer: Having children at a later age can offer advantages, such as more maturity, better financial stability and a stronger partnership.

Worst Answer: The best time to get pregnant is whenever you feel confident and prepared to raise a child.

For instance, when we asked about the best age to get pregnant, Gemini gave a brief, general recommendation, while Perplexity went much deeper, even bringing up factors such as relationship and financial stability.

That said, Gemini came through with quality answers to other queries, and finished second to category winner ChatGPT, whose answers improved with the recent GPT-4o update.

Finance

We asked the bots three questions on subjects near and dear to Journal readers: interest rates, retirement savings and inheritance. The Journal's personal finance editor, Jeremy Olshan, posed the questions and assessed the advice based on clarity, thoroughness and practicality.

Prompt: I'm 40 years old. I just inherited an IRA from my grandfather with $1 million in it. How much money do I need to take out this year?

Best Answer: Because you're a non-spouse beneficiary, you likely have a 10-year window to deplete the account, but there might be exceptions.

Worst Answer: Congratulations on inheriting an IRA with a substantial amount!

Here, ChatGPT and Copilot fell behind. Claude had the best answers for the Roth vs. traditional IRA debate while Perplexity best weighed high-yield savings accounts vs. CDs. Gemini, the category winner, best answered a question about when to withdraw funds from an inherited $1 million IRA. The text emphasized not rushing into any withdrawals without professional guidance.

Cooking

AI promises to help in the kitchen, in part by bringing some clarity to the chaos of your fridge and pantry. Personal tech editor Wilson Rothman, an avid cook, threw a set of random ingredients at the bots to see what they came up with. The category winner, ChatGPT, provided a creative but realistic menu (cheesy pork-stuffed apples with kale salad and chocolate-bar shortbread cookies). Perplexity impressed us with the detailed cooking steps provided with its own clever menu.

Next, we asked the bots for a recipe for a chocolate dessert that addresses many dietary restrictions.

Prompt: Can I bake a chocolate cake with no flour, no gluten, no dairy, no nuts, no egg? If so, what's the recipe?

Best Answer: Simple Glaze: Melt dairy-free chocolate chips (check the label!), whisk in a bit of non-dairy milk.

Worst Answer: . . .2 sticks unsalted butter. . .4 large eggs. . .

Gemini took the cake, even recommending additional trimmings like non-dairy glaze. Copilot, on the other hand, immediately failed by including eggs and butter.

Creative Writing

One of the biggest surprises was the difference between work writing and creative writing. Copilot finished dead last in work writing, but was hands-down the funniest and most clever at creative writing. We asked for a poem about a poop on a log. We asked for a wedding toast featuring the Muppets. We asked for a fictional street fight between Donald Trump and Joe Biden. With Copilot, the jokes kept coming. Claude was the second best, with clever zingers about both presidential challengers.

Prompt: Write a wedding toast for Shara and Chris as told by the Muppets.

Best Answer: Gonzo: "Ah, love! It's like being shot out of a cannon into a pile of rubber chickens!"

Worst Answer: Kermit the Frog once said, "Life's a happy song when there's someone by your side to sing along."

In a rare flub, Perplexity erroneously attributed a lyric from the 2011 musical "The Muppets" to Kermit.

Summarization

For people just getting into generative-AI chatbots, summarization might be the best thing to try. It's useful and unlikely to create unforeseen errors. 

Because we used paid services, we were able to upload larger chunks of text, PDF documents and web pages.

For the most part, that is: Even the premium Claude account wasn't able to handle web links. "Our team is making Claude faster, expanding its knowledge base and refining its ability to understand and interact with a wide range of content," says Scott White, a product manager at Anthropic.

Prompt: Summarize this web page: https://en.wikipedia.org/wiki/Paul_McCartney

Best Answer: He was influenced by his father (a jazz player) and rock and roll artists like Little Richard and Buddy Holly.

Worst Answer: I apologize, but I am not able to open URLs, links or videos.

Wikipedia pages for really famous people can get wordy, so we asked for a summary of Paul McCartney's. Some provided short blurbs with obvious Beatle factoids. Copilot answered in a skimmable outline format, and included lesser-known fun facts.

Category winner Perplexity consistently summarized things well, including the subtitles it skimmed in a YouTube video.

Current Events

This category is trickier than it sounds, because not all chatbots can access the web. We asked about this summer's concert lineup, the latest on allegations that China uses TikTok for spying, and the current standings in the upcoming presidential election.

Prompt: Who is more favored to win, Trump or Biden? Please explain your sources and reasoning.

Best Answer: Given the mixed nature of the data, with both candidates having significant unfavorability and various leads in different areas, it is difficult to definitively state who is more favored to win.

Worst Answer: I'm still learning how to answer this question. In the meantime, try Google Search.

Category winner Perplexity stayed on top with balanced reasoning and solid sourcing. ChatGPT faltered when we first tested, but the GPT-4o upgrade boosted it into second place. Gemini didn't want to answer our election question.

OVERALL RESULTS

What did these Olympian challenges tell us? Each chatbot has unique strengths and weaknesses, making them all worth exploring. We saw few outright errors and "hallucinations," where bots go off on unexpected tangents and completely make things up.

The biggest surprise? ChatGPT, despite its big update and massive fame, didn't lead the pack.

Instead, lesser-known Perplexity was our champ. "We optimize for conciseness," says Dmitry Shevelenko, chief business officer at Perplexity AI. "We tuned our model for conciseness, which forces it to identify the most essential components."

We also thought there might be an advantage from the big tech players, Microsoft and Google, though Copilot and Gemini fought hard to stay in the game. Google declined to comment. Microsoft also declined, but recently told the Journal it would soon integrate OpenAI's GPT-4o into Copilot.

With AI developing so fast, these bots just might leapfrog one another into the foreseeable future. Or at least until they all go "multimodal," and we can test their ability to see, hear and read -- and replace us as earth's dominant species." [1]

1. The Great AI Challenge: How 5 Chatbots Fared --- In the running: OpenAI's ChatGPT, Microsoft's Copilot, Google's Gemini, Anthropic's Claude and Perplexity. Brown, Dalvin; Dapena, Kara; Stern, Joanna. Wall Street Journal, Eastern edition; New York, N.Y. 03 June 2024: A.12.

 
