Feed aggregator

Chicago Kare by Duane King

Hacker News - Wed, 11/20/2024 - 5:32am

Article URL: https://chicagokare.xyz/

Comments URL: https://news.ycombinator.com/item?id=42192568

Points: 1

# Comments: 0

Categories: Hacker News

Against Tricky Questions for LLMs: A Case for Simple and Transparent Benchmarks

Hacker News - Wed, 11/20/2024 - 5:31am

Assessing the reasoning capabilities of large language models (LLMs) poses a significant challenge, particularly in distinguishing reasoning from memorization.

For instance, when an LLM answers "2 + 2 = 4," it relies on training data repetition rather than an understanding of arithmetic. This behavior parallels Daniel Kahneman’s "System 1" thinking—fast and reflexive.

Yet, with more complex tasks, such as adding large numbers or solving multi-step puzzles, LLMs typically fail unless they can access external tools.

This inability to shift to "System 2" thinking—slow, deliberate reasoning—remains a fundamental limitation.

Vendors have addressed this by integrating tools like calculators, a useful addition that works around the inability of LLMs to reason.

But how can progress be accurately measured if simple reasoning tasks are replaced with tools?

## Tricky Questions: A Flawed Metric

To overcome this challenge, researchers have crafted "tricky" questions designed to test reasoning, such as:

> "You have 3 apples, and I give you 2 more—but one is much smaller. How many apples do you have?"

An LLM might misinterpret the detail about size as a cue to exclude the smaller apple. While such tests highlight weaknesses, they mainly probe linguistic ambiguity rather than reasoning. Moreover, as vendors train models to handle these patterns, the tests lose diagnostic value.

Instead, we propose focusing on straightforward tasks that require deliberate reasoning and cannot be solved through pattern recognition alone.

## A Reasoning Benchmark Framework

*Effective evaluation demands benchmarks that are clear, simple, and tool-free*.

We propose the following milestones:

1. *Basic Arithmetic Competence*: A reasoning model should reliably compute sums, products, or powers for large numbers without external tools.

2. *Execution of Simple Algorithms*: The model should be able to perform basic algorithmic tasks, such as sorting a list, computing a factorial, or simulating a logical circuit without external tools.

3. *Structured Puzzles*: The model should solve tasks like sudoku or nonograms without external tools.

4. *Strategic Gameplay*: The model should play games such as tic-tac-toe, checkers, or chess without external tools.

5. *Novel Problem Solving*: Finally, a capable reasoning system should propose original solutions to well-defined mathematical or logical problems. Generating new proofs or contributing insights to unsolved problems would demonstrate a high degree of reasoning aptitude.
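The first two milestones lend themselves to automatic grading, because the ground truth can be computed locally while the model answers without tools. The following is a minimal sketch under that assumption; the names `Item`, `make_arithmetic_item`, `make_sorting_item`, and `score` are hypothetical and not part of any existing benchmark, and the prompt wording and exact-match scoring rule are illustrative choices only.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    prompt: str  # question shown to the model
    answer: str  # ground truth computed locally, never by the model


def make_arithmetic_item(rng: random.Random, digits: int = 20) -> Item:
    """Milestone 1: sum or product of two large integers (hypothetical format)."""
    low, high = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(low, high), rng.randint(low, high)
    op = rng.choice(["+", "*"])
    truth = a + b if op == "+" else a * b
    return Item(f"Compute {a} {op} {b}. Reply with the number only.", str(truth))


def make_sorting_item(rng: random.Random, length: int = 15) -> Item:
    """Milestone 2: sort a short list of integers in ascending order."""
    values = [rng.randint(0, 999) for _ in range(length)]
    shown = " ".join(str(v) for v in values)
    truth = " ".join(str(v) for v in sorted(values))
    return Item(f"Sort ascending, space-separated: {shown}", truth)


def score(items: list[Item], replies: list[str]) -> float:
    """Fraction of replies that match the ground truth exactly,
    after collapsing runs of whitespace."""
    if not items:
        return 0.0
    correct = sum(
        item.answer == " ".join(reply.split())
        for item, reply in zip(items, replies)
    )
    return correct / len(items)
```

Because each item is generated from a seeded random source, fresh items can be produced for every evaluation run, which makes it harder for a model to pass by having memorized specific questions.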

These benchmarks establish a baseline for reasoning but do not imply artificial general intelligence (AGI).

At the same time, we can use these benchmarks to reject claims that LLMs are somehow "close" to AGI.

## External Tools and Transparency

Proprietary LLMs often integrate tools to enhance performance, but this prevents evaluation of the models' own reasoning.

To ensure fair assessment, vendors should provide a way to disable tools during evaluations.

## Simplicity as a Strength

Critics may argue that simple benchmarks fail to capture real-world complexity. Yet, as the arithmetic example shows, simplicity can illuminate reasoning processes without sacrificing rigor.

Straightforward tasks like multi-step computations and logical puzzles reveal essential reasoning skills without relying on tricky or convoluted questions.
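Logical puzzles have the further advantage that a proposed solution is cheap to verify, even when finding it requires many deliberate steps. As an illustration, here is a small sketch of a sudoku checker; the function name `is_valid_sudoku_solution` and the grid encoding (9x9 lists of integers, with 0 marking an empty cell in the puzzle) are assumptions made for this example.

```python
Grid = list[list[int]]


def is_valid_sudoku_solution(puzzle: Grid, solution: Grid) -> bool:
    """Return True if `solution` is a complete, valid fill of `puzzle`.

    Both grids are assumed to be 9x9 lists of ints; 0 in `puzzle`
    marks a cell the model had to fill in.
    """
    # Reject replies that are not a 9x9 grid.
    if len(solution) != 9 or any(len(row) != 9 for row in solution):
        return False

    # The model must not alter the given clues.
    for r in range(9):
        for c in range(9):
            if puzzle[r][c] != 0 and puzzle[r][c] != solution[r][c]:
                return False

    # Every row, column, and 3x3 box must contain exactly the digits 1..9.
    digits = set(range(1, 10))
    rows = [set(row) for row in solution]
    cols = [{solution[r][c] for r in range(9)} for c in range(9)]
    boxes = [
        {solution[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(group == digits for group in rows + cols + boxes)
```

A checker like this grades only the final grid, so it says nothing about how the model arrived at its answer; it simply makes the pass/fail criterion transparent.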

## Conclusion

Evaluating reasoning in LLMs does not require convoluted tests. Transparent, tool-free benchmarks grounded in deliberate problem-solving provide a clearer measure of progress. By focusing on tasks that demand "System 2" thinking, we can set meaningful milestones for development.

No LLM should be deemed closer to AGI if it cannot solve simple reasoning problems independently. Transparency and simplicity are essential for advancing our understanding of these systems and their potential.

Comments URL: https://news.ycombinator.com/item?id=42192562

Points: 2

# Comments: 0

Categories: Hacker News

Earn More Than Twice the National Average With These Top Accounts. Today's CD Rates, Nov. 20, 2024

CNET Feed - Wed, 11/20/2024 - 5:30am
The clock is ticking on APYs as high as 4.75%.
Categories: CNET

BasedFlare – Sovereign DDoS Protection

Hacker News - Wed, 11/20/2024 - 5:21am

Article URL: https://basedflare.com/#

Comments URL: https://news.ycombinator.com/item?id=42192484

Points: 1

# Comments: 0

Categories: Hacker News

High Savings APYs Won't Stick Around Long -- Don't Delay to Get a Good Rate. Today's Rates, Nov. 20, 2024

CNET Feed - Wed, 11/20/2024 - 5:00am
Stashing money in a high-yield savings account can help you grow your nest egg. Rates may be falling soon.
Categories: CNET

The 2024 Roku Ultra 4K Streaming Device Is $1 Off Its Best-Ever Price Ahead of Black Friday

CNET Feed - Wed, 11/20/2024 - 4:51am
Early Black Friday sales for Roku are taking off with Amazon offering a 20% discount on this recently-released streamer.
Categories: CNET

Attention Gamers: The Lenovo Legion Go Is 22% Off, Even Before Black Friday

CNET Feed - Wed, 11/20/2024 - 4:39am
If you've been wanting to experience handheld PC gaming at its best, the 1TB Legion Go is now 22% off as an early Black Friday deal.
Categories: CNET

AI Granny Daisy takes up scammers’ time so they can’t bother you

Malware Bytes Security - Wed, 11/20/2024 - 4:31am

A mobile network operator has called in the help of Artificial Intelligence (AI) in the battle against phone scammers.

Virgin Media O2 in the UK has built an AI persona called Daisy with the sole purpose of keeping scammers occupied for as long as possible. Basically, until the scammers give up, because Daisy won’t.

Daisy uses several AI models that work together, listening to what scammers have to say and then responding in a lifelike manner to give the scammers the impression they are working on an “easy” target. Playing on the scammers’ biases about older people, Daisy usually acts as a chatty granny.

According to Virgin Media O2’s press release, Daisy has successfully kept numerous fraudsters on calls for 40 minutes at a time. To achieve this, “Granny Daisy” will tell the scammers all about her passion for knitting and her cat Fluffy, and provide exasperated callers with false personal information, including made-up bank details.

The idea behind Daisy is two-fold. Not only does it waste the scammers’ time—time they could have spent defrauding real people—but it also raises awareness, through posts such as this one, that the person you are talking to on the phone could be very different from what you imagine.

Raising awareness about how AI can be used to deceive people is necessary: to warn others, we’ve previously reported on how scammers have used AI to fake the voices of loved ones in an “I’ve been in an accident” scam.

Virgin Media O2’s research found that 67% of Brits are concerned about being the target of fraud and 22% experience a fraud attempt every single week. The Federal Trade Commission (FTC) received fraud reports from 2.6 million consumers in 2023, with imposter scams the most commonly reported fraud category.

The criminals often pretend to work for your bank or a delivery company that needs a payment before they can deliver a package, with the end goal of the victim disclosing their banking details.

It’s too bad that Daisy can’t intercept the calls from the scammers. For now, the scammers will have to call one of the phone numbers that Daisy answers, which have cleverly been circulated on contact lists known to be used by scammers.

If you’d like to hear Daisy in action here is a video with some actual audio.

Daisy was set up with the help of one of YouTube’s best-known scam baiters, Jim Browning. Behind the scenes there are several people who enjoy being real-life time wasters, but they can only occupy so many scammers because their own time is limited.

We asked Tammy Stewart, one of Malwarebytes’ researchers, who has made it a hobby to waste the time of phishers herself, and she was enthusiastic about the idea of having a “Daisy.” In fact, she’d like to have several and she thinks they could be very effective.

Categories: Malware Bytes
