UW Computer Security Research and Course Blog

Security Review: CAPTCHA Systems

By angel at 11:58 pm on February 10, 2008 | 4 Comments

Summary

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart.

Initially developed by researchers at Carnegie Mellon, the system was meant to distinguish real people from automated bots when new accounts are opened (email accounts, eBay accounts, bank accounts…). A typical CAPTCHA is an image of letters and numbers that have been shifted, set in different fonts, colored, shaded, and slightly blurred, yet remain readable to the human eye. The goal is to keep spammers from opening accounts automatically.

Dan Hubbard, vice president of Websense, recently reported that the Microsoft CAPTCHA system used by every Windows Live site has been compromised. Bots are reportedly achieving a 35% success rate and can register hundreds of new users per minute using automated HTTP requests over raw sockets. These ‘virgin’ accounts are used for a short period (before they get blacklisted) to send spam, or to send viruses that ‘recruit’ more botnet zombies. Yahoo’s CAPTCHA system was reportedly broken a few weeks ago as well, by a Russian researcher.

Assets & Security Goals

A common type of CAPTCHA requires the user to type the letters shown in a distorted image, sometimes with an additional obscured sequence of letters or digits on the screen. Other CAPTCHAs try to make segmentation difficult by adding an angled line or by crowding the symbols together so that humans can still read them but bots cannot separate them.

CAPTCHAs are used to prevent automated software from performing actions that degrade the service of Internet systems, whether through abuse or resource expenditure. They are most often deployed in response to encroachment by commercial interests, for example to block automated account registration on major websites. They also protect blogs, forums and wikis against automated posting, which stops commercial promotion, harassment, vandalism and the chaos these can cause. Typically the CAPTCHA is presented only at the registration phase, so once a bot clears the registration step it ‘obtains’ human privileges from the system’s standpoint.

Adversaries and Threats

Nowadays CAPTCHA decoders are mostly used by spammers. They register Yahoo e-mail accounts to get past anti-spam systems and Bayesian filters that rely on databases of blacklisted addresses, manipulate polls on publicly accessible registration pages, or even automate bidding on eBay to steer the outcome of an auction in the attacker’s favor.

Once inside, these bots can also cause chaos in any online community by performing disruptive automated actions. By using large public commercial email systems, spammers ensure that the sending mail servers (the provider’s MX hosts) will never be blacklisted, and an individual address lives long enough before it is blacklisted to send a few thousand emails. Automated recognition software does not need to be very accurate: an accuracy of 15% is enough when the attacker can run 100,000 attempts per day, since that still yields about 15,000 solved CAPTCHAs daily.
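To make the accuracy argument concrete, here is a small sketch of the arithmetic. The function names are illustrative and not taken from any tool, and the model assumes each attempt succeeds independently with the same probability, which is a simplification.

```python
import math


def expected_successes(attempts_per_day: int, accuracy: float) -> float:
    """Expected number of solved challenges per day at a given accuracy."""
    return attempts_per_day * accuracy


def attempts_for_confidence(accuracy: float, confidence: float = 0.99) -> int:
    """Smallest n such that at least one of n independent attempts
    succeeds with probability >= confidence, i.e. 1 - (1 - p)**n >= c."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - accuracy))


if __name__ == "__main__":
    # Figures from the text: 15% accuracy, 100,000 attempts per day.
    print(expected_successes(100_000, 0.15))
    # Attempts needed to be 99% sure of at least one success at 15% accuracy.
    print(attempts_for_confidence(0.15, 0.99))
```

Under these assumptions a solver that fails 85% of the time still needs only a few dozen tries to be nearly certain of one success. This is why a defender cannot treat a low per-attempt accuracy as a safety margin.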

Weaknesses

There are four major techniques widely used to attack CAPTCHA systems:

1. Leveraging insecure implementations: Many CAPTCHA implementations do not destroy the session after the correct phrase is entered. By reusing the session ID of a known CAPTCHA image, it is possible to automate requests to a CAPTCHA-protected page. This flaw is known to be exploitable in a significant number of free and commercial CAPTCHA scripts. Simply by recording a session ID and the CAPTCHA plaintext, an attacker can resend the same session ID and the same plaintext any number of times, changing only the user data, and thus register thousands of accounts.

2. Using Optical Character Recognition (OCR) software when the distortion can be reverse-engineered because it follows a common pattern. The attacker extracts the image from the web page, removes the background clutter with color filters and thin-line detection, and splits the image into regions that each contain a single letter. These regions are then piped to batch OCR software, which identifies the letter in each region and returns the actual word.

3. Attacking client-side hashes: Some CAPTCHA implementations draw from a limited set of words and pass a hash of the solution (such as an MD5 hash) to the client so that the answer can be validated. Often the set of possible solutions is small enough that the hash can be cracked with rainbow tables in a matter of minutes.

4. Human solvers: Last year, spammers used a virtual stripper as bait to dupe people into helping criminals crack CAPTCHA codes. Security researchers warned of a series of photographs that showed a woman with progressively fewer clothes each time the user correctly entered the characters of an accompanying CAPTCHA. Other variations used fake adult websites for the same purpose: visitors to these pages unknowingly solve CAPTCHAs relayed from the target site on behalf of the bots. With enough traffic, the attacker can get a solution in time to relay it back to the target site.
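The session-reuse flaw in the first technique comes down to a challenge staying valid after it has been answered. A minimal sketch of the server-side fix, assuming an in-memory store, could look like the following. The class `ChallengeStore` and its methods are hypothetical names for illustration, not part of any CAPTCHA product.

```python
import hmac
import secrets
from typing import Dict


class ChallengeStore:
    """In-memory store of outstanding CAPTCHA challenges (illustrative only)."""

    def __init__(self) -> None:
        self._answers: Dict[str, str] = {}

    def issue(self, answer: str) -> str:
        """Register a new challenge and return an unguessable one-time ID."""
        challenge_id = secrets.token_urlsafe(16)
        self._answers[challenge_id] = answer
        return challenge_id

    def verify(self, challenge_id: str, response: str) -> bool:
        """Check a response. The challenge is removed on the first attempt,
        right or wrong, so the same ID can never be replayed."""
        expected = self._answers.pop(challenge_id, None)
        if expected is None:
            return False  # unknown ID, or the challenge was already used
        # Constant-time, case-insensitive comparison of the two strings.
        return hmac.compare_digest(
            expected.lower().encode(), response.lower().encode()
        )


if __name__ == "__main__":
    store = ChallengeStore()
    cid = store.issue("xK7pQ")
    print(store.verify(cid, "xk7pq"))  # first use of the challenge
    print(store.verify(cid, "xk7pq"))  # replay of the same ID and answer
```

Because `verify` pops the entry before comparing, a recorded pair of session ID and plaintext is useless on the second request. A real deployment would also need expiry times and shared storage across web servers, which this sketch leaves out.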

Defenses

Major technology vendors are aware of these attempts to solve CAPTCHA images automatically and are working on improvements. As new AI-based models come to light, the only way to prevent a repeat of the image spam surge will be for technology vendors and their customers to abandon the current filtering-heavy approach.

Instead, they should develop a more sophisticated filtering scheme that detects repeated account-creation requests based on common patterns. For example, filtering could be based on neural networks that detect multiple requests from the same address within a short window of time, the same user agent, the same HTTP protocol patterns, the same repeated HTTP headers, or account names that follow a pattern such as USER**** with a random suffix. Acoustic CAPTCHAs can also add some security, since properly designed audio can make the voice recognition problem far more expensive for bots. An alternative would be a two-way handshake in which text-to-speech technology asks the user a question whose answer is not contained in the audio payload itself.
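The simplest of these signals, many registration requests from the same source in a short window, does not need a neural network to illustrate. The sketch below swaps in a plain sliding-window counter instead. The class name, the default thresholds and the idea of keying on address plus user agent are all assumptions made for the example.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RegistrationRateLimiter:
    """Sliding-window counter of registration attempts per client key."""

    def __init__(self, max_requests: int = 5, window_seconds: float = 60.0) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._seen: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_key: str, now: float) -> bool:
        """Return True if this request stays within the limit for its key."""
        timestamps = self._seen[client_key]
        # Forget requests that have fallen out of the time window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False  # too many recent registrations from this key
        timestamps.append(now)
        return True


if __name__ == "__main__":
    limiter = RegistrationRateLimiter(max_requests=3, window_seconds=60.0)
    key = "203.0.113.7|ExampleAgent/1.0"  # hypothetical address and user agent
    for t in (0.0, 1.0, 2.0, 3.0):
        print(t, limiter.allow(key, now=t))
```

Passing `now` in explicitly keeps the sketch testable; a server would normally supply `time.monotonic()`. A counter like this is easy to evade with proxies and rotating user agents, so it is at best one input to the broader pattern detection described above.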

Finally, for CAPTCHAs that rely on MD5 hashes, a more secure scheme would use an HMAC keyed with a secret that never leaves the server.
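The difference between the two schemes can be sketched in a few lines. The word list, the nonce format and all function names below are assumptions for illustration. The first half shows why an unkeyed MD5 of a word from a small dictionary can be reversed by simply hashing every candidate, and the second half shows an HMAC-SHA256 token that cannot be checked without the server key.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Sequence

WORDS = ["apple", "river", "stone", "cloud"]  # hypothetical CAPTCHA dictionary


def md5_token(answer: str) -> str:
    """Weak scheme: the client receives a plain MD5 hash of the answer."""
    return hashlib.md5(answer.encode()).hexdigest()


def crack_md5_token(token: str, words: Sequence[str]) -> Optional[str]:
    """Recover the answer by hashing each candidate word and comparing."""
    for word in words:
        if md5_token(word) == token:
            return word
    return None


SERVER_KEY = secrets.token_bytes(32)  # secret that stays on the server


def hmac_token(answer: str, nonce: str, key: bytes = SERVER_KEY) -> str:
    """Stronger scheme: HMAC-SHA256 over a per-challenge nonce and the answer."""
    message = f"{nonce}:{answer}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_hmac_token(token: str, response: str, nonce: str,
                      key: bytes = SERVER_KEY) -> bool:
    """Recompute the HMAC for the user's response and compare in constant time."""
    return hmac.compare_digest(token, hmac_token(response, nonce, key))


if __name__ == "__main__":
    print(crack_md5_token(md5_token("river"), WORDS))
    nonce = secrets.token_hex(8)
    token = hmac_token("river", nonce)
    print(verify_hmac_token(token, "river", nonce))
    print(verify_hmac_token(token, "stone", nonce))
```

Without `SERVER_KEY`, hashing every dictionary word no longer reveals which one matches the token. The HMAC alone does not stop replay of a solved challenge, so the token would still need to be single-use or short-lived.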

Conclusions

We have seen four ways in which an attacker can compromise today’s CAPTCHA systems. Developers must build something that is simple and easy enough for people to accept but too difficult for a computer to parse. However, both the Yahoo and Microsoft CAPTCHA systems have reportedly been broken with a high success rate, making it possible for bots to automate account creation.

Even feasible solutions such as identifying patterns based on IP addresses become difficult as more bots use proxies, random user agents, traffic randomization and hacked Unix shells to launch their registrations, so that the web server can’t tell a Perl script from a human. Apparently, for the hackers, where there’s a will, there’s a way.

Filed under: Security Reviews

4 Comments

1

Comment by cbhacking

February 11, 2008 @ 2:15 am

Nice post. I’d always wondered what CAPTCHA stood for (but not enough to look it up, *sigh*). I hadn’t realized that both MS and Yahoo! CAPTCHA systems had been broken; I’d thought it was only Yahoo!’s.

However, PLEASE remember to use the more button! When browsing, summaries are your friend and page-plus single posts are not.

2

Comment by Kris Plunkett

February 12, 2008 @ 7:32 pm

Yes, please use the more feature…

3

Comment by joyleung

February 17, 2008 @ 5:09 pm

Good job on writing such a thorough review. Everything looks well informed. Another asset you might consider for CAPTCHAs is availability. Recently, I’ve found that the images have gotten so visually distorted and complex (‘u’ can often look a lot like ‘v’) that I have to try a couple of times to actually figure it out. This can be problematic if OCR gets better and CAPTCHAs become more visually complex to match it.

4

Comment by Amir Hossain

March 25, 2008 @ 2:05 am

Hi!!! Hope you are doing well. We are the leading data processing company in Bangladesh. Presently we are processing 300000+ captchas per day with our 55 operators. We are well set up and can give a low rate for captcha solving.

Our rate $2 per 1000 captcha.

We just wanna make the relationship for long terms. can we go forward? Thank you, (For inquiry amir4@yours.com or khoknaa@yahoo.com)

Best Regards
Amir Hossain Dewan
Data Home Ltd.
amir4@yours.com
khoknaa@yahoo.com