How to avoid the latest scraping barriers
by: Chase Davis
Houston Chronicle
Posted June 1, 2007

In early January, a brief thread on NICAR-L, a computer-assisted reporting Listserv operated by IRE and NICAR, highlighted a bizarre step taken by the Seattle Fire Department to shield its online data from Web scrapers.

Citing national security concerns, Seattle fire officials reformatted the public response data available on their Web site from scraper-friendly HTML into a JPEG image. The theory, according to the officials behind the decision, was that malevolent users could exploit patterns in the data for nefarious ends. The image conversion was supposed to stop them.

"Our intent is to enhance the safety of personnel and the public but still provide information about current emergencies in our community," department officials wrote on their Web site soon after the change.

Although their plan was short-lived, the Seattle Fire Department is not the only agency that has begun trying to shut out scrapers. Citing reasons ranging from security to bandwidth concerns, agencies are erecting new barriers to thwart data-seekers, even though the datasets they pursue are often public and searchable online.

The countermeasures are often targeted at mash-up artists, who collect and repackage online data in order to create Web sites such as www.chicagocrime.org. Examples of these barriers are still rare, but they may soon pose problems for CAR specialists — particularly those who rely on automated data harvesting to regularly update internal and public-facing Web applications.

The methods employed have been varied — some sophisticated and some not. Fortunately, many of them, such as the ones listed below, can be overcome with a little courtesy and ingenuity:

IMAGES

It turns out that the Seattle scenario was relatively easy to defeat. As several bloggers and NICAR-L posters noted, Optical Character Recognition software, which pulls text from image files, was more than capable of extracting text from the image.
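
As an illustration only, that extraction step can be a few lines of code with modern open-source tools; the pytesseract wrapper around the Tesseract OCR engine and the file name used here are assumptions for the sketch, not the specific software those posters used:

    # A minimal sketch: pull text out of a published JPEG with OCR.
    # Assumes the pytesseract and Pillow packages are installed and the
    # Tesseract engine is available on your PATH.
    from PIL import Image
    import pytesseract

    # Hypothetical file name; substitute the image you actually downloaded.
    image = Image.open("fire_responses.jpg")
    text = pytesseract.image_to_string(image)
    print(text)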

John Eberley, the Seattle mash-up artist who initially brought attention to the format change in October, said the shift to images seemed futile.

"It is an illusion of protection for them," he said in a Seattle Post-Intelligencer article. "If they are really worried about it, they should pull the whole thing off the Web entirely. I don't see any difference from this data compared to listening in on a scanner of police or fire calls."

Within months, and following stories in local newspapers and the blogosphere, the fire department reposted its data in easily scrapable HTML.

CAPTCHA

The topsy-turvy, wavy letters you often see before submitting a blog post or signing up for an online account might be the most formidable scraping barrier constructed to date, and the Missouri State Highway Patrol has started using them.

CAPTCHAs dynamically create simple puzzles — in the Highway Patrol's case, a sequence of letters or numbers — that most machines would have trouble solving or recognizing. The intent is to make users prove they are human.

When the user completes the CAPTCHA, they are granted a session cookie, which allows them to explore the data for a certain amount of time. When that time expires or the user closes their Web browser, the system creates another CAPTCHA test that must be passed before the user can log back in.
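
In practice, that cookie is just an HTTP header a script can carry along. The sketch below, in today's Python, assumes a hypothetical URL and a cookie value copied from a browser after solving the CAPTCHA by hand; once the session expires, another CAPTCHA has to be solved:

    # A minimal sketch: reuse a session cookie obtained by completing the
    # CAPTCHA in a normal browser. The URL and cookie value are hypothetical
    # placeholders; copy the real values from your browser's cookie store.
    import urllib.request

    SESSION_COOKIE = "SESSIONID=abc123"          # granted after the CAPTCHA
    BASE_URL = "http://example.gov/records?page="

    request = urllib.request.Request(
        BASE_URL + "1",
        headers={"Cookie": SESSION_COOKIE},
    )
    with urllib.request.urlopen(request) as response:
        html = response.read().decode("utf-8", errors="replace")
    print(html[:500])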

The Highway Patrol employed a simplified implementation of the CAPTCHA system last summer after heavy scraper traffic slowed their site to a crawl.

"While all that information was all open record, a bunch of (scrapers) would get on there at a time and really slow things down," Highway Patrol Spokesman Tim Hull said. "We implemented it to keep things running at a normal speed."

Although Hull said the CAPTCHA has stopped scrapers so far, computer security experts have warned that cracking the system is far from impossible. For example, some OCR software suites are capable of recognizing simple CAPTCHA patterns. In several cases, hackers have created modified OCR suites specifically designed to defeat CAPTCHAs. Some claim to work more than 90 percent of the time on certain pattern types.

In other cases, CAPTCHA session IDs have been exploited and reused, particularly in situations where developers have been careless. Other methods are based on artificial intelligence and crowdsourcing, the practice of enlisting human volunteers to complete repetitive tasks a computer can't automate.

RATE LIMITING

A scraping script designed for rapid-fire hits on a server occasionally triggers alarms, warns Daniel Lathrop, CAR specialist at the Seattle Post-Intelligencer. Apart from hogging bandwidth, scraping scripts without built-in delays leave an obvious mark on server logs, which system administrators can use as evidence to shut scrapers out.

Dr. Dobb's Portal, (http://www.ddj.com) a Web site that covers software development, described it this way in a 2004 article: "We know that a human user driving a browser can only make a small number of requests per minute, so logic on the server that detects too many requests per minute could presume that screen scraping is taking place and prevent access from the offending IP address for some period of time."
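
To make that logic concrete, here is a minimal sketch in Python of the sort of counter a server might keep; the one-minute window, request threshold and block duration are illustrative guesses, not values any particular agency uses:

    # A minimal sketch of the server-side logic described above: count
    # requests per IP address per minute and temporarily block addresses
    # that exceed a threshold.
    import time
    from collections import defaultdict

    MAX_REQUESTS_PER_MINUTE = 60
    BLOCK_SECONDS = 600

    request_times = defaultdict(list)   # ip -> timestamps of recent requests
    blocked_until = {}                  # ip -> time the block expires

    def allow_request(ip):
        now = time.time()
        if blocked_until.get(ip, 0) > now:
            return False                            # still blocked
        # keep only requests from the last 60 seconds
        request_times[ip] = [t for t in request_times[ip] if now - t < 60]
        request_times[ip].append(now)
        if len(request_times[ip]) > MAX_REQUESTS_PER_MINUTE:
            blocked_until[ip] = now + BLOCK_SECONDS  # presume a scraper
            return False
        return True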

To prevent scripts from battering a server and to avoid rate-based denials, scrapers can tell their programs to wait a few seconds before each new page request. Sleep commands, available in virtually every language's standard library and documented there, stall a script for a set amount of time, letting regular traffic pass through and giving the server a chance to catch its breath.
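
A polite scraping loop, sketched here in Python, might look like the following; the URL pattern, page count and two-second delay are placeholders to adjust for the site at hand:

    # A minimal sketch of a rate-limited scraping loop: pause between
    # requests so the server is not flooded.
    import time
    import urllib.request

    BASE_URL = "http://example.gov/dispatch?page="   # hypothetical URL

    for page in range(1, 51):
        with urllib.request.urlopen(BASE_URL + str(page)) as response:
            html = response.read().decode("utf-8", errors="replace")
        # ...parse and save the page here...
        time.sleep(2)   # wait two seconds before the next request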

USER AGENT DENIAL

Because every browser or script accessing a Web site carries its name (known as a user agent) and other details within a standard HTTP request, system administrators have the power to exclude certain agents from accessing their Web sites.

Scraping modules in Perl, Python and other languages typically label themselves with a string representing the module name, which immediately identifies them as automated programs. You can change that string to one used by a common Web browser, like Firefox or Internet Explorer, or you can use it as a space to pass your contact information in case system administrators have any questions about your scrape. Either should be sufficient to overcome this denial.
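
Either way, the change amounts to setting a single HTTP header. A minimal sketch in today's Python follows; the agency URL and the contact string are hypothetical examples, not a required format:

    # A minimal sketch: identify your scraper, and yourself, through the
    # User-Agent header instead of the module's default string.
    import urllib.request

    request = urllib.request.Request(
        "http://example.gov/records",
        headers={
            "User-Agent": "NewsroomScraper/1.0 (reporter@example.com)"
        },
    )
    with urllib.request.urlopen(request) as response:
        html = response.read().decode("utf-8", errors="replace")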

CONCLUSION

There is little doubt that scraping countermeasures are evolving much faster than the CAR community can adapt to them. Most of us are not computer scientists, nor do we want to be.

So, instead of worrying about how to defeat the anti-scraping roadblocks, journalists should consider giving agencies fewer incentives to erect them. Among other things, that might mean not hogging bandwidth, being transparent and heeding any terms of use agreements you see.

If a Web site's terms of use — often found in the fine print on page footers — explicitly prohibit retrieving data using automated means, don't do it. Or at least ask permission. These types of restrictions are common on private Web sites, where the data is proprietary and often protected by law.

Also, scrape late at night. Hull, the Missouri Highway Patrol representative, said their department implemented the CAPTCHA system largely because of scrapers stalling traffic during peak hours. Sleep commands also help ensure average users can access the site with minimal slowdown.

If scraping traffic hadn't hurt peak-hour service to regular users, Hull said the department "almost definitely" would not have implemented countermeasures, leaving mash-up artists and journalists alike to scrape unhindered.

"To tie up the system like that really is what made us do it," Hull said. "We only have so much space going in and so much space going out."


Contact Chase Davis at Chase.Davis@chron.com