Web Scraping FAQ
Q&A with John Perry
Posted June 1, 2007
What is Web scraping? It is the automated collection of information from Web sites.
In the realm of journalism, what are the benefits of Web scraping? You can monitor changes on government Web sites, and you can download online databases if the agencies are reluctant to give you the raw data.
Can you give me some examples of Web scraping outside of journalism? A lot of companies use it for competitive intelligence, to keep track of competitors' Web sites. Some people also use it for personal reasons: people can create their own RSS feeds, basically.
Is Web scraping legal? There is really no difference between scraping and sitting there with a browser. Instead of having a room full of interns manually going to Web sites and copying and pasting the data, the script does it for you.
If you hit a Web site too quickly, it could be interpreted as a denial of service attack. You have to reduce the number of requests per second.
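That rate-limiting advice can be sketched in a few lines of Python. This is a minimal illustration, not anything from the interview: `polite_fetch` and its parameters are made-up names, and the `fetch` argument stands in for whatever function actually retrieves a page.

```python
import time

def polite_fetch(urls, fetch, delay=1.0):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is whatever callable actually retrieves a page
    (e.g. urllib.request.urlopen). The pause keeps the request
    rate low so the scrape is not mistaken for an attack.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # wait before every request after the first
        results.append(fetch(url))
    return results
```

A scrape of a large site would call this with a real fetch function and a delay of a second or more.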
Some sites, especially commercial sites, have usage agreements. For example, Google: If you scrape Google and they notice, they will block your IP. They want you to use their developer tools.
For government sites, those are public records so as long as you are respectful of their bandwidth and server capabilities, they should not have any grounds for complaint. Theoretically, they are putting this info up for the public to use.
How does one get started with Web scraping? Probably the first step would be to learn a scripting language such as PHP, Perl, or Python.
How much technical knowledge is needed to start Web scraping? I think it is fairly easy to get started. You can do some really powerful things with beginner scripts.
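As a hedged illustration of how simple a beginner script can be, here is a short Python example (one of the languages recommended above) that pulls every link out of a page using only the standard library. The sample HTML is invented for the demonstration.

```python
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# A stand-in for a downloaded page; a real script would fetch
# this with urllib.request.urlopen() instead.
page = """<html><body>
<a href="/budget.pdf">Budget</a>
<a href="/minutes.html">Minutes</a>
</body></html>"""

scraper = LinkScraper()
scraper.feed(page)
print(scraper.links)  # ['/budget.pdf', '/minutes.html']
```

Fifteen lines of code already gets you every document linked from a page, which is the core of the "download a site's files" task described later in this interview.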
What are some of the tools that Web scrapers use, and would you recommend something different for beginners? Every CAR (computer-assisted reporting) person ought to learn a scripting language. There is really nothing more powerful than Perl or Python for automating maintenance tasks and running much more powerful searches than straight SQL.
What other tasks is Perl used for? What does Perl compare to? Anything that involves parsing text. If you have data from an agency that is weird and hard to get into a database, you can use Perl.
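The "weird agency data" problem above usually comes down to pulling structure out of free-form text with regular expressions, which Perl and Python handle equally well. A small Python sketch, with an entirely made-up sample of fixed-format agency output:

```python
import re

# Invented example of "weird" agency output: plain text lines
# rather than a delimited file.
raw = """Case 2007-014  filed 03/12/2007  SMITH, JOHN
Case 2007-015  filed 03/14/2007  DOE, JANE"""

# Named groups turn each line into a dictionary ready for a database.
pattern = re.compile(r"Case (?P<case>\S+)\s+filed (?P<date>\S+)\s+(?P<name>.+)")

records = [m.groupdict() for m in pattern.finditer(raw)]
print(records[0]["name"])  # SMITH, JOHN
```

Once the text is in dictionaries like this, loading it into a database is a one-line insert per record.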
How long does it typically take to do something like this? Depends on how complicated the page is... A simple page takes half an hour to an hour.
Would you write a scraper for a page that you would only use once? If there is a lot of stuff there, then yes. Like one of the things that we did at [The Center for Public Integrity]: all the Senate campaign finance records are in PDF files. So we wrote a scraper that went to the FEC site and downloaded several years of those files, and it ran for weeks.
How do you learn which modules to use, and how they work? There is a whole Perl community. Perl.com, which is maintained by O'Reilly, has monthly articles that keep up with developments in Perl. CPAN has an email list (http://lists.cpan.org/), so you can subscribe to that and it tells you when there are updates and new modules. When you install a package using PPM, it automatically installs the documentation for that package.
How do you run Perl on a page with a lot of JavaScript? JavaScript is a big problem. Web browsers have JavaScript engines embedded in them, so they're running the JavaScript. Your script doesn't have a JavaScript engine embedded. So what you have to do is look at the JavaScript, figure out what it is doing, and then write your script so it pretends to be the JavaScript, giving the server what it expects.
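In practice, "pretending to be the JavaScript" means building the same request the browser's JavaScript would have sent. A small Python sketch of that idea, using the standard library; the URL and form field names are hypothetical placeholders, and the request is constructed but deliberately not sent:

```python
import urllib.parse
import urllib.request

# Hypothetical form fields that, on the real site, a JavaScript
# handler would assemble and submit; the names are made up.
fields = {"reportYear": "2006", "chamber": "senate"}
data = urllib.parse.urlencode(fields).encode("ascii")

req = urllib.request.Request(
    "http://example.gov/search",  # placeholder URL
    data=data,
    headers={"User-Agent": "my-scraper/0.1"},
)
# req now carries exactly the body the server expects from the page's
# JavaScript; urllib.request.urlopen(req) would send it (not run here).
print(req.get_method())  # POST, because a body is attached
```

The hard part is the detective work described next: discovering which fields and headers the JavaScript actually sends, usually by watching the browser's own traffic.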
Sometimes it is detective work trying to figure out what it is doing. Another useful Firefox extension captures the HTTP request and response headers, so you can listen in on the browser and the server talking to each other.
For those who have a very basic understanding of Web scraping, what is the next step? The first step would be to learn a scripting language. I would also recommend Firefox because it has extensions for Web developers that are also good for analyzing a Web site and grabbing the info you need.
John Perry is a senior fellow at The Center for Public Integrity.