Scraping WebMail sites for contacts using JScrape

Many new websites, especially those that depend on social networks, now offer ways to import contacts from various WebMail sites. I’m not going to go into the ethics of asking users for their user name and password to a webmail site and then scraping it, but I will touch on the technical challenges. I started by building JScrape, a Java API that makes scraping websites easier. I then decided to try to scrape contact lists from Yahoo!, GMail, Hotmail and AOL. I found that each of these sites had its own challenges. The easiest by far was Yahoo!, so that is what I’ll start with. I’m not going to provide the exact code, but I will give you tips that will definitely get you going.

The basic process for all of these sites is:

1) Use a tool (such as Fiddler or Ethereal, now called Wireshark) to capture the network traffic that occurs when you log in to the site.
2) Work out which cookies and JavaScript the site uses during login and reproduce them in your own requests. Each site does this differently to make logging in more challenging (this is the hard part).
3) Using the same session, post to the address book page for that site.
4) Use JScrape to parse out the email addresses that you want. You may need to step through several pages of results, depending on the number of email addresses and how the site displays them.
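
To give a feel for step 4, here is a minimal sketch in plain Java. It is not JScrape code and does not use the JScrape API; it just uses the standard java.util.regex classes to pull anything that looks like an email address out of a chunk of HTML. The class name ContactExtractor, the method extractEmails and the sample HTML are all made up for illustration, and the pattern is deliberately simple, so treat it as a starting point rather than a complete address parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of JScrape.
class ContactExtractor {

    // Deliberately simple pattern: it catches common addresses but does not
    // cover everything the email RFCs allow.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Returns the distinct addresses found in the HTML, lowercased,
    // in the order they first appear.
    static List<String> extractEmails(String html) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(html);
        while (m.find()) {
            seen.add(m.group().toLowerCase(Locale.ROOT));
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Made-up address book markup; real pages will look different.
        String page = "<tr><td>Ann</td><td>ann@example.com</td></tr>"
                + "<tr><td>Bob</td><td><a href=\"mailto:Bob@example.org\">"
                + "Bob@example.org</a></td></tr>";
        System.out.println(extractEmails(page));
    }
}
```

The LinkedHashSet is there because address book pages often repeat the same address (once in a mailto: link and once as visible text), and when you are paging through results you can feed each page through the same method and merge the lists.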

Sounds simple, eh? Well, step #2 can be quite challenging and frustrating. I will add a new blog entry for each of the different sites and how to “log in” to them, so check back soon.