This is the third post in a short series on how I built an API to grab contact list information from Yahoo!, AOL, GMail and Hotmail. The first post reviewed the high-level approach to scraping sites. The second covered how to scrape Yahoo!, which is by far the easiest of the four sites to scrape. This post covers AOL, which is much more challenging because it requires some cookie manipulation and some JavaScript emulation. The tips below aren’t necessarily the best way to do this, but they worked for me.
To work with AOL you need to use the HttpClient and PostMethod objects from the Apache Commons HttpClient API directly. For every URL you post to, make sure to set the User-Agent header and the cookie policy:
post.getParams().setCookiePolicy(CookiePolicy.BROWSER_COMPATIBILITY);
post.setRequestHeader("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727)");
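Since every request in the login sequence carries the same browser-like headers, it can help to build them in one place. The sketch below is one possible way to do that using only the JDK. The class name `BrowserHeaders`, the method `forRequest`, and the example URL are made up for illustration and are not part of Commons HttpClient. It also covers the Referer header, which each post after the first sets to the previous URL.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not from the original code): collects the headers
// that every request in the login sequence should carry.
final class BrowserHeaders {
    // Same User-Agent string as in the snippet above.
    static final String USER_AGENT =
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; "
        + ".NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; "
        + ".NET CLR 2.0.50727)";

    private BrowserHeaders() {}

    // previousUrl may be null for the first request, which has no referring page.
    static Map<String, String> forRequest(String previousUrl) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", USER_AGENT);
        if (previousUrl != null) {
            // HTTP spells this header "Referer", with a single "r" in the middle.
            headers.put("Referer", previousUrl);
        }
        return Collections.unmodifiableMap(headers);
    }

    public static void main(String[] args) {
        // Assumption: with Commons HttpClient, each entry would be applied via
        //   post.setRequestHeader(e.getKey(), e.getValue());
        Map<String, String> headers = forRequest("https://example.com/login");
        for (Map.Entry<String, String> e : headers.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Keeping the headers in a map means the same values can be copied onto each new PostMethod, so a request in the middle of the sequence is less likely to go out without them.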
For each post I also set the Referer header (HTTP spells it with a single "r") to the previous URL. After you post to the first URL, you’ll need to process all the hidden variables returned in the response and add them to the next post. There was also a cookie that I seemed to need to add manually; to do so I used the following snippet of code:
// Copy every cookie the client has collected so far into one header string.
Cookie[] c = client.getState().getCookies();
String cStr = "";
for (int i = 0; i < c.length; i++) {
    cStr += c[i].getName() + "=" + c[i].getValue() + "; ";
}
// Append the extra cookies by hand.
cStr += "s_cc=true; s_sq=aolsnssignin%2Caolsvc%3D%2526pid%253Dsso%252520%25253A%252520login%2526pidt%253D1%2526oid%253DSign%252520In%2526oidt%253D3%2526ot%253DSUBMIT%2526oi%253D97";
post.setRequestHeader("Cookie", cStr);
This second post should also contain the user name and password. It is the first part of the login. The response contains JavaScript that forwards to a new, specific URL, which you need to extract dynamically. I used the following code:
int onLoad = data.indexOf(“
int http = data.indexOf("http:", onLoad);