Scraping Yahoo! for contacts using JScrape

This post builds on my previous post, which discussed how to scrape webmail sites for contacts. Of the major sites, Yahoo! is by far the easiest to scrape. After you’ve sniffed the URLs used for the login, you just need to substitute your own username and password. Yahoo! currently does not use any JavaScript tricks or special cookies during the login, so using JScrape as-is should be sufficient. The one trick to Yahoo! is that it breaks the address book up into separate pages. In my solution I grab these page URLs dynamically using the following snippet of code:


    public String[] getURLs()
    {
        String q = "declare namespace xhtml=\"http://www.w3.org/1999/xhtml\"; \n" +
            "for $d in //xhtml:ol[@id='abcnav']/xhtml:li/xhtml:a \n" +
            " return <li>{ $d/@href/string() }</li>";

        // pScrape is a com.apsquared.jscrape.PageScraper object that has already logged in to the site.
        List l = pScrape.scrapePageForList("http://address.yahoo.com/yab/us", q);
        if (l == null)
            return null;

        String[] ret = new String[l.size()];
        for (int i = 0; i < l.size(); i++)
        {
            TinyNodeImpl ti = (TinyNodeImpl) l.get(i);
            ret[i] = ti.getStringValue();
        }
        return ret;
    }


Note: this may return null if the user account has only a small number of contacts.
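Because getURLs() can come back null, the caller needs some kind of fallback. What follows is only a minimal sketch under two assumptions that this post does not confirm: that a null result means the whole address book fits on the single page at the base URL, and that the scraped hrefs may be relative and need resolving against that base. The class and method names are hypothetical and not part of JScrape.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JScrape.
final class AddressBookPages {

    // Returns the absolute URLs of the pages to scrape for contacts.
    // navHrefs is whatever getURLs() returned; it may be null.
    public static List<String> pagesToScrape(String baseUrl, String[] navHrefs) {
        List<String> pages = new ArrayList<>();
        if (navHrefs == null || navHrefs.length == 0) {
            // Assumption: no page navigation means a single-page address book.
            pages.add(baseUrl);
            return pages;
        }
        // URI.create throws IllegalArgumentException on a malformed URL or href.
        URI base = URI.create(baseUrl);
        for (String href : navHrefs) {
            // resolve() returns absolute hrefs unchanged and resolves
            // relative ones against the base URL.
            pages.add(base.resolve(href.trim()).toString());
        }
        return pages;
    }
}
```

Calling pagesToScrape("http://address.yahoo.com/yab/us", getURLs()) would then always give at least one page to scrape, even for small accounts.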

For each URL returned, you need to scrape the page looking for the contacts. I used the following XQuery for that scrape:


    declare namespace xhtml="http://www.w3.org/1999/xhtml";
    for $d in //xhtml:td[@class='contactnumbers']/xhtml:span/xhtml:a
    return <li>{ data($d) }</li>
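The per-page loop can be sketched roughly as follows. This is an illustration rather than the code I actually ran: PageSource is a hypothetical stand-in for the logged-in PageScraper (only scrapePageForList appears in this post), and it assumes each scraped node has already been turned into a String, for example with getStringValue() as in getURLs(). Dropping duplicates while keeping first-seen order is a design choice of the sketch, not something Yahoo! requires.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of JScrape.
final class ContactCollector {

    // Hypothetical stand-in for the logged-in scraper: runs an XQuery against
    // a URL and returns the string value of each result, or null if none.
    interface PageSource {
        List<String> scrape(String url, String xquery);
    }

    // Runs contactQuery against every page URL and merges the results.
    static List<String> collect(PageSource source, List<String> pageUrls, String contactQuery) {
        // LinkedHashSet drops duplicates but keeps insertion order.
        Set<String> seen = new LinkedHashSet<>();
        for (String url : pageUrls) {
            List<String> found = source.scrape(url, contactQuery);
            if (found == null) {
                continue; // nothing matched on this page
            }
            for (String contact : found) {
                String trimmed = contact.trim();
                if (!trimmed.isEmpty()) {
                    seen.add(trimmed);
                }
            }
        }
        return new ArrayList<>(seen);
    }
}
```

The contact XQuery is passed in as a plain string, so the same loop works if the query has to change when Yahoo! changes its markup.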



That’s about it. As we’ll see over the next few days, this is much simpler than many other sites (GMail, Hotmail, AOL), which require many more tricks to log in.