Diffbot’s mission is to convert the unstructured data of the web into a structured database. We identified 20 page types on the web and were implementing a separate extraction algorithm for each; these algorithms are also offered as stand-alone APIs. At the time I left in August, APIs were available for extracting information from article, product, and image pages. We were also working on a classifier to determine the page type of an arbitrary page.
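As an illustration of how these per-page-type APIs are typically invoked, the sketch below builds request URLs in the style of Diffbot's public HTTP API (a GET with a `token` and a target `url`). The `/v3` base path reflects the current public API and may differ from what was available at the time of the talk; the token is a placeholder.

```python
from urllib.parse import urlencode

# Base path of Diffbot's public v3 API (assumed here for illustration).
API_BASE = "https://api.diffbot.com/v3"

def extraction_url(page_type: str, page_url: str, token: str) -> str:
    """Build a request URL for one of the stand-alone per-page-type
    extraction APIs (article, product, image)."""
    query = urlencode({"token": token, "url": page_url})
    return f"{API_BASE}/{page_type}?{query}"

# Each supported page type is exposed as its own endpoint:
for page_type in ("article", "product", "image"):
    print(extraction_url(page_type, "http://example.com/post", "YOUR_TOKEN"))
```

Fetching one of these URLs returns the structured fields extracted for that page type (e.g. title, author, and text for an article page).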
This talk will survey a variety of approaches to extracting information from web data, presenting some of the ways we applied machine learning algorithms in practice. I will address the distinction between machine-learned and rule-based web data extraction, and describe some of the algorithm features that proved useful for extracting data from the web.
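To make the rule-based versus machine-learned distinction concrete, the sketch below contrasts a hand-written pattern tied to a page's markup with per-node features of the kind a learned extractor might score. The specific features shown (word count, link density) are illustrative assumptions, not Diffbot's actual feature set.

```python
import re

HTML = '<html><h1 class="headline">Sample Post</h1><p>Body text here.</p></html>'

# Rule-based extraction: a hand-written pattern that breaks when the
# site's markup changes.
def rule_based_title(html: str):
    m = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.S)
    return m.group(1).strip() if m else None

# A machine-learned extractor instead scores candidate DOM nodes by
# features; these particular features are hypothetical examples.
def node_features(text: str, n_links: int) -> dict:
    words = text.split()
    return {
        "word_count": len(words),
        "link_density": n_links / max(len(words), 1),
    }

print(rule_based_title(HTML))                # extracted title
print(node_features("Body text here.", 0))   # features for one candidate node
```

The trade-off previewed in the talk is visible even at this scale: the rule is precise but brittle, while the feature-based approach generalizes across sites at the cost of needing labeled training data.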
If time permits, additional topics will be covered, including the challenges of scaling a web API, issues that arise when crawling the web, example applications built on Diffbot, and some ideas that I was unable to implement due to cost or time constraints.