Project Details
Extracting data from template-based websites
Laboratory: LSIR | Master | Completed
Description:
Many websites are dynamically generated from a database and a template. Examples include e-commerce sites such as Amazon, classified-ad sites such as craigslist.org, and flight-schedule sites such as swiss.com.
This project consists of implementing an algorithm that takes a set of template-generated pages from a given website, automatically learns the template, and uses it to extract the data from the pages. The starting point is the publication "Extracting Structured Data from Web Pages" (Arasu and Garcia-Molina, Stanford).
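To illustrate the idea of learning a template from sample pages, here is a minimal Python sketch. It is not the algorithm from the cited publication: it is a much simpler stand-in that treats the tokens shared, in order, by all sample pages as the template and everything else as data. The function names (`tokenize`, `learn_template`, `extract`) and the sample pages are hypothetical, and the matching relies on the heuristic alignment of Python's `difflib`.

```python
import re
from difflib import SequenceMatcher


def tokenize(html):
    """Split a page into tag tokens and non-empty text tokens."""
    parts = re.findall(r"<[^>]+>|[^<]+", html)
    return [p.strip() for p in parts if p.strip()]


def learn_template(pages):
    """Keep only the tokens that align across every sample page.

    Simplifying assumption: a token belongs to the template if it
    survives a pairwise alignment with each page in turn.
    """
    template = tokenize(pages[0])
    for page in pages[1:]:
        tokens = tokenize(page)
        matcher = SequenceMatcher(None, template, tokens, autojunk=False)
        common = []
        for block in matcher.get_matching_blocks():
            common.extend(template[block.a:block.a + block.size])
        template = common
    return template


def extract(template, page):
    """Return the tokens of `page` that fall between template tokens."""
    tokens = tokenize(page)
    matcher = SequenceMatcher(None, template, tokens, autojunk=False)
    data, pos = [], 0
    # The final matching block is a zero-size sentinel at the end of both
    # sequences, so trailing data tokens are collected as well.
    for block in matcher.get_matching_blocks():
        data.extend(tokens[pos:block.b])
        pos = block.b + block.size
    return data


# Hypothetical sample pages generated from the same template.
pages = [
    "<html><h1>Red shoe</h1><p>Price: <b>10</b></p></html>",
    "<html><h1>Blue hat</h1><p>Price: <b>25</b></p></html>",
]
template = learn_template(pages)
print(extract(template, pages[0]))  # ['Red shoe', '10']
```

A sketch like this breaks down on optional fields, repeated records, and data values that happen to coincide across all samples, which are among the cases a real template-induction algorithm has to handle.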
Tasks:
- Implement the algorithm proposed in the cited publication
- Run and analyze the success rate for a set of given websites
- Propose improvements
- Implement a crawler suited for the task
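The project description does not define "success rate". One possible reading, assumed here for illustration only, is to compare the values extracted from a page against a hand-labelled list of expected values and report precision, recall, and F1. The function name `score` and the sample values below are hypothetical.

```python
from collections import Counter


def score(extracted, expected):
    """Return (precision, recall, f1) for two lists of extracted values.

    Values are compared as multisets, so a value that appears twice in
    the expected list must also be extracted twice to count fully.
    """
    overlap = sum((Counter(extracted) & Counter(expected)).values())
    precision = overlap / len(extracted) if extracted else 0.0
    recall = overlap / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: one template token leaked into the extracted data.
precision, recall, f1 = score(["Red shoe", "10", "Price:"], ["Red shoe", "10"])
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Averaging these scores over the pages of each website would give one per-site figure that can be compared before and after a proposed improvement.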
Requirements:
- Expertise in Java or Python
- Previous work on unsupervised learning methods
This project will be jointly supervised by David Portabella (http://db4all.com/) and Zoltan Miklos.
Contact: Zoltan Miklos