The STIP lab proposes two data science projects. Both are part of the IPRoduct research project, which seeks to build a large-scale database of products and the patents that protect them. The objective is to crawl the web for “virtual patent marking” (VPM) webpages and extract the patent-product pairs they list.
The two projects are:
1. Implementation of a VPM classifier. Starting from a set of webpages, the goal is to identify which pages provide VPM information and which do not.
2. Automatic extraction of information. Starting from a set of VPM webpages, the goal is to extract the product names and the associated patents. Some cases are fairly trivial (highly structured information) but other cases are more challenging (information buried in text).
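To make the two tasks concrete, here is a minimal rule-based baseline in Python. It is only a sketch under stated assumptions, not the method used in IPRoduct: the regular expression covers comma-formatted US utility and design patent numbers only, the page layout (`Product: number; number`) is a hypothetical example of the "highly structured" case, and the function names are illustrative.

```python
import re

# Assumption: only comma-formatted US utility patents ("7,654,321") and
# design patents ("D612,345") are recognized. Real VPM pages use many
# other formats (no commas, EP/WO numbers, pending applications).
PATENT_RE = re.compile(r"\b(D\d{3},?\d{3}|\d{1,2},\d{3},\d{3})\b")


def extract_patent_numbers(text):
    """Return the patent numbers found in text, with commas removed."""
    return [m.group(1).replace(",", "") for m in PATENT_RE.finditer(text)]


def looks_like_vpm(text):
    """Naive classifier: the page mentions 'patent' and lists a number."""
    return "patent" in text.lower() and bool(extract_patent_numbers(text))


def extract_pairs(text):
    """Extract (product, patent) pairs from lines shaped 'Product: numbers'."""
    pairs = []
    for line in text.splitlines():
        product, sep, rest = line.partition(":")
        if not sep:
            continue  # no colon, so no product/patent split on this line
        for number in extract_patent_numbers(rest):
            pairs.append((product.strip(), number))
    return pairs


# Hypothetical page text, already stripped of HTML.
page = """Patents
WidgetPro: 7,654,321; D612,345
GizmoMax: 8,123,456"""

print(looks_like_vpm(page))
print(extract_pairs(page))
```

A baseline like this handles only the structured case. Pages where the information is buried in text are where learned classifiers and extraction models would be expected to help.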
A gold-standard dataset is available for both projects and will be shared with students on a confidential basis.
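With a gold-standard dataset, the output of either project can be scored with precision, recall and F1. The sketch below assumes, purely for illustration, that both the gold standard and the predictions are available as sets of (product, patent) pairs. The actual format of the dataset is not described here, and the sample values are made up.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted items against gold-standard items (exact match)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: two of three predicted pairs are correct,
# and two of three gold pairs are found.
gold = {("WidgetPro", "7654321"), ("WidgetPro", "D612345"), ("GizmoMax", "8123456")}
predicted = {("WidgetPro", "7654321"), ("GizmoMax", "8123456"), ("GizmoMax", "9999999")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The same function applies to the classifier project if the sets contain the identifiers of pages labeled as VPM.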
The responsible professor is Gaétan de Rassenfosse (STIP lab) and the senior scientist is Dr. David Portabella. The projects are open to Bachelor's and Master's students in Computer Science, Communication Systems or Data Science. The exact scope of each project can be tailored to fit SIN/SSC semester projects (groups of two students are allowed) or SSC/SIN/DS Master's projects.