Web Page Sectioning Using Regex-based Template
- Rupesh R. Mehta (Yahoo! R&D)
- Amit Madaan (Yahoo! R&D)
This work proposes a novel, site-specific algorithm for web page segmentation and section importance detection that leverages structural, content, and visual information. The structural and content information is captured by a template, a generalized regular expression learned over a set of pages. Combining the template with visual information yields high sectioning accuracy. Experimental results demonstrate the effectiveness of the approach.
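To make the idea of a template as a generalized regular expression concrete, the following is a minimal sketch only. The abstract does not specify the learning procedure, so everything here is an assumption made for illustration: pages are reduced to flat sequences of tag names, runs of a repeated tag are generalized to a one-or-more group, and the function names `collapse_repeats`, `learn_template`, and `encode` are hypothetical. It does not use content or visual information and should not be read as the authors' algorithm.

```python
import re
from itertools import groupby


def collapse_repeats(tags):
    """Collapse runs of the same tag into (tag, was_repeated) pairs."""
    return [(tag, len(list(run)) > 1) for tag, run in groupby(tags)]


def encode(tags):
    """Serialize a tag sequence as a string, one trailing space per tag."""
    return "".join(tag + " " for tag in tags)


def learn_template(pages):
    """Build a regex over tag sequences that matches every training page.

    Illustrative assumption: after collapsing repeats, all pages share the
    same ordered skeleton of tags. A tag that repeats on any page is
    generalized to a one-or-more group.
    """
    collapsed = [collapse_repeats(page) for page in pages]
    skeleton = [tag for tag, _ in collapsed[0]]
    for page in collapsed[1:]:
        if [tag for tag, _ in page] != skeleton:
            raise ValueError("pages do not share a common tag skeleton")

    parts = []
    for i, tag in enumerate(skeleton):
        repeated = any(page[i][1] for page in collapsed)
        token = re.escape(tag) + " "
        parts.append(f"(?:{token})+" if repeated else token)
    return re.compile("".join(parts))


# Two hypothetical pages from the same site, as flat tag sequences.
pages = [
    ["div", "h1", "p", "p", "ul"],
    ["div", "h1", "p", "ul"],
]
template = learn_template(pages)
```

With these two toy pages, the learned template accepts an unseen page from the same site that has three paragraphs, and rejects a page that lacks the heading, which is the sense in which a site-specific template generalizes over a set of pages.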
Inquiries can be sent to: