
4M1-IOS-3c-2 FivaTech2: A Supervised Approach to Role Differentiation for Web Data Extraction From Template pages

*Please refrain from streaming video of sessions without permission.


June 15 (Fri) 09:00-12:20, Room M (Yamaguchi-ken Jichi Kaikan / Large Conference Room (80))
4M1-IOS-3c International Organized Session "Special Session on Web Intelligence & Data Mining (3)"

Presentation No.: 4M1-IOS-3c-2
Title: FivaTech2: A Supervised Approach to Role Differentiation for Web Data Extraction From Template pages
Authors: Chia-Hui Chang (National Central University, Taiwan)
Chih-Hao Chang (National Central University, Taiwan)
Mohammed Kayed (Beni-Suef University, Egypt)
Time: June 15 (Fri) 09:30-10:00
Abstract: A huge amount of consolidated information on the World Wide Web is embedded in HTML pages that are generated dynamically from databases through search forms. This paper proposes FiVaTech2, a page-level web data extraction system that automatically extracts the schema and templates from such template-based web pages. FiVaTech2 is an extension of our previous page-level web data extraction system, FiVaTech. It uses a machine learning (ML) based method that compares pairs of HTML tags to estimate how likely they are to be present in the web pages. We use the J48 decision tree classifier and also use image comparison to assist template detection. Each HTML tag in a web page has several features, which can be divided into three types: visual information, DOM tree information, and HTML tag contents. Our experiments show encouraging results on the test pages when the three types of tag features are combined. They also show that FiVaTech2 performs better and is more efficient than FiVaTech.
Paper PDF file
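
The tag-pair comparison described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the `Tag` fields, the five pair features, and the "same role" labels are all hypothetical, a one-level decision stump stands in for the J48 decision tree, and the image comparison step is omitted. It only shows how features of the three types (visual, DOM tree, tag contents) might be turned into one vector per tag pair and passed to a tree-style classifier.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    # Hypothetical per-tag record; the field names are illustrative only.
    name: str    # tag contents: element name
    text: str    # tag contents: text inside the element
    depth: int   # DOM tree information: depth in the tree
    path: str    # DOM tree information: path from the root
    width: int   # visual information: rendered box width
    height: int  # visual information: rendered box height


def pair_features(a, b):
    """Build one numeric vector for a tag pair, covering all three feature types."""
    same_size = float(a.width == b.width and a.height == b.height)  # visual
    same_path = float(a.path == b.path)                             # DOM tree
    depth_gap = float(abs(a.depth - b.depth))                       # DOM tree
    same_name = float(a.name == b.name)                             # tag contents
    same_text = float(a.text == b.text)                             # tag contents
    return [same_size, same_path, depth_gap, same_name, same_text]


def train_stump(X, y):
    """Find the single (feature, threshold) split with the fewest training errors.

    Returns (feature, threshold, le_label): predict le_label when
    x[feature] <= threshold, and the opposite label otherwise.
    This is a stand-in for a full decision tree such as J48.
    """
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for le_label in (0, 1):
                errors = sum(
                    (le_label if row[f] <= t else 1 - le_label) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, le_label)
    return best[1:]


def predict(stump, x):
    f, t, le_label = stump
    return le_label if x[f] <= t else 1 - le_label


# Toy tags; label 1 means "assumed to play the same role" (hypothetical labeling).
td1 = Tag("td", "Price", 5, "html/body/table/tr/td", 120, 20)
td2 = Tag("td", "Price", 5, "html/body/table/tr/td", 120, 20)
td3 = Tag("td", "$9.99", 5, "html/body/table/tr/td", 120, 20)
div = Tag("div", "Welcome", 2, "html/body/div", 800, 60)

X = [pair_features(td1, td2), pair_features(td1, td3),
     pair_features(td1, div), pair_features(td3, div)]
y = [1, 1, 0, 0]

stump = train_stump(X, y)
print(predict(stump, pair_features(td2, td3)))  # classify an unseen pair
print(predict(stump, pair_features(td2, div)))
```

In practice the training pairs would come from labeled pages, and a full decision tree learner would replace the stump; the feature vector layout is the part the sketch is meant to convey.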