A Preprocessing Process for Web-Based Information Systems
This document was published by member bfxqt
57 pages, 29,240 characters in total
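Summary
With the rapid growth of the Web, the information available on it keeps getting richer. Because the Web is convenient to use and rich in information, people increasingly turn to it to find what they need. To make better use of this information, people keep pursuing technologies and systems that can organize and exploit online information effectively. However, information on the Web has many problems: Web pages contain a lot of noise content, the Web holds a large number of near-replica pages, and necessary metadata is missing. These problems seriously degrade the service quality of Web information systems.
To address the common requirements of Web information systems, this thesis proposes a preprocessing framework and corresponding methods. The framework comprises three preprocessing tasks: Web page cleaning, near-replica page removal, and Web page metadata extraction. Through preprocessing, near-replica pages in the raw page collection are removed, and the retained pages are cleaned and transformed into a unified structured model (called the DocView model). The model provides the metadata and content data most in demand across application areas; its elements include page identifier, page type, content category, title, keywords, abstract, main text, and related links. One important advantage of the proposed method is that it needs no information other than the raw Web pages, whereas other methods in this field require such additional information. Another advantage is that the common requirements of Web information systems are extracted once, in a single process, which avoids repeated execution of the same intermediate steps and thus improves extraction efficiency.
The preprocessing framework and methods proposed in this thesis have been applied to the "Tianwang" (天网) search engine and to an automatic Web page classification system. The improvement in the quality of these application systems after preprocessing verifies the effectiveness of the method. It is easy to see that, with such a preprocessing process, a well-organized, cleaned, and easier-to-use information layer can be built on top of any Web page collection (including the World Wide Web).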
Abstract
With the rapid expansion of the Web, its content becomes richer and richer. People increasingly use the Web to find the information they want because of its convenience and its abundance of information. To make better use of Web information, technologies that can automatically reorganize and manipulate Web pages are being pursued, such as Web information retrieval, Web page classification, and other Web mining work. However, the Web contains a great deal of noise, such as noise content within a Web page (local noise) and near-replica Web pages across the Web (global noise). This noise lowers the quality of information on the Web and consequently seriously degrades the quality of Web information systems. In addition, metadata about Web pages is widely used in Web information systems, but it is not described explicitly. Some of these problems never arose in traditional work.
In this thesis, we propose a new preprocessing framework and a corresponding approach to meet the common requirements of several typical Web information systems. The framework includes three parts: Web page cleaning, replica removal, and metadata extraction. After the preprocessing stage, redundant Web pages are deleted, and the retained Web pages are purified and transformed into a general model called DocView. The model consists of eight elements: identifier, type, content classification code, title, keywords, abstract, topic content, and relevant hyperlinks. Most of them are metadata, while the last two are content data. The main advantage of our approach is that it needs no information beyond the raw page, whereas previous related work usually requires additional information.
The preprocessing framework and approach have been applied to our search engine [TW] and to our Web page classification system. The clear improvement observed in these applications shows the practicability of the framework and verifies the validity of the approach. It is not difficult to see that, after such a preprocessing stage, we can set up a well-formed, purified, easily manipulated information layer on top of any Web page collection (including the WWW) for Web information systems.
Keywords: World Wide Web, Data preprocessing, Data cleaning, Near-replica detection, Metadata extraction
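The eight-element DocView model described in the abstract can be pictured as a simple record type. The following is only a minimal sketch under stated assumptions: the class name `DocView` comes from the abstract, but every field name, every type, and the example values are hypothetical choices made for illustration, and the thesis's actual representation may differ. The page-type labels in the comment are inferred from the section titles on topic, directory, and picture pages in the table of contents.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List

# The abstract states that the last two elements are content data and the
# other six are metadata. The field names themselves are assumptions.
CONTENT_FIELDS = ("topic_content", "relevant_hyperlinks")


@dataclass
class DocView:
    identifier: str                # page identifier
    page_type: str                 # e.g. "topic", "directory", "picture" (assumed labels)
    content_class: str             # content classification code
    title: str
    keywords: List[str] = field(default_factory=list)
    abstract: str = ""
    topic_content: str = ""        # cleaned main text of the page
    relevant_hyperlinks: List[str] = field(default_factory=list)

    def metadata(self) -> Dict[str, object]:
        """Return the six metadata elements as a dictionary."""
        return {k: v for k, v in asdict(self).items() if k not in CONTENT_FIELDS}

    def content(self) -> Dict[str, object]:
        """Return the two content-data elements as a dictionary."""
        return {k: v for k, v in asdict(self).items() if k in CONTENT_FIELDS}


# Hypothetical example record; all values are made up for illustration.
doc = DocView(
    identifier="page-0001",
    page_type="topic",
    content_class="news",
    title="Example page",
    keywords=["web", "preprocessing"],
    abstract="A short summary of the page.",
    topic_content="The cleaned main text of the page.",
    relevant_hyperlinks=["http://example.com/related"],
)
```

Splitting the record into `metadata()` and `content()` mirrors the distinction the abstract draws between the two kinds of elements. An indexer could, for instance, consume only the metadata dictionary without touching the main text.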
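Contents
Chapter 1 Introduction
1.1 Research Background
1.2 Research Content of This Thesis
1.3 Contributions of This Thesis
1.4 Organization of This Thesis
Chapter 2 Related Work
2.1 Search Engines
2.2 Automatic Web Page Classification
2.3 Information Extraction
2.4 Metadata Extraction
Chapter 3 Problems Facing Web Information Systems and Their Common Requirements
Chapter 4 Preprocessing Methods and Techniques
4.1 Preprocessing Framework and Description of Results
4.1.1 Preprocessing Framework
4.1.2 Description of Preprocessing Results
4.2 Web Page Representation
4.2.1 Tag Tree Representation of Web Pages
4.2.2 Quantitative Representation of Web Pages
4.3 Web Page Cleaning
4.3.1 Web Page Type Identification
4.3.2 Cleaning Topic Pages
4.3.3 Cleaning Directory Pages
4.3.4 Cleaning Picture Pages
4.3.5 Time and Space Efficiency Analysis of Web Page Cleaning
4.4 Near-Replica Page Detection
4.4.1 Near-Replica Page Detection Algorithm
4.4.2 Performance Analysis
4.5 Web Page Metadata Extraction
4.5.1 Description of the Metadata Extraction Workflow
4.5.2 Main Text Extraction
4.5.3 Keyword Extraction
4.5.4 Content Category Identification
4.5.5 Title Extraction
4.5.6 Abstract Extraction
4.5.7 Extraction of Topic-Related Hyperlinks
4.6 Chapter Summary
Chapter 5 Applications and Evaluation
5.1 Application and Evaluation of Web Page Cleaning in the Automatic Web Page Classification System
5.1.1 Application
5.1.2 Evaluation Criteria
5.1.3 Evaluation Results and Analysis
5.2 Application and Evaluation of Near-Replica Page Elimination in the Search Engine
5.2.1 Experimental Design
5.2.2 Evaluation Criteria
5.2.3 Evaluation Results and Analysis
5.3 Application and Evaluation of Web Page Metadata in the Search Engine's Indexing Process
5.3.1 Retrieval Efficiency Evaluation
5.3.2 Retrieval Precision Evaluation
5.4 Chapter Summary
Chapter 6 Summary and Outlook
6.1 Summary
6.2 Outlook
References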
Keywords: World Wide Web, data preprocessing, data cleaning, near-replica Web page identification, metadata extraction
References
[ACMP] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. Searching the Web. ACM Transactions on Internet Technology, 2001
[APE] Allison Woodruff, Paul M. Aoki, Eric Brewer, Paul Gauthier, and Lawrence A. Rowe. An Investigation of Documents from the World Wide Web. In Proceedings of the 5th International World Wide Web Conference, pages 963-979, Paris, France, May 1996.
[Fabrizio] Fabrizio Sebastiani. A tutorial on automated text categorization.
[FSC] Feng Shicong. Research on automatic classification of Chinese Web pages and its application in search engines. PhD dissertation, Peking University.
[Google] Google Inc. http://www.google.com .
[HCB] D. Hawking, N. Craswell, P. Bailey, and K. Griffiths. Measuring search engine quality. Information Retrieval, 4(1):33-59, 2001.
[HCBG] D. Hawking, N. Craswell, P. Bailey, and K. Griffiths. Measuring search engine quality. Information Retrieval, 4(1):33-59, 2001.
[HD98] C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521-538, 1998.
[HITS] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[HMC] J. Hammer, H. Garcia-Molina, J. Cho, A. Crespo, and R. Aranha. Extracting semistructured information from the web. In Proceedings of the Workshop on Management of Semistructured Data, pages 18-25, May 1997.
[JW] Cowie, Jim and Lehnert, Wendy. Information Extraction. Communications of the ACM, January 1996/Vol. 39, No. 1, pp 80 – 91.
[LD] Lewis D et al. Training algorithms for linear text classifiers. In Proceedings of the Nineteenth International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp.298-306
[LG98] Steve Lawrence and C. Lee Giles. Searching the World Wide Web. Science, 280(5360):98-100, Apr. 1998.
[LH02] S.-H. Lin and J.-M. Ho. Discovering informative content blocks from web documents. SIGKDD, 2002.
[LS] L. Xiaoli and S. Zhongzhi. Innovating web page classification through reducing noise. Journal of Computer Science and Technology, 17(1), January 2002.
[Manber94] U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 1-10, San Francisco, CA, USA, 1994.
[PR] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.
[Ralph97] Grishman, Ralph. Information Extraction: Techniques and Challenges. Lecture Notes In Artificial Intelligence, Vol. 1299, pp 10 – 27, Springer-Verlag, Berlin Heidelberg, 1997. ISBN 3-540-63438-X
[SB] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
[SCAM] N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries, 1995.
[SM99] N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. In WEBDB: International Workshop on the World Wide Web and Databases, WebDB. LNCS, 1999.