SCOPUS Information Retrieval Platform

Proceedings of the 19th International Conference on World Wide Web, WWW '10

2010, Pages 971-980

CETR - Content extraction via tag ratios

Authors (3): Weninger, Tim (a); Hsu, William H. (b); Han, Jiawei (a)

a University of Illinois at Urbana Champaign (United States)

b Kansas State University (United States)

Author keywords

content extraction; tag ratio; world wide web

Indexed keywords

ALTERNATIVE METHODS; CLUSTERING TECHNIQUES; CONTENT EXTRACTION; EXISTING METHOD; HTML DOCUMENTS; PRECISION AND RECALL; TWO DIMENSIONAL MODEL; TWO-DIMENSION; WEB CORPORA; WEB DOMAINS; WEB PAGE; GRAPHIC METHODS; WORLD WIDE WEB

EID: 77954569037
PISSN: None
EISSN: None
Source Type: Conference Proceeding
DOI: 10.1145/1772690.1772789
Document Type: Conference Paper

Times cited: 120

References (33)

1
- 77954576560
- IEEE Computer Society
- 19th International Workshop on Database and Expert Systems Applications (DEXA 2008), 1-5 September 2008, Turin, Italy. IEEE Computer Society, 2008.
- (2008) 19th International Workshop on Database and Expert Systems Applications (DEXA 2008), 1-5 September 2008, Turin, Italy

2
- 0032092761
- Nodose - A tool for semi-automatically extracting semi-structured data from text documents
- ACM Press
- B. Adelberg. Nodose - a tool for semi-automatically extracting semi-structured data from text documents. In SIGMOD Conference, pages 283-294. ACM Press, 1998.
- (1998) SIGMOD Conference , pp. 283-294
- Adelberg, B.¹

3
- 77953052174
- Template detection via data mining and its applications
- Z. Bar-Yossef and S. Rajagopalan. Template detection via data mining and its applications. In WWW, pages 580-591, 2002.
- (2002) WWW , pp. 580-591
- Bar-Yossef, Z.¹ Rajagopalan, S.²

4
- 0035029462
- Accordion summarization for end-game browsing on pdas and cellular phones
- O. Buyukkokten, H. Garcia-Molina, and A. Paepcke. Accordion summarization for end-game browsing on pdas and cellular phones. In CHI, pages 213-220, 2001.
- (2001) CHI , pp. 213-220
- Buyukkokten, O.¹ Garcia-Molina, H.² Paepcke, A.³

5
- 8644241107
- Block-level link analysis
- ACM
- D. Cai, X. He, J.-R. Wen, and W.-Y. Ma. Block-level link analysis. In SIGIR, pages 440-447. ACM, 2004.
- (2004) SIGIR , pp. 440-447
- Cai, D.¹ He, X.² Wen, J.-R.³ Ma, W.-Y.⁴

6
- 21144444733
- Extracting content structure for web pages based on visual representation
- APWeb, Springer
- D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma. Extracting content structure for web pages based on visual representation. In APWeb, volume 2642 of Lecture Notes in Computer Science, pages 406-417. Springer, 2003.
- (2003) Lecture Notes in Computer Science , vol.2642 , pp. 406-417
- Cai, D.¹ Yu, S.² Wen, J.-R.³ Ma, W.-Y.⁴

7
- 18744372151
- Misuse detection for information retrieval systems
- ACM
- R. Cathey, L. Ma, N. Goharian, and D. A. Grossman. Misuse detection for information retrieval systems. In CIKM, pages 183-190. ACM, 2003.
- (2003) CIKM , pp. 183-190
- Cathey, R.¹ Ma, L.² Goharian, N.³ Grossman, D.A.⁴

8
- 84893227548
- Function-based object model towards website adaptation
- ACM Press
- J. Chen, B. Zhou, and H. Zhang. Function-based object model towards website adaptation. In Proceedings of the 10th International World Wide Web Conference, pages 587-596. ACM Press, 2001.
- (2001) Proceedings of the 10th International World Wide Web Conference , pp. 587-596
- Chen, J.¹ Zhou, B.² Zhang, H.³

9
- 33751046629
- Template detection for large scale search engines
- ACM
- L. Chen, S. Ye, and X. Li. Template detection for large scale search engines. In SAC, pages 1094-1098. ACM, 2006.
- (2006) SAC , pp. 1094-1098
- Chen, L.¹ Ye, S.² Li, X.³

10
- 4644340823
- Automatic web news extraction using tree edit distance
- ACM
- D. de Castro Reis, P. B. Golgher, A. S. da Silva, and A. H. F. Laender. Automatic web news extraction using tree edit distance. In WWW, pages 502-511. ACM, 2004.
- (2004) WWW , pp. 502-511
- De Castro Reis, D.¹ Golgher, P.B.² Da Silva, A.S.³ Laender, A.H.F.⁴

11
- 26844469211
- Automatic extraction of informative blocks from webpages
- ACM
- S. Debnath, P. Mitra, and C. L. Giles. Automatic extraction of informative blocks from webpages. In SAC, pages 1722-1726. ACM, 2005.
- (2005) SAC , pp. 1722-1726
- Debnath, S.¹ Mitra, P.² Giles, C.L.³

12
- 26944496810
- Identifying content blocks from web documents
- ISMIS, Springer
- S. Debnath, P. Mitra, and C. L. Giles. Identifying content blocks from web documents. In ISMIS, volume 3488 of Lecture Notes in Computer Science, pages 285-293. Springer, 2005.
- (2005) Lecture Notes in Computer Science , vol.3488 , pp. 285-293
- Debnath, S.¹ Mitra, P.² Giles, C.L.³

13
- 50149101261
- Fact or fiction: Content classification for digital libraries
- A. Finn, N. Kushmerick, and B. Smyth. Fact or fiction: Content classification for digital libraries. In DELOS Workshop: Personalization and Recommender Systems in Digital Libraries, 2001.
- DELOS Workshop: Personalization and Recommender Systems in Digital Libraries, 2001
- Finn, A.¹ Kushmerick, N.² Smyth, B.³

14
- 57849154238
- Evaluating content extraction on html documents
- T. Gottron. Evaluating content extraction on html documents. In ITA, pages 123-132, 2007.
- (2007) ITA , pp. 123-132
- Gottron, T.¹

15
- 70349131454
- Combining content extraction heuristics: The CombinE system
- ACM
- T. Gottron. Combining content extraction heuristics: the CombinE system. In iiWAS, pages 591-595. ACM, 2008.
- (2008) iiWAS , pp. 591-595
- Gottron, T.¹

16
- 57849123029
- Content code blurring: A new approach to content extraction
- T. Gottron. Content code blurring: A new approach to content extraction. In DEXA Workshops [1], pages 29-33.
- DEXA Workshops [1] , pp. 29-33
- Gottron, T.¹

17
- 14844363192
- Automating content extraction of html documents
- S. Gupta, G. E. Kaiser, P. Grimm, M. F. Chiang, and J. Starren. Automating content extraction of html documents. World Wide Web, 8(2):179-224, 2005.
- (2005) World Wide Web , vol.8 , Issue.2 , pp. 179-224
- Gupta, S.¹ Kaiser, G.E.² Grimm, P.³ Chiang, M.F.⁴ Starren, J.⁵

18
- 84880498138
- Dom-based content extraction of html documents
- S. Gupta, G. E. Kaiser, D. Neistadt, and P. Grimm. Dom-based content extraction of html documents. In WWW, pages 207-214, 2003.
- (2003) WWW , pp. 207-214
- Gupta, S.¹ Kaiser, G.E.² Neistadt, D.³ Grimm, P.⁴

19
- 38149067816
- Extracting context to improve accuracy for html content extraction
- ACM
- S. Gupta, G. E. Kaiser, and S. J. Stolfo. Extracting context to improve accuracy for html content extraction. In WWW (Special interest tracks and posters), pages 1114-1115. ACM, 2005.
- (2005) WWW (Special Interest Tracks and Posters) , pp. 1114-1115
- Gupta, S.¹ Kaiser, G.E.² Stolfo, S.J.³

20
- 0002985122
- Wrapping web data into xml
- W. Han, D. Buttler, and C. Pu. Wrapping web data into xml. SIGMOD Rec., 30(3):33-38, 2001.
- (2001) SIGMOD Rec. , vol.30 , Issue.3 , pp. 33-38
- Han, W.¹ Buttler, D.² Pu, C.³

21
- 0742268832
- Mining web informative structures and contents based on entropy analysis
- H.-Y. Kao, S.-H. Lin, J.-M. Ho, and M.-S. Chen. Mining web informative structures and contents based on entropy analysis. IEEE Trans. Knowl. Data Eng., 16(1):41-55, 2004.
- (2004) IEEE Trans. Knowl. Data Eng. , vol.16 , Issue.1 , pp. 41-55
- Kao, H.-Y.¹ Lin, S.-H.² Ho, J.-M.³ Chen, M.-S.⁴

22
- 0034172374
- Wrapper induction: Efficiency and expressiveness
- N. Kushmerick. Wrapper induction: efficiency and expressiveness. Artificial Intelligence, 118(1-2):15-68, 2000.
- (2000) Artificial Intelligence , vol.118 , Issue.1-2 , pp. 15-68
- Kushmerick, N.¹

23
- 0242456776
- Discovering informative content blocks from web documents
- ACM
- S.-H. Lin and J.-M. Ho. Discovering informative content blocks from web documents. In KDD, pages 588-593. ACM, 2002.
- (2002) KDD , pp. 588-593
- Lin, S.-H.¹ Ho, J.-M.²

24
- 0001457509
- Some methods for classification and analysis of multivariate observations
- J. MacQueen. Some methods for classification and analysis of multivariate observations. In Berkeley Symposium on Mathematical Statistics and Probability, pages 281-297, 1967.
- (1967) Berkeley Symposium on Mathematical Statistics and Probability , pp. 281-297
- MacQueen, J.¹

25
- 77950332467
- Separating xhtml content from navigation clutter using dom-structure block analysis
- ACM
- C. Mantratzis, M. A. Orgun, and S. Cassidy. Separating xhtml content from navigation clutter using dom-structure block analysis. In S. Reich and M. Tzagarakis, editors, Hypertext, pages 145-147. ACM, 2005.
- (2005) Hypertext , pp. 145-147
- Mantratzis, C.¹ Orgun, M.A.² Cassidy, S.³

26
- 84938812620
- Template detection through conditional random fields
- M. Marek, P. Pecina, and M. Spousta. Template detection through conditional random fields. In WAC3, 2007.
- (2007) WAC3
- Marek, M.¹ Pecina, P.² Spousta, M.³

27
- 0035587215
- Hierarchical wrapper induction for semistructured information sources
- I. Muslea, S. Minton, and C. A. Knoblock. Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1-2):93-114, 2001.
- (2001) Autonomous Agents and Multi-Agent Systems , vol.4 , Issue.1-2 , pp. 93-114
- Muslea, I.¹ Minton, S.² Knoblock, C.A.³

28
- 84865651487
- Extracting article text from the web with maximum subsequence segmentation
- ACM
- J. Pasternack and D. Roth. Extracting article text from the web with maximum subsequence segmentation. In WWW, pages 971-980. ACM, 2009.
- (2009) WWW , pp. 971-980
- Pasternack, J.¹ Roth, D.²

29
- 0036989234
- Quasm: A system for question answering using semi-structured data
- ACM
- D. Pinto, M. Branstein, R. Coleman, W. B. Croft, M. King, W. Li, and X. Wei. Quasm: a system for question answering using semi-structured data. In JCDL, pages 46-55. ACM, 2002.
- (2002) JCDL , pp. 46-55
- Pinto, D.¹ Branstein, M.² Coleman, R.³ Croft, W.B.⁴ King, M.⁵ Li, W.⁶ Wei, X.⁷

30
- 0038144389
- Content extraction from html documents
- A. F. R. Rahman, H. Alam, and R. Hartono. Content extraction from html documents. In WDA, pages 7-10, 2001.
- (2001) WDA , pp. 7-10
- Rahman, A.F.R.¹ Alam, H.² Hartono, R.³

31
- 58849102735
- Toward 2w, beyond web 2.0
- T. V. Raman. Toward 2w, beyond web 2.0. Commun. ACM, 52(2):52-59, 2009.
- (2009) Commun. ACM , vol.52 , Issue.2 , pp. 52-59
- Raman, T.V.¹

32
- 57849147691
- Text extraction from the web via text-to-tag ratio
- T. Weninger and W. H. Hsu. Text extraction from the web via text-to-tag ratio. In DEXA Workshops [1], pages 23-28.
- DEXA Workshops [1] , pp. 23-28
- Weninger, T.¹ Hsu, W.H.²

33
- 77952370025
- Eliminating noisy information in web pages for data mining
- ACM
- L. Yi, B. Liu, and X. Li. Eliminating noisy information in web pages for data mining. In KDD, pages 296-305. ACM, 2003.
- (2003) KDD , pp. 296-305
- Yi, L.¹ Liu, B.² Li, X.³

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.