Volume 47, Issue 12, 2010, Pages 2025-2036

A survey of Web page cleaning research

Author keywords

Data mining; Information retrieval; Web mining; Web page cleaning; WWW

Indexed keywords

CLEANING METHODS; DATA SETS; DE-NOISE; EXISTING PROBLEMS; EXPERIMENTAL METHODS; FUTURE DIRECTIONS; IN-DEPTH STUDY; MODEL-BASED; MULTI-MODEL; RAPID DEVELOPMENT; RESEARCH AND APPLICATION; WEB APPLICATION; WEB DATA; WEB MINING; WEB PAGE; WWW;

EID: 78650954485     PISSN: 10001239     EISSN: None     Source Type: Journal    
DOI: None     Document Type: Article
Times cited : (7)

References (48)
  • 4
    • 33751046629
    • Template detection for large scale search engines
    • New York: ACM
    • Chen L, Ye S, Li X. Template detection for large scale search engines[C]//Proc of the 2006 ACM Symp on Applied Computing. New York: ACM, 2006: 1094-1098
    • (2006) Proc of the ACM Symp on Applied Computing , pp. 1094-1098
    • Chen, L.1    Ye, S.2    Li, X.3
  • 9
    • 0033717069
    • Efficient deformable template detection and localization without user initialization
    • Coughlan J, Yuille A, English C, et al. Efficient deformable template detection and localization without user initialization[J]. Computer Vision Image Understanding, 2000, 22(78): 303-319
    • (2000) Computer Vision Image Understanding , vol.22 , Issue.78 , pp. 303-319
    • Coughlan, J.1    Yuille, A.2    English, C.3
  • 19
    • 78650961845
    • An algorithm for noise reduction in Web pages based on a group of content-related rules
    • Chinese source
    • Wang Jiandong, Wang Jimin, Tian Feijia. An algorithm for noise reduction in Web pages based on a group of content-related rules[J]. New Technology of Library and Information Service, 2008, 22(3): 51-54 (in Chinese)
    • (2008) New Technology of Library and Information Service , vol.22 , Issue.3 , pp. 51-54
    • Wang, J.1    Wang, J.2    Tian, F.3
  • 20
    • 78650947119
    • An approach to eliminate noise based on framework of Web pages and rules
    • Chinese source
    • Shi Daming, Lin Hongfei, Yang Zhihao, et al. An approach to eliminate noise based on framework of Web pages and rules[J]. Computer Engineering, 2007, 33(19): 276-278 (in Chinese)
    • (2007) Computer Engineering , vol.33 , Issue.19 , pp. 276-278
    • Shi, D.1    Lin, H.2    Yang, Z.3
  • 21
    • 33745118162
    • An HTML parser to improve Chinese search engines
    • Chinese source
    • Song Ruihua, Ma Shaoping, Chen Gang, et al. An HTML parser to improve Chinese search engines[J]. Journal of Chinese Information Processing, 2003, 17(4): 19-26 (in Chinese)
    • (2003) Journal of Chinese Information Processing , vol.17 , Issue.4 , pp. 19-26
    • Song, R.1    Ma, S.2    Chen, G.3
  • 25
    • 8644236286
    • VIPS: A vision based page segmentation algorithm, MSR-TR-2003-79
    • Seattle, USA: Microsoft, 2003-11, 2009-02-01
    • Cai D, Yu S, Wen J R, et al. VIPS: A vision based page segmentation algorithm, MSR-TR-2003-79[R/OL]. Seattle, USA: Microsoft, (2003-11) [2009-02-01]. http://research.microsoft.com/apps/pubs/default.aspx?id=70027
    • Cai, D.1    Yu, S.2    Wen, J.R.3
  • 26
    • 84880475213
    • Improving pseudo-relevance feedback in Web information retrieval using Web page segmentation
    • New York: ACM
    • Yu S, Cai D, Wen J R, et al. Improving pseudo-relevance feedback in Web information retrieval using Web page segmentation[C]//Proc of the 12th World Wide Web Conf. New York: ACM, 2003
    • (2003) Proc of the 12th World Wide Web Conf
    • Yu, S.1    Cai, D.2    Wen, J.R.3
  • 28
    • 19944426093
    • An algorithm for the elimination of the noise in Web pages based on visual layout information
    • Chinese source
    • Jing Tao, Zuo Wanli. An algorithm for the elimination of the noise in Web pages based on visual layout information[J]. Journal of South China University of Technology: Natural Science Edition, 2004, 32(Suppl 1): 84-88 (in Chinese)
    • (2004) Journal of South China University of Technology: Natural Science Edition , vol.32 , Issue.SUPPL.1 , pp. 84-88
    • Jing, T.1    Zuo, W.2
  • 29
    • 33846380652
    • Noise elimination method in Web pages based on the similarity of same layer pages
    • Chinese source
    • Yuan Mingxuan, Zhang Xuanping, Jiang Yu, et al. Noise elimination method in Web pages based on the similarity of same layer pages[J]. Computer Engineering, 2006, 32(23): 61-63 (in Chinese)
    • (2006) Computer Engineering , vol.32 , Issue.23 , pp. 61-63
    • Yuan, M.1    Zhang, X.2    Jiang, Y.3
  • 30
    • 33644844029
    • Framework of Web page analysis and content extraction with coordinate trees
    • Chinese source
    • Feng Huamin, Liu Biao, Liu Yanmin, et al. Framework of Web page analysis and content extraction with coordinate trees[J]. Journal of Tsinghua University: Science and Technology, 2005, 45(9): 1767-1771 (in Chinese)
    • (2005) Journal of Tsinghua University: Science and Technology , vol.45 , Issue.9 , pp. 1767-1771
    • Feng, H.1    Liu, B.2    Liu, Y.3
  • 33
    • 33947621611
    • Automatic identification of informative sections of Web pages
    • Debnath S, Mitra P, Pal N, et al. Automatic identification of informative sections of Web pages[J]. IEEE Trans on Knowledge and Data Engineering, 2005, 2(17): 1233-1246
    • (2005) IEEE Trans on Knowledge and Data Engineering , vol.2 , Issue.17 , pp. 1233-1246
    • Debnath, S.1    Mitra, P.2    Pal, N.3
  • 34
    • 19944413623
    • WISDOM: Web intrapage informative structure mining based on document object model
    • Kao H Y, Ho J M, Chen M S. WISDOM: Web intrapage informative structure mining based on document object model[J]. IEEE Trans on Knowledge and Data Engineering, 2005, 2(17): 614-627
    • (2005) IEEE Trans on Knowledge and Data Engineering , vol.2 , Issue.17 , pp. 614-627
    • Kao, H.Y.1    Ho, J.M.2    Chen, M.S.3
  • 35
    • 78650929867
    • Web pages noise removal based on focused topics
    • Chinese source
    • Wan Le, Zuo Wanli, Gao Jin. Web pages noise removal based on focused topics[J]. Computer Engineering and Design, 2008, 29(8): 2072-2076 (in Chinese)
    • (2008) Computer Engineering and Design , vol.29 , Issue.8 , pp. 2072-2076
    • Wan, L.1    Zuo, W.2    Gao, J.3
  • 37
    • 0007457124
    • Algorithms for a class of isotonic regression problems
    • Pardalos P M, Xue G. Algorithms for a class of isotonic regression problems[J]. Algorithmica, 1999, 2(23): 211-222
    • (1999) Algorithmica , vol.2 , Issue.23 , pp. 211-222
    • Pardalos, P.M.1    Xue, G.2
  • 39
    • 34249957970
    • Active set algorithms for isotonic regression: A unifying framework
    • Best M J, Chakravarti N. Active set algorithms for isotonic regression: A unifying framework[J]. Mathematical Programming, 1990, 47(1): 425-439
    • (1990) Mathematical Programming , vol.47 , Issue.1 , pp. 425-439
    • Best, M.J.1    Chakravarti, N.2
  • 40
    • 0027855159
    • An O(n^3 log n) strong polynomial algorithm for an isotonic regression knapsack problem
    • Best M J, Tan R Y. An O(n^3 log n) strong polynomial algorithm for an isotonic regression knapsack problem[J]. Optimization Theory and Applications, 1993, 79(3): 463-478
    • (1993) Optimization Theory and Applications , vol.79 , Issue.3 , pp. 463-478
    • Best, M.J.1    Tan, R.Y.2
  • 41
    • 51849102892
    • Primary content extraction with mountain model
    • Piscataway, NJ: IEEE
    • Bing L, Wang Y, Zhang Y, et al. Primary content extraction with mountain model[C]//Proc of the IEEE CIT2008. Piscataway, NJ: IEEE, 2008: 479-484
    • (2008) Proc of the IEEE CIT2008 , pp. 479-484
    • Bing, L.1    Wang, Y.2    Zhang, Y.3
  • 42
    • 34250678139
    • Verifying genre-based clustering approach to content extraction
    • New York: ACM
    • Gupta S, Becker H, Kaiser G, et al. Verifying genre-based clustering approach to content extraction[C]//Proc of the 15th Int Conf on World Wide Web. New York: ACM, 2006: 875-876
    • (2006) Proc of the 15th Int Conf on World Wide Web , pp. 875-876
    • Gupta, S.1    Becker, H.2    Kaiser, G.3
  • 44
    • 33644841529
    • Topic information extraction from template Web pages
    • Chinese source
    • Ou Jianwen, Dong Shoubin, Cai Bin. Topic information extraction from template Web pages[J]. Journal of Tsinghua University: Science and Technology, 2005, 45(9): 1743-1747 (in Chinese)
    • (2005) Journal of Tsinghua University: Science and Technology , vol.45 , Issue.9 , pp. 1743-1747
    • Ou, J.1    Dong, S.2    Cai, B.3
  • 45
    • 78650936753
    • CWT: Chinese Web test collection
    • Chinese source. 2004-06, 2009-11-25
    • Peking University. CWT: Chinese Web Test Collection[EB/OL]. (2004-06) [2009-11-25]. http://www.cwirf.org/SharedRes/DataSet/cwt.html (in Chinese)
  • 46
    • 78650928406
    • A block-analysis-based approach to eliminate noise in Web pages
    • Chinese source
    • Liu Chenxi, Wu Yangyang. A block-analysis-based approach to eliminate noise in Web pages[J]. Journal of Guangxi Normal University (Natural Science Edition), 2007, 25(2): 61-63 (in Chinese)
    • (2007) Journal of Guangxi Normal University (Natural Science Edition) , vol.25 , Issue.2 , pp. 61-63
    • Liu, C.1    Wu, Y.2
  • 47
    • 78650958363
    • The data resources of Sogou labs
    • Chinese source. 2006-11-02, 2007-04-05
    • Sogou Labs. The data resources of Sogou labs[EB/OL]. (2006-11-02) [2007-04-05]. http://www.sogou.com/labs/resources.html (in Chinese)
  • 48
    • 78650932375
    • Web page text information extraction and result estimation
    • Chinese source
    • Zhang Heng, Qu Jinghui, Zhang Liang. Web page text information extraction and result estimation[J]. Microcomputer Applications, 2007, 28(9): 921-924 (in Chinese)
    • (2007) Microcomputer Applications , vol.28 , Issue.9 , pp. 921-924
    • Zhang, H.1    Qu, J.2    Zhang, L.3


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.