Volume 70, 2014, Pages 301-323

Web data extraction, applications and techniques: A survey

Author keywords

Business intelligence; Information retrieval; Knowledge engineering; Knowledge based systems; Web data mining; Web information extraction

Indexed keywords

BEHAVIORAL RESEARCH; COMPETITION; COMPETITIVE INTELLIGENCE; INFORMATION RETRIEVAL; KNOWLEDGE BASED SYSTEMS; KNOWLEDGE ENGINEERING; SEARCH ENGINES; SOCIAL NETWORKING (ONLINE); SURVEYS

EID: 84908485191     PISSN: 09507051     EISSN: None     Source Type: Journal    
DOI: 10.1016/j.knosys.2014.07.007     Document Type: Article
Times cited: 264

References (130)
  • 13
    • 0002652285
    • A maximum entropy approach to natural language processing
    • A. Berger, V.D. Pietra, and S.D. Pietra A maximum entropy approach to natural language processing Comput. Linguist. 22 1 1996 39 71
    • (1996) Comput. Linguist. , vol.22 , Issue.1 , pp. 39-71
    • Berger, A.1    Pietra, V.D.2    Pietra, S.D.3
  • 15
    • 33645136546
    • The power of a good idea: Quantitative modeling of the spread of ideas from epidemiological models
    • L. Bettencourt, A. Cintrón-Arias, D. Kaiser, and C. Castillo-Chávez The power of a good idea: quantitative modeling of the spread of ideas from epidemiological models Phys. A: Stat. Mech. Appl. 364 2006 513 536
    • (2006) Phys. A: Stat. Mech. Appl. , vol.364 , pp. 513-536
    • Bettencourt, L.1    Cintrón-Arias, A.2    Kaiser, D.3    Castillo-Chávez, C.4
  • 16
    • 2342598419
    • Bottom-up relational learning of pattern matching rules for information extraction
    • M. Califf, and R. Mooney Bottom-up relational learning of pattern matching rules for information extraction J. Machine Learning Res. 4 2003 177 210
    • (2003) J. Machine Learning Res. , vol.4 , pp. 177-210
    • Califf, M.1    Mooney, R.2
  • 22
    • 0036885533
    • Ci spider: A tool for competitive intelligence on the web
    • H. Chen, M. Chau, and D. Zeng Ci spider: a tool for competitive intelligence on the web Decis. Support Syst. 34 1 2002 1 17
    • (2002) Decis. Support Syst. , vol.34 , Issue.1 , pp. 1-17
    • Chen, H.1    Chau, M.2    Zeng, D.3
  • 23
    • 0345861203
    • New algorithm for ordered tree-to-tree correction problem
    • W. Chen New algorithm for ordered tree-to-tree correction problem J. Algor. 40 2 2001 135 158
    • (2001) J. Algor. , vol.40 , Issue.2 , pp. 135-158
    • Chen, W.1
  • 24
    • 84941301490
    • A new statistical parser based on bigram lexical dependencies
    • Association for Computational Linguistics
    • M. Collins A new statistical parser based on bigram lexical dependencies Proc. 34th annual meeting on Association for Computational Linguistics 1996 Association for Computational Linguistics 184 191
    • (1996) Proc. 34th Annual Meeting on Association for Computational Linguistics , pp. 184-191
    • Collins, M.1
  • 25
    • 84874631930
    • The geospatial characteristics of a social movement communication network
    • M.D. Conover, C. Davis, E. Ferrara, K. McKelvey, F. Menczer, and A. Flammini The geospatial characteristics of a social movement communication network PloS One 8 3 2013 e55957
    • (2013) PloS One , vol.8 , Issue.3 , pp. e55957
    • Conover, M.D.1    Davis, C.2    Ferrara, E.3    McKelvey, K.4    Menczer, F.5    Flammini, A.6
  • 27
    • 12344333240
    • Automatic information extraction from large websites
    • V. Crescenzi, and G. Mecca Automatic information extraction from large websites J. ACM 51 5 2004 731 779
    • (2004) J. ACM , vol.51 , Issue.5 , pp. 731-779
    • Crescenzi, V.1    Mecca, G.2
  • 28
    • 84944327150
    • Roadrunner: Towards automatic data extraction from large web sites
    • Morgan Kaufmann Publishers Inc. San Francisco, CA, USA
    • V. Crescenzi, G. Mecca, and P. Merialdo Roadrunner: towards automatic data extraction from large web sites Proc. 27th International Conference on Very Large Data Bases 2001 Morgan Kaufmann Publishers Inc. San Francisco, CA, USA 109 118
    • (2001) Proc. 27th International Conference on Very Large Data Bases , pp. 109-118
    • Crescenzi, V.1    Mecca, G.2    Merialdo, P.3
  • 29
    • 84873902041
    • Improving the expressiveness of roadrunner
    • V. Crescenzi, G. Mecca, P. Merialdo, Improving the expressiveness of roadrunner, in: SEBD, 2004, pp. 62-69.
    • (2004) SEBD , pp. 62-69
    • Crescenzi, V.1    Mecca, G.2    Merialdo, P.3
  • 31
    • 84861026711
    • Automatic wrappers for large scale Web extraction
    • N. Dalvi, R. Kumar, and M. Soliman Automatic wrappers for large scale Web extraction Proc. VLDB Endowment 4 4 2011 219 230
    • (2011) Proc. VLDB Endowment , vol.4 , Issue.4 , pp. 219-230
    • Dalvi, N.1    Kumar, R.2    Soliman, M.3
  • 32
    • 9444244198
    • Mining the peanut gallery: Opinion extraction and semantic classification of product reviews
    • Budapest, Hungary
    • K. Dave, S. Lawrence, D. Pennock, Mining the peanut gallery: opinion extraction and semantic classification of product reviews, in: Proc. of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003, pp. 519-528.
    • (2003) Proc. of the 12th International Conference on World Wide Web , pp. 519-528
    • Dave, K.1    Lawrence, S.2    Pennock, D.3
  • 35
    • 70349665368
    • Position paper: Secure infrastructure for scientific data life cycle management
    • M. Descher, T. Feilhauer, T. Ludescher, P. Masser, B. Wenzel, P. Brezany, I. Elsayed, A. Wöhrer, A.M. Tjoa, D. Huemer, Position paper: secure infrastructure for scientific data life cycle management, in: ARES, 2009, pp. 606-611.
    • (2009) ARES , pp. 606-611
    • Descher, M.1
  • 38
    • 84875863990
    • A large-scale community structure analysis in facebook
    • E. Ferrara A large-scale community structure analysis in facebook EPJ Data Sci. 1 9 2012 1 30
    • (2012) EPJ Data Sci. , vol.1 , Issue.9 , pp. 1-30
    • Ferrara, E.1
  • 46
    • 0033907729
    • Machine learning for information extraction in informal domains
    • D. Freitag Machine learning for information extraction in informal domains Machine Learning 39 2 2000 169 202
    • (2000) Machine Learning , vol.39 , Issue.2 , pp. 169-202
    • Freitag, D.1
  • 48
    • 84861058689
    • OXPath: A language for scalable, memory-efficient data extraction from web applications
    • T. Furche, G. Gottlob, G. Grasso, C. Schallhart, and A. Sellers OXPath: a language for scalable, memory-efficient data extraction from web applications Proc. VLDB Endowment 4 11 2011 1016 1027
    • (2011) Proc. VLDB Endowment , vol.4 , Issue.11 , pp. 1016-1027
    • Furche, T.1    Gottlob, G.2    Grasso, G.3    Schallhart, C.4    Sellers, A.5
  • 55
    • 4444361310
    • Logic-based web information extraction
    • G. Gottlob, and C. Koch Logic-based web information extraction SIGMOD Rec. 33 2 2004 87 94
    • (2004) SIGMOD Rec. , vol.33 , Issue.2 , pp. 87-94
    • Gottlob, G.1    Koch, C.2
  • 56
    • 3142784673
    • Monadic datalog and the expressive power of languages for web information extraction
    • G. Gottlob, and C. Koch Monadic datalog and the expressive power of languages for web information extraction J. ACM 51 1 2004 74 113
    • (2004) J. ACM , vol.51 , Issue.1 , pp. 74-113
    • Gottlob, G.1    Koch, C.2
  • 59
    • 79958085453
    • Tweets from Justin Bieber's heart: The dynamics of the location field in user profiles
    • ACM Vancouver, British Columbia, Canada
    • B. Hecht, L. Hong, B. Suh, and E. Chi Tweets from Justin Bieber's heart: the dynamics of the location field in user profiles Proc. International Conference on Human Factors in Computing Systems 2011 ACM Vancouver, British Columbia, Canada 237 246
    • (2011) Proc. International Conference on Human Factors in Computing Systems , pp. 237-246
    • Hecht, B.1    Hong, L.2    Suh, B.3    Chi, E.4
  • 60
    • 0032309862
    • Generating finite-state transducers for semi-structured data extraction from the web
    • C.-N. Hsu, and M.-T. Dung Generating finite-state transducers for semi-structured data extraction from the web Inf. Syst. 23 9 1998 521 538
    • (1998) Inf. Syst. , vol.23 , Issue.9 , pp. 521-538
    • Hsu, C.-N.1    Dung, M.-T.2
  • 63
  • 64
  • 68
    • 0033726520
    • The small-world phenomenon: An algorithm perspective
    • ACM Portland, Oregon, USA
    • J. Kleinberg The small-world phenomenon: an algorithm perspective Proc. ACM symposium on Theory of Computing 2000 ACM Portland, Oregon, USA 163 170
    • (2000) Proc. ACM Symposium on Theory of Computing , pp. 163-170
    • Kleinberg, J.1
  • 69
    • 55149112127
    • The convergence of social and technological networks
    • J. Kleinberg The convergence of social and technological networks Commun. ACM 51 11 2008 66 72
    • (2008) Commun. ACM , vol.51 , Issue.11 , pp. 66-72
    • Kleinberg, J.1
  • 75
    • 80054792505
    • A versatile model for web page representation, information extraction and content re-packaging
    • B. Krüpl-Sypien, R.R. Fayzrakhmanov, W. Holzinger, M. Panzenböck, R. Baumgartner, A versatile model for web page representation, information extraction and content re-packaging, in: ACM Symposium on Document Engineering, 2011, pp. 129-138.
    • (2011) ACM Symposium on Document Engineering , pp. 129-138
    • Krüpl-Sypien, B.1
  • 78
    • 0034172374
    • Wrapper induction: Efficiency and expressiveness
    • N. Kushmerick Wrapper induction: efficiency and expressiveness Artif. Intell. 118 1-2 2000 15 68
    • (2000) Artif. Intell. , vol.118 , Issue.1-2 , pp. 15-68
    • Kushmerick, N.1
  • 82
    • 35348870148
    • Structured data extraction: Wrapper generation
    • B. Liu Structured data extraction: wrapper generation Web Data Min. 2011 363 423
    • (2011) Web Data Min. , pp. 363-423
    • Liu, B.1
  • 85
    • 84882262432
    • Web archiving: Issues and methods
    • J. Masanès Web archiving: issues and methods Web Archiv. 2006 1 53
    • (2006) Web Archiv. , pp. 1-53
    • Masanès, J.1
  • 86
    • 34250615860
    • Folksonomies-cooperative classification and communication through shared metadata
    • A. Mathes Folksonomies-cooperative classification and communication through shared metadata Comput. Mediat. Commun. 47 10 2004
    • (2004) Comput. Mediat. Commun. , vol.47 , Issue.10
    • Mathes, A.1
  • 87
    • 84958652430
    • A unified framework for wrapping, mediating and restructuring information from the Web
    • W. May, R. Himmeröder, G. Lausen, B. Ludäscher, A unified framework for wrapping, mediating and restructuring information from the Web, in: Advances in Conceptual Modeling, Sprg. LNCS 1727, 1999, pp. 307-320.
    • (1999) Advances in Conceptual Modeling, Sprg. LNCS , vol.1727 , pp. 307-320
    • May, W.1
  • 91
    • 0002089617
    • Matching algorithm within a duplicate detection system
    • A.E. Monge Matching algorithm within a duplicate detection system IEEE Tech. Bull. Data Eng. 23 4 2000
    • (2000) IEEE Tech. Bull. Data Eng. , vol.23 , Issue.4
    • Monge, A.E.1
  • 95
    • 80052221516
    • Sxpath - Extending xpath towards spatial querying on web documents
    • E. Oro, M. Ruffolo, and S. Staab Sxpath - extending xpath towards spatial querying on web documents PVLDB 4 2 2010 129 140
    • (2010) PVLDB , vol.4 , Issue.2 , pp. 129-140
    • Oro, E.1    Ruffolo, M.2    Staab, S.3
  • 96
    • 77956031473
    • A survey on transfer learning
    • S. Pan, and Q. Yang A survey on transfer learning IEEE Trans. Knowl. Data Eng. 22 10 2010 1345 1359
    • (2010) IEEE Trans. Knowl. Data Eng. , vol.22 , Issue.10 , pp. 1345-1359
    • Pan, S.1    Yang, Q.2
  • 98
    • 34250167284
    • Automated data extraction from the web with conditional models
    • X.H. Phan, S. Horiguchi, and T. Ho Automated data extraction from the web with conditional models Int. J. Bus. Intell. Data Min. 1 2 2005 194 209
    • (2005) Int. J. Bus. Intell. Data Min. , vol.1 , Issue.2 , pp. 194-209
    • Phan, X.H.1    Horiguchi, S.2    Ho, T.3
  • 100
    • 0035657983
    • A survey of approaches to automatic schema matching
    • E. Rahm, and P. Bernstein A survey of approaches to automatic schema matching The VLDB J. 10 4 2001 334 350
    • (2001) The VLDB J. , vol.10 , Issue.4 , pp. 334-350
    • Rahm, E.1    Bernstein, P.2
  • 101
    • 0002490026
    • Data cleaning: Problems and current approaches
    • E. Rahm, and H.H. Do Data cleaning: Problems and current approaches IEEE Bull. Data Eng. 23 4 2000
    • (2000) IEEE Bull. Data Eng. , vol.23 , Issue.4
    • Rahm, E.1    Do, H.H.2
  • 103
    • 84890758494
    • The directed closure process in hybrid social-information networks, with an analysis of link formation on twitter
    • D. Romero, J. Kleinberg, The directed closure process in hybrid social-information networks, with an analysis of link formation on twitter, in: Proc. 4th International Conference on Weblogs and Social Media, 2010.
    • (2010) Proc. 4th International Conference on Weblogs and Social Media
    • Romero, D.1    Kleinberg, J.2
  • 104
    • 0002763572
    • Building light-weight wrappers for legacy web data-sources using w4f
    • Morgan Kaufmann Publishers Inc. San Francisco, CA, USA
    • A. Sahuguet, and F. Azavant Building light-weight wrappers for legacy web data-sources using w4f Proc. 25th International Conference on Very Large Data Bases 1999 Morgan Kaufmann Publishers Inc. San Francisco, CA, USA 738 741
    • (1999) Proc. 25th International Conference on Very Large Data Bases , pp. 738-741
    • Sahuguet, A.1    Azavant, F.2
  • 105
    • 84868288681
    • Information extraction
    • S. Sarawagi Information extraction Found. Trends Databases 1 3 2008 261 377
    • (2008) Found. Trends Databases , vol.1 , Issue.3 , pp. 261-377
    • Sarawagi, S.1
  • 107
    • 0001122858
    • The tree-to-tree editing problem
    • S. Selkow The tree-to-tree editing problem Inform. Process. Lett. 6 6 1977 184 186
    • (1977) Inform. Process. Lett. , vol.6 , Issue.6 , pp. 184-186
    • Selkow, S.1
  • 108
    • 79960323450
    • Tags as bridges between domains: Improving recommendation with tag-induced cross-domain collaborative filtering
    • Lecture Notes in Computer Science Springer Girona, Spain
    • Y. Shi, M. Larson, and A. Hanjalic Tags as bridges between domains: improving recommendation with tag-induced cross-domain collaborative filtering Proc. International Conference on User Modeling, Adaption and Personalization Lecture Notes in Computer Science 2011 Springer Girona, Spain 305 316
    • (2011) Proc. International Conference on User Modeling, Adaption and Personalization , pp. 305-316
    • Shi, Y.1    Larson, M.2    Hanjalic, A.3
  • 109
    • 0032624184
    • Learning information extraction rules for semi-structured and free text
    • S. Soderland Learning information extraction rules for semi-structured and free text Machine Learn. 34 1 1999 233 272
    • (1999) Machine Learn. , vol.34 , Issue.1 , pp. 233-272
    • Soderland, S.1
  • 117
    • 0033908281
    • Users' interaction with world wide web resources: An exploratory study using a holistic approach
    • P. Wang, W. Hawk, and C. Tenopir Users' interaction with world wide web resources: an exploratory study using a holistic approach Inf. Process. Manage. 36 2000 229 251
    • (2000) Inf. Process. Manage. , vol.36 , pp. 229-251
    • Wang, P.1    Hawk, W.2    Tenopir, C.3
  • 120
    • 84860860067
    • The state of record linkage and current research problems
    • US Census Bureau
    • W. Winkler, The state of record linkage and current research problems, in: Statistical Research Division, US Census Bureau, 1999.
    • (1999) Statistical Research Division
    • Winkler, W.1
  • 122
    • 0026185673
    • Identifying syntactic differences between two programs
    • W. Yang Identifying syntactic differences between two programs Softw. - Pract. Exp. 21 7 1991 739 755
    • (1991) Softw. - Pract. Exp. , vol.21 , Issue.7 , pp. 739-755
    • Yang, W.1
  • 124
  • 125
    • 0038590968
    • Competitive intelligence through data mining public sources
    • Wiley New York, NY
    • A. Zanasi Competitive intelligence through data mining public sources Competitive Intelligence Review vol. 9 1998 Wiley New York, NY 44 54
    • (1998) Competitive Intelligence Review , vol.9 , pp. 44-54
    • Zanasi, A.1
  • 127
    • 33750797710
    • Structured data extraction from the web based on partial tree alignment
    • Y. Zhai, and B. Liu Structured data extraction from the web based on partial tree alignment IEEE Trans. Knowl. Data Eng. 18 12 2006 1614 1628
    • (2006) IEEE Trans. Knowl. Data Eng. , vol.18 , Issue.12 , pp. 1614-1628
    • Zhai, Y.1    Liu, B.2
  • 128
    • 0000307499
    • On the editing distance between unordered labeled trees
    • K. Zhang, R. Statman, and D. Shasha On the editing distance between unordered labeled trees Inform. Process. Lett. 42 3 1992 133 139
    • (1992) Inform. Process. Lett. , vol.42 , Issue.3 , pp. 133-139
    • Zhang, K.1    Statman, R.2    Shasha, D.3


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.