Volume , Issue , 2010, Pages 611-620

A pattern tree-based approach to learning URL normalization rules

Author keywords

url deduplication; url normalization; url pattern

Indexed keywords

CANONICAL FORM; DEDUPLICATION; EFFICIENT ALGORITHM; ENGINEERING PERSPECTIVE; GLOBAL PERSPECTIVE; INDEX COMPRESSION; LEARNING PROCESS; OFFLINE; PATTERN TREES; REWRITE RULES; STATISTICAL INFORMATION; TRAINING DATA; TRAINING SAMPLE; TRAINING SETS; URL NORMALIZATION;

EID: 77954589296     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1145/1772690.1772753     Document Type: Conference Paper
Times cited : (25)

References (18)
  • 3
    • 79953110433
    • URL Normalization. http://en.wikipedia.org/wiki/URL_normalization.
    • URL Normalization
  • 5
    • 37549058056
    • Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
    • A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117-122, 2008.
    • (2008) Commun. ACM , vol.51 , Issue.1 , pp. 117-122
    • Andoni, A.1    Indyk, P.2
  • 6
    • 84961379302
    • Finding patterns common to a set of strings
    • D. Angluin. Finding patterns common to a set of strings. In STOC, pages 130-141, 1979.
    • (1979) STOC , pp. 130-141
    • Angluin, D.1
  • 7
    • 35348921241
    • Do not crawl in the dust: Different URLs with similar text
    • Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the dust: different URLs with similar text. In WWW, pages 111-120, 2007.
    • (2007) WWW , pp. 111-120
    • Bar-Yossef, Z.1    Keidar, I.2    Schonfeld, U.3
  • 8
    • 0038589165
    • The anatomy of a large-scale hypertextual web search engine
    • S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1-7):107-117, 1998.
    • (1998) Computer Networks , vol.30 , Issue.1-7 , pp. 107-117
    • Brin, S.1    Page, L.2
  • 10
    • 34248158280
    • A cost-effective method for detecting web site replicas on search engine databases
    • A. C. Carvalho, E. S. Moura, A. S. Silva, K. Berlt, and A. Bezerra. A cost-effective method for detecting web site replicas on search engine databases. Data Knowl. Eng., 62(3):421-437, 2007.
    • (2007) Data Knowl. Eng. , vol.62 , Issue.3 , pp. 421-437
    • Carvalho, A.C.1    Moura, E.S.2    Silva, A.S.3    Berlt, K.4    Bezerra, A.5
  • 11
    • 0036040277
    • Similarity estimation techniques from rounding algorithms
    • M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. STOC, pages 380-388, 2002.
    • (2002) Proc. STOC , pp. 380-388
    • Charikar, M.1
  • 12
    • 0013206133
    • Collection statistics for fast duplicate document detection
    • A. Chowdhury, O. Frieder, D. A. Grossman, and M. C. McCabe. Collection statistics for fast duplicate document detection. TOIS, 20(2):171-191, 2002.
    • (2002) TOIS , vol.20 , Issue.2 , pp. 171-191
    • Chowdhury, A.1    Frieder, O.2    Grossman, D.A.3    McCabe, M.C.4
  • 13
    • 65449167674
    • De-duping URLs via rewrite rules
    • A. Dasgupta, R. Kumar, and A. Sasturkar. De-duping URLs via rewrite rules. In KDD, pages 186-194, 2008.
    • (2008) KDD , pp. 186-194
    • Dasgupta, A.1    Kumar, R.2    Sasturkar, A.3
  • 14
    • 33750296887
    • Finding near-duplicate web pages: A large-scale evaluation of algorithms
    • M. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, pages 284-291, 2006.
    • (2006) SIGIR , pp. 284-291
    • Henzinger, M.1
  • 15
    • 35348911985
    • Detecting near-duplicates for web crawling
    • G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proc. WWW, pages 141-150, 2007.
    • (2007) Proc. WWW , pp. 141-150
    • Manku, G.S.1    Jain, A.2    Sarma, A.D.3
  • 16
    • 85017291425
    • Systems and methods for inferring uniform resource locator (URL) normalization rules
    • US Patent Application Publication, 2006/0218143, Microsoft Corporation
    • M. Najork. Systems and methods for inferring uniform resource locator (URL) normalization rules. US Patent Application Publication, 2006/0218143, Microsoft Corporation, 2006.
    • (2006)
    • Najork, M.1
  • 17
    • 0003676885
    • Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University
    • M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
    • (1981) Fingerprinting by Random Polynomials
    • Rabin, M.O.1
  • 18
    • 66249113620
    • Efficient similarity joins for near duplicate detection
    • C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In Proc. WWW, pages 131-140, 2008.
    • (2008) Proc. WWW , pp. 131-140
    • Xiao, C.1    Wang, W.2    Lin, X.3    Yu, J.X.4


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.