2007, Pages 111-120

Do not crawl in the DUST: Different URLs with similar text

Author keywords

Anti-aliasing; Crawling; Duplicate detection; Search engines; URL normalization

Indexed keywords

INFORMATION ANALYSIS; PROBLEM SOLVING; SEARCH ENGINES; SERVERS; STATISTICAL METHODS; WEBSITES

EID: 35348921241     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1145/1242572.1242588     Document Type: Conference Paper
Times cited: 41

References (22)
  • 1
    • 0001882616
    • Fast algorithms for mining association rules
    • R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 20th VLDB, pages 487-499, 1994.
    • (1994) Proc. 20th VLDB , pp. 487-499
    • Agrawal, R.1    Srikant, R.2
  • 2
    • 35348814715
    • Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the DUST: different URLs with similar text. Technical Report CCIT Report #601, Dept. Electrical Engineering, Technion, 2006.
  • 3
    • 0033297070
    • Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content
    • K. Bharat and A. Z. Broder. Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content. Computer Networks, 31(11-16):1579-1590, 1999.
    • (1999) Computer Networks , vol.31 , Issue.11-16 , pp. 1579-1590
    • Bharat, K.1    Broder, A.Z.2
  • 4
    • 0742329413
    • A comparison of techniques to find mirrored hosts on the WWW
    • K. Bharat, A. Z. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the WWW. IEEE Data Engin. Bull., 23(4):21-26, 2000.
    • (2000) IEEE Data Engin. Bull , vol.23 , Issue.4 , pp. 21-26
    • Bharat, K.1    Broder, A.Z.2    Dean, J.3    Henzinger, M.R.4
  • 6
    • 84976810280
    • Copy Detection Mechanisms for Digital Documents
    • S. Brin, J. Davis, and H. Garcia-Molina. Copy Detection Mechanisms for Digital Documents. In Proc. 14th SIGMOD, pages 398-409, 1995.
    • (1995) Proc. 14th SIGMOD , pp. 398-409
    • Brin, S.1    Davis, J.2    Garcia-Molina, H.3
  • 11
    • 0030394823
    • H. Garcia-Molina, L. Gravano, and N. Shivakumar. dSCAM: Finding document copies across multiple databases. In Proc. 4th PDIS, pages 68-79, 1996.
  • 13
    • 35348817110
    • Google Inc
    • Google Inc. Google sitemaps. http://sitemaps.google.com.
    • Google sitemaps
  • 15
    • 0037319544
    • Methods for identifying versioned and plagiarized documents
    • T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarized documents. J. Amer. Soc. Infor. Sci. Tech., 54(3):203-215, 2003.
    • (2003) J. Amer. Soc. Infor. Sci. Tech , vol.54 , Issue.3 , pp. 203-215
    • Hoad, T.C.1    Zobel, J.2
  • 16
    • 35348902310
    • Using bloom filters to refine web search results
    • N. Jain, M. Dahlin, and R. Tewari. Using bloom filters to refine web search results. In Proc. 7th WebDB, pages 25-30, 2005.
    • (2005) Proc. 7th WebDB , pp. 25-30
    • Jain, N.1    Dahlin, M.2    Tewari, R.3
  • 17
    • 38949210729
    • Aliasing on the world wide web: Prevalence and performance implications
    • T. Kelly and J. C. Mogul. Aliasing on the world wide web: prevalence and performance implications. In Proc. 11th WWW, pages 281-292, 2002.
    • (2002) Proc. 11th WWW , pp. 281-292
    • Kelly, T.1    Mogul, J.C.2
  • 18
    • 33745966135
    • Reliable evaluations of URL normalization
    • S. J. Kim, H. S. Jeong, and S. H. Lee. Reliable evaluations of URL normalization. In Proc. 4th ICCSA, pages 609-617, 2006.
    • (2006) Proc. 4th ICCSA , pp. 609-617
    • Kim, S.J.1    Jeong, H.S.2    Lee, S.H.3
  • 20
    • 34247390104
    • Evaluation of crawling policies for a web-repository crawler
    • F. McCown and M. L. Nelson. Evaluation of crawling policies for a web-repository crawler. In Proc. 17th HYPERTEXT, pages 157-168, 2006.
    • (2006) Proc. 17th HYPERTEXT , pp. 157-168
    • McCown, F.1    Nelson, M.L.2
  • 21
    • 34250618783
    • Do not crawl in the DUST: Different URLs with similar text
    • U. Schonfeld, Z. Bar-Yossef, and I. Keidar. Do not crawl in the DUST: different URLs with similar text. In Proc. 15th WWW, pages 1015-1016, 2006.
    • (2006) Proc. 15th WWW , pp. 1015-1016
    • Schonfeld, U.1    Bar-Yossef, Z.2    Keidar, I.3
  • 22
    • 84956971810
    • Finding Near-Replicas of Documents and Servers on the Web
    • N. Shivakumar and H. Garcia-Molina. Finding Near-Replicas of Documents and Servers on the Web. In Proc. 1st WebDB, pages 204-212, 1998.
    • (1998) Proc. 1st WebDB , pp. 204-212
    • Shivakumar, N.1    Garcia-Molina, H.2


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.