2007, Pages 141-150

Detecting near-duplicates for web crawling

Author keywords

Fingerprint; Hamming distance; Near duplicate; Search; Similarity; Sketch; Web crawl; Web document

Indexed keywords

ALGORITHMIC TECHNIQUES; WEB CRAWL; WEB DOCUMENT;

EID: 35348911985     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1145/1242572.1242592     Document Type: Conference Paper
Times cited: 567

References (53)
  • 2
  • 5
  • 6
    • 0033297070
    • Mirror, mirror on the Web: A study of host pairs with replicated content
    • K. Bharat and A. Broder. Mirror, mirror on the Web: A study of host pairs with replicated content. In Proc. 8th International Conference on World Wide Web (WWW 1999), pages 1579-1590, 1999.
    • (1999) Proc. 8th International Conference on World Wide Web , pp. 1579-1590
    • Bharat, K.1    Broder, A.2
  • 9
    • 0038589165
    • The anatomy of a large-scale hypertextual Web search engine
    • S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7): 107-117, 1998.
    • (1998) Computer Networks and ISDN Systems , vol.30 , Issue.1-7 , pp. 107-117
    • Brin, S.1    Page, L.2
  • 11
    • 0034227695
    • Improved bounds for dictionary look-up with one error
    • G. S. Brodal and S. Venkatesh. Improved bounds for dictionary look-up with one error. Information Processing Letters, 75(1-2):57-59, 2000.
    • (2000) Information Processing Letters , vol.75 , Issue.1-2 , pp. 57-59
    • Brodal, G.S.1    Venkatesh, S.2
  • 12
    • 35348864078
    • A. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences, 1998.
  • 22
    • 8644227073
    • Constructing a text corpus for inexact duplicate detection
    • J. G. Conrad and C. P. Schriber. Constructing a text corpus for inexact duplicate detection. In SIGIR 2004, pages 582-583, July 2004.
    • (2004) SIGIR 2004 , pp. 582-583
    • Conrad, J.G.1    Schriber, C.P.2
  • 25
    • 0033293618
    • Finding related pages in the World Wide Web
    • J. Dean and M. Henzinger. Finding related pages in the World Wide Web. Computer Networks, 31(11-16):1467-1479, 1999.
    • (1999) Computer Networks , vol.31 , Issue.11-16 , pp. 1467-1479
    • Dean, J.1    Henzinger, M.2
  • 30
    • 0040152802
    • Efficient and tunable similar set retrieval
    • A. Gionis, D. Gunopulos, and N. Koudas. Efficient and tunable similar set retrieval. In Proc. SIGMOD 2001, pages 247-258, 2001.
    • (2001) Proc. SIGMOD 2001 , pp. 247-258
    • Gionis, A.1    Gunopulos, D.2    Koudas, N.3
  • 32
    • 13844267502
    • Efficient phrase-based document indexing for web document clustering
    • K. M. Hammouda and M. S. Kamel. Efficient phrase-based document indexing for web document clustering. IEEE Transactions on Knowledge and Data Engineering, 16(10):1279-1296, Aug. 2004.
    • (2004) IEEE Transactions on Knowledge and Data Engineering , vol.16 , Issue.10 , pp. 1279-1296
    • Hammouda, K.M.1    Kamel, M.S.2
  • 35
    • 33750296887
    • Finding near-duplicate web pages: A large-scale evaluation of algorithms
    • M. R. Henzinger. Finding near-duplicate web pages: A large-scale evaluation of algorithms. In SIGIR 2006, pages 284-291, 2006.
    • (2006) SIGIR 2006 , pp. 284-291
    • Henzinger, M.R.1
  • 37
    • 84938015047
    • A method for the construction of minimum-redundancy codes
    • D. A. Huffman. A method for the construction of minimum-redundancy codes. In Proc. Institute of Radio Engineering, volume 40, pages 1098-1102, Sept. 1952.
    • (1952) Proc. Institute of Radio Engineering , vol.40 , pp. 1098-1102
    • Huffman, D.A.1
  • 38
    • 2442561063
    • S. Joshi, N. Agrawal, R. Krishnapuram, and S. Negi. A bag of paths model for measuring structural similarity in Web documents. In Proc. 9th ACM Intl. Conf. on Knowledge Discovery and Data Mining (SIGKDD 2003), pages 577-582, 2003.
  • 39
    • 4243148480
    • Authoritative sources in a hyperlinked environment
    • J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, Sept. 1999.
    • (1999) Journal of the ACM , vol.46 , Issue.5 , pp. 604-632
    • Kleinberg, J.M.1
  • 40
    • 12244261882
    • Improved robustness of signature-based near-replica detection via lexicon randomization
    • A. Kolcz, A. Chowdhury, and J. Alspector. Improved robustness of signature-based near-replica detection via lexicon randomization. In SIGKDD 2004, pages 605-610, Aug. 2004.
    • (2004) SIGKDD 2004 , pp. 605-610
    • Kolcz, A.1    Chowdhury, A.2    Alspector, J.3
  • 42
    • 85043988965
    • Finding similar files in a large file system
    • U. Manber. Finding similar files in a large file system. In Proc. 1994 USENIX Conference, pages 1-10, Jan. 1994.
    • (1994) Proc. 1994 USENIX Conference , pp. 1-10
    • Manber, U.1
  • 43
    • 0034794539
    • F. Menczer, G. Pant, P. Srinivasan, and M. E. Ruiz. Evaluating topic-driven web crawlers. In Proc. 24th Annual International ACM SIGIR Conference On Research and Development in Information Retrieval, pages 241-249, 2001.
  • 46
    • 33745753308
    • User-centric web crawling
    • S. Pandey and C. Olston. User-centric web crawling. In Proc. WWW 2005, pages 401-411, 2005.
    • (2005) Proc. WWW 2005 , pp. 401-411
    • Pandey, S.1    Olston, C.2
  • 47
    • 35348920411
    • W. Pugh and M. R. Henzinger. Detecting duplicate and near-duplicate files. United States Patent 6,658,423, granted on Dec 2, 2003, 2003.
  • 49
    • 0003676885
    • Fingerprinting by random polynomials
    • M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
    • (1981) Technical Report
    • Rabin, M.O.1
  • 50
    • 1142267351
    • Winnowing: Local algorithms for document fingerprinting
    • S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: Local algorithms for document fingerprinting. In Proc. SIGMOD 2003, pages 76-85, June 2003.
    • (2003) Proc. SIGMOD 2003 , pp. 76-85
    • Schleimer, S.1    Wilkerson, D.S.2    Aiken, A.3
  • 53
    • 0012726646
    • Dictionary look-up with one error
    • A. C. Yao and F. F. Yao. Dictionary look-up with one error. J of Algorithms, 25(1):194-202, 1997.
    • (1997) J of Algorithms , vol.25 , Issue.1 , pp. 194-202
    • Yao, A.C.1    Yao, F.F.2


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.