2008, Pages 563-570

SpotSigs: Robust and efficient near duplicate detection in large web collections

Author keywords

High dimensional similarity search; Inverted index pruning; Optimal partitioning; Stopword signatures

Indexed keywords

INFORMATION RETRIEVAL; INFORMATION RETRIEVAL SYSTEMS; INFORMATION SERVICES; PATTERN MATCHING; RESEARCH AND DEVELOPMENT MANAGEMENT; WORLD WIDE WEB

EID: 57349131623     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1145/1390334.1390431     Document Type: Conference Paper
Times cited: 147

References (30)
  • 1
    • 38749118638
    • Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
    • A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, p. 459-468, 2006.
    • (2006) FOCS , pp. 459-468
    • Andoni, A.1    Indyk, P.2
  • 3
    • 38149034062
    • LSH forest: Self-tuning indexes for similarity search
    • M. Bawa, T. Condie, and P. Ganesan. LSH forest: self-tuning indexes for similarity search. In WWW, p. 651-660, 2005.
    • (2005) WWW , pp. 651-660
    • Bawa, M.1    Condie, T.2    Ganesan, P.3
  • 4
    • 84976810280
    • Copy detection mechanisms for digital documents
    • S. Brin, J. Davis, and H. García-Molina. Copy detection mechanisms for digital documents. In SIGMOD, p. 398-409, 1995.
    • (1995) SIGMOD , pp. 398-409
    • Brin, S.1    Davis, J.2    García-Molina, H.3
  • 5
    • 79956075292
    • Identifying and filtering near-duplicate documents
    • A. Z. Broder. Identifying and filtering near-duplicate documents. In COM, p. 1-10, 2000.
    • (2000) COM , pp. 1-10
    • Broder, A.Z.1
  • 6
    • 57349121454
    • A derandomization using min-wise independent permutations
    • A. Z. Broder, M. Charikar, and M. Mitzenmacher. A derandomization using min-wise independent permutations. J. Discrete Algorithms, 1(1):11-20, 2003.
    • (2003) J. Discrete Algorithms , vol.1 , Issue.1 , pp. 11-20
    • Broder, A.Z.1    Charikar, M.2    Mitzenmacher, M.3
  • 8
    • 0002705495
    • Automatic retrieval with locality information using SMART
    • C. Buckley, G. Salton, and J. Allan. Automatic retrieval with locality information using SMART. In TREC, p. 59-72, 1992.
    • (1992) TREC , pp. 59-72
    • Buckley, C.1    Salton, G.2    Allan, J.3
  • 9
    • 34547631801
    • A document-centric approach to static index pruning in text retrieval systems
    • S. Büttcher and C. L. A. Clarke. A document-centric approach to static index pruning in text retrieval systems. In CIKM, p. 182-189, 2006.
    • (2006) CIKM , pp. 182-189
    • Büttcher, S.1    Clarke, C.L.A.2
  • 10
    • 0036040277
    • Similarity estimation techniques from rounding algorithms
    • M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, p. 380-388, 2002.
    • (2002) STOC , pp. 380-388
    • Charikar, M.S.1
  • 12
    • 0013206133
    • Collection statistics for fast duplicate document detection
    • A. Chowdhury, O. Frieder, D. Grossman, and M. C. McCabe. Collection statistics for fast duplicate document detection. ACM Trans. Inf. Syst., 20(2):171-191, 2002.
    • (2002) ACM Trans. Inf. Syst , vol.20 , Issue.2 , pp. 171-191
    • Chowdhury, A.1    Frieder, O.2    Grossman, D.3    McCabe, M.C.4
  • 14
    • 12244271239
    • Online duplicate document detection: Signature reliability in a dynamic retrieval environment
    • J. G. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: signature reliability in a dynamic retrieval environment. In CIKM, p. 443-452, 2003.
    • (2003) CIKM , pp. 443-452
    • Conrad, J.G.1    Guo, X.S.2    Schriber, C.P.3
  • 15
    • 15044355327
    • Similarity search in high dimensions via hashing
    • A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, p. 518-529, 1999.
    • (1999) VLDB , pp. 518-529
    • Gionis, A.1    Indyk, P.2    Motwani, R.3
  • 16
    • 33750296887
    • Finding near-duplicate Web pages: A large-scale evaluation of algorithms
    • M. Henzinger. Finding near-duplicate Web pages: a large-scale evaluation of algorithms. In SIGIR, p. 284-291, 2006.
    • (2006) SIGIR , pp. 284-291
    • Henzinger, M.1
  • 17
    • 0037319544
    • Methods for identifying versioned and plagiarized documents
    • T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarized documents. JASIST, 54(3):203-215, 2003.
    • (2003) JASIST , vol.54 , Issue.3 , pp. 203-215
    • Hoad, T.C.1    Zobel, J.2
  • 18
    • 0344612511
    • A small approximately min-wise independent family of hash functions
    • P. Indyk. A small approximately min-wise independent family of hash functions. J. Algorithms, 38(1):84-90, 2001.
    • (2001) J. Algorithms , vol.38 , Issue.1 , pp. 84-90
    • Indyk, P.1
  • 20
    • 0031644241
    • Approximate nearest neighbors: Towards removing the curse of dimensionality
    • P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, p. 604-613, 1998.
    • (1998) STOC , pp. 604-613
    • Indyk, P.1    Motwani, R.2
  • 21
    • 9444294778
    • From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering
    • D. Klein, S. D. Kamvar, and C. D. Manning. From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In ICML, p. 307-314, 2002.
    • (2002) ICML , pp. 307-314
    • Klein, D.1    Kamvar, S.D.2    Manning, C.D.3
  • 22
    • 12244261882
    • Improved robustness of signature-based near-replica detection via lexicon randomization
    • A. Kolcz, A. Chowdhury, and J. Alspector. Improved robustness of signature-based near-replica detection via lexicon randomization. In KDD, p. 605-610, 2004.
    • (2004) KDD , pp. 605-610
    • Kolcz, A.1    Chowdhury, A.2    Alspector, J.3
  • 23
    • 84955245129
    • Multi-probe LSH: Efficient indexing for high-dimensional similarity search
    • Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In VLDB, p. 950-961, 2007.
    • (2007) VLDB , pp. 950-961
    • Lv, Q.1    Josephson, W.2    Wang, Z.3    Charikar, M.4    Li, K.5
  • 24
    • 85043988965
    • Finding similar files in a large file system
    • U. Manber. Finding similar files in a large file system. In WTEC, p. 2, 1994.
    • (1994) WTEC , pp. 2
    • Manber, U.1
  • 25
    • 35348911985
    • Detecting near-duplicates for Web crawling
    • G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for Web crawling. In WWW, p. 141-150, 2007.
    • (2007) WWW , pp. 141-150
    • Manku, G.S.1    Jain, A.2    Sarma, A.D.3
  • 26
    • 79960290151
    • Web Sociologist's Workbench. http://dbpubs.stanford.edu/~testbed/doc2/WabBase/SGERHighlight.pdf
  • 28
    • 0013454721
    • Finding near-replicas of documents and servers on the Web
    • N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents and servers on the Web. In WebDB, p. 204-212, 1998.
    • (1998) WebDB , pp. 204-212
    • Shivakumar, N.1    Garcia-Molina, H.2
  • 29
    • 85158080410
    • Clustering with instance-level constraints
    • K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In AAAI/IAAI, p. 1097, 2000.
    • (2000) AAAI/IAAI , pp. 1097
    • Wagstaff, K.1    Cardie, C.2
  • 30
    • 33750311279
    • Near-duplicate detection by instance-level constrained clustering
    • H. Yang and J. P. Callan. Near-duplicate detection by instance-level constrained clustering. In SIGIR, p. 421-428, 2006.
    • (2006) SIGIR , pp. 421-428
    • Yang, H.1    Callan, J.P.2


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.