SCOPUS Information Search Platform

ACM SIGIR 2008 - 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Proceedings

2008, Pages 563-570

SpotSigs: Robust and efficient near duplicate detection in large web collections

(3) Theobald, Martin ᵃ; Siddharth, Jonathan ᵃ; Paepcke, Andreas ᵃ

ᵃ Stanford University (United States)

Author keywords

High dimensional similarity search; Inverted index pruning; Optimal partitioning; Stopword signatures

Indexed keywords

INFORMATION RETRIEVAL; INFORMATION RETRIEVAL SYSTEMS; INFORMATION SERVICES; PATTERN MATCHING; RESEARCH AND DEVELOPMENT MANAGEMENT; WORLD WIDE WEB;

DOCUMENT SIGNATURES; DUPLICATE DETECTIONS; EXECUTION TIMES; HIGH-DIMENSIONAL SIMILARITY SEARCH; INVERTED INDEX PRUNING; INVERTED INDICES; MATCHING ALGORITHMS; NEW ALGORITHMS; NEWS ARTICLES; OPTIMAL PARTITIONING; PRECISION AND RECALLS; SELF-TUNING; SENSITIVE HASHING; SIMILARITY SEARCHES; STOPWORD SIGNATURES; WEB ARCHIVES; WEB COLLECTIONS; WEB CRAWLS; WEB PAGES;

DATABASE SYSTEMS;

EID: 57349131623
PISSN: None
EISSN: None
Source Type: Conference Proceeding
DOI: 10.1145/1390334.1390431
Document Type: Conference Paper

Times cited: 147

References (30)

1
- 38749118638
- Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
- A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, p. 459-468, 2006.
- (2006) FOCS , pp. 459-468
- Andoni, A.¹ Indyk, P.²

2
- 0005540823
- Addison-Wesley Longman Publishing Co, Inc, Boston, MA, USA
- R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
- (1999) Modern Information Retrieval
- Baeza-Yates, R.A.¹ Ribeiro-Neto, B.²

3
- 38149034062
- LSH forest: Self-tuning indexes for similarity search
- M. Bawa, T. Condie, and P. Ganesan. LSH forest: self-tuning indexes for similarity search. In WWW, p. 651-660, 2005.
- (2005) WWW , pp. 651-660
- Bawa, M.¹ Condie, T.² Ganesan, P.³

4
- 84976810280
- Copy detection mechanisms for digital documents
- S. Brin, J. Davis, and H. García-Molina. Copy detection mechanisms for digital documents. In SIGMOD, p. 398-409, 1995.
- (1995) SIGMOD , pp. 398-409
- Brin, S.¹ Davis, J.² García-Molina, H.³

5
- 79956075292
- Identifying and filtering near-duplicate documents
- A. Z. Broder. Identifying and filtering near-duplicate documents. In COM, p. 1-10, 2000.
- (2000) COM , pp. 1-10
- Broder, A.Z.¹

6
- 57349121454
- A derandomization using min-wise independent permutations
- A. Z. Broder, M. Charikar, and M. Mitzenmacher. A derandomization using min-wise independent permutations. J. Discrete Algorithms, 1(1):11-20, 2003.
- (2003) J. Discrete Algorithms , vol.1 , Issue.1 , pp. 11-20
- Broder, A.Z.¹ Charikar, M.² Mitzenmacher, M.³

7
- 0010362121
- Syntactic clustering of the Web
- A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web. Computer Networks, 29(8-13):1157-1166, 1997.
- (1997) Computer Networks , vol.29 , Issue.8-13 , pp. 1157-1166
- Broder, A.Z.¹ Glassman, S.C.² Manasse, M.S.³ Zweig, G.⁴

8
- 0002705495
- Automatic retrieval with locality information using SMART
- C. Buckley, G. Salton, and J. Allan. Automatic retrieval with locality information using SMART. In TREC, p. 59-72, 1992.
- (1992) TREC , pp. 59-72
- Buckley, C.¹ Salton, G.² Allan, J.³

9
- 34547631801
- A document-centric approach to static index pruning in text retrieval systems
- S. Büttcher and C. L. A. Clarke. A document-centric approach to static index pruning in text retrieval systems. In CIKM, p. 182-189, 2006.
- (2006) CIKM , pp. 182-189
- Büttcher, S.¹ Clarke, C.L.A.²

10
- 0036040277
- Similarity estimation techniques from rounding algorithms
- M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, p. 380-388, 2002.
- (2002) STOC , pp. 380-388
- Charikar, M.S.¹

11
- 33747096982
- Stanford WebBase components and applications
- J. Cho, H. Garcia-Molina, T. Haveliwala, W. Lam, A. Paepcke, S. Raghavan, and G. Wesley. Stanford WebBase components and applications. ACM Trans. Inter. Tech., 6(2):153-186, 2006.
- (2006) ACM Trans. Inter. Tech , vol.6 , Issue.2 , pp. 153-186
- Cho, J.¹ Garcia-Molina, H.² Haveliwala, T.³ Lam, W.⁴ Paepcke, A.⁵ Raghavan, S.⁶ Wesley, G.⁷

12
- 0013206133
- Collection statistics for fast duplicate document detection
- A. Chowdhury, O. Frieder, D. Grossman, and M. C. McCabe. Collection statistics for fast duplicate document detection. ACM Trans. Inf. Syst., 20(2):171-191, 2002.
- (2002) ACM Trans. Inf. Syst , vol.20 , Issue.2 , pp. 171-191
- Chowdhury, A.¹ Frieder, O.² Grossman, D.³ McCabe, M.C.⁴

13
- 0035051307
- Finding interesting associations without support pruning
- E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. Knowledge and Data Engineering, 13(1):64-78, 2001.
- (2001) Knowledge and Data Engineering , vol.13 , Issue.1 , pp. 64-78
- Cohen, E.¹ Datar, M.² Fujiwara, S.³ Gionis, A.⁴ Indyk, P.⁵ Motwani, R.⁶ Ullman, J.D.⁷ Yang, C.⁸

14
- 12244271239
- Online duplicate document detection: Signature reliability in a dynamic retrieval environment
- J. G. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: signature reliability in a dynamic retrieval environment. In CIKM, p. 443-452, 2003.
- (2003) CIKM , pp. 443-452
- Conrad, J.G.¹ Guo, X.S.² Schriber, C.P.³

15
- 15044355327
- Similarity search in high dimensions via hashing
- A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, p. 518-529, 1999.
- (1999) VLDB , pp. 518-529
- Gionis, A.¹ Indyk, P.² Motwani, R.³

16
- 33750296887
- Finding near-duplicate Web pages: A large-scale evaluation of algorithms
- M. Henzinger. Finding near-duplicate Web pages: a large-scale evaluation of algorithms. In SIGIR, p. 284-291, 2006.
- (2006) SIGIR , pp. 284-291
- Henzinger, M.¹

17
- 0037319544
- Methods for identifying versioned and plagiarized documents
- T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarized documents. JASIST, 54(3):203-215, 2003.
- (2003) JASIST , vol.54 , Issue.3 , pp. 203-215
- Hoad, T.C.¹ Zobel, J.²

18
- 0344612511
- A small approximately min-wise independent family of hash functions
- P. Indyk. A small approximately min-wise independent family of hash functions. J. Algorithms, 38(1):84-90, 2001.
- (2001) J. Algorithms , vol.38 , Issue.1 , pp. 84-90
- Indyk, P.¹

19
- 24644489770
- Nearest neighbors in high-dimensional spaces
- CRC Press
- P. Indyk. Nearest neighbors in high-dimensional spaces. In Handbook of Discrete and Computational Geometry. CRC Press, 2004.
- (2004) Handbook of Discrete and Computational Geometry
- Indyk, P.¹

20
- 0031644241
- Approximate nearest neighbors: Towards removing the curse of dimensionality
- P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, p. 604-613, 1998.
- (1998) STOC , pp. 604-613
- Indyk, P.¹ Motwani, R.²

21
- 9444294778
- From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering
- D. Klein, S. D. Kamvar, and C. D. Manning. From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In ICML, p. 307-314, 2002.
- (2002) ICML , pp. 307-314
- Klein, D.¹ Kamvar, S.D.² Manning, C.D.³

22
- 12244261882
- Improved robustness of signature-based near-replica detection via lexicon randomization
- A. Kolcz, A. Chowdhury, and J. Alspector. Improved robustness of signature-based near-replica detection via lexicon randomization. In KDD, p. 605-610, 2004.
- (2004) KDD , pp. 605-610
- Kolcz, A.¹ Chowdhury, A.² Alspector, J.³

23
- 84955245129
- Multi-probe LSH: Efficient indexing for high-dimensional similarity search
- Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In VLDB, p. 950-961, 2007.
- (2007) VLDB , pp. 950-961
- Lv, Q.¹ Josephson, W.² Wang, Z.³ Charikar, M.⁴ Li, K.⁵

24
- 85043988965
- Finding similar files in a large file system
- U. Manber. Finding similar files in a large file system. In WTEC, p. 2, 1994.
- (1994) WTEC , pp. 2
- Manber, U.¹

25
- 35348911985
- Detecting near-duplicates for Web crawling
- G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for Web crawling. In WWW, p. 141-150, 2007.
- (2007) WWW , pp. 141-150
- Manku, G.S.¹ Jain, A.² Sarma, A.D.³

26
- 79960290151
- Web Sociologist's Workbench: http://dbpubs.stanford.edu/~testbed/doc2/WabBase/SGERHighlight.pdf

27
- 57349168274
- N. Shivakumar and H. García-Molina. SCAM: A copy detection mechanism for digital documents. In DL, 1995.
- (1995) SCAM: A copy detection mechanism for digital documents , vol.550
- Shivakumar, N.¹ García-Molina, H.²

28
- 0013454721
- Finding near-replicas of documents and servers on the Web
- N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents and servers on the Web. In WebDB, p. 204-212, 1998.
- (1998) WebDB , pp. 204-212
- Shivakumar, N.¹ Garcia-Molina, H.²

29
- 85158080410
- Clustering with instance-level constraints
- K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In AAAI/IAAI, p. 1097, 2000.
- (2000) AAAI/IAAI , pp. 1097
- Wagstaff, K.¹ Cardie, C.²

30
- 33750311279
- Near-duplicate detection by instance-level constrained clustering
- H. Yang and J. P. Callan. Near-duplicate detection by instance-level constrained clustering. In SIGIR, p. 421-428, 2006.
- (2006) SIGIR , pp. 421-428
- Yang, H.¹ Callan, J.P.²

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.