[1] Internet Archive. http://www.archive.org, 2010.
[2] Project Gutenberg. http://www.gutenberg.org, 2010.
[3] Y. Bernstein and J. Zobel. A scalable system for identifying co-derivative documents. In SPIRE, pages 55-67, 2004.
[4] S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In ACM SIGMOD, pages 398-409, 1995.
[5] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157-1166, 1997.
[6] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In 34th Ann. ACM Symp. on Theory of Computing, pages 380-388, 2002.
[7] A. Chowdhury, O. Frieder, D. A. Grossman, and M. C. McCabe. Collection statistics for fast duplicate document detection. ACM Trans. Inf. Syst., 20(2):171-191, 2002. DOI 10.1145/506309.506311
[9] J. Cooper, A. Coden, and E. Brown. Detecting similar documents using salient terms. In CIKM, pages 245-251, 2002.
[10] S. Deorowicz. Solving longest common subsequence and related problems on graphical processing units. Softw. Pract. Exper., 40:673-700, July 2010.
[11] M. Errami, Z. Sun, A. C. George, T. C. Long, M. A. Skinner, J. D. Wren, and H. R. Garner. Identifying duplicate content using statistically improbable phrases. Bioinformatics, 26(11):1453-1457, 2010.
[12] S. Feng and R. Manmatha. A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books. In JCDL, pages 109-118, 2006.
[13] H. Hajishirzi, W.-t. Yih, and A. Kolcz. Adaptive near-duplicate detection via similarity learning. In SIGIR'10, pages 419-426, 2010.
[15] M. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In ACM SIGIR, pages 284-291, 2006.
[16] T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarized documents. JASIST, 54(3):203-215, 2003.
[18] J. W. Hunt and T. G. Szymanski. A fast algorithm for computing longest common subsequences. Commun. ACM, 20:350-353, May 1977.
[19] D. Lin. An information-theoretic definition of similarity. In ICML '98, pages 296-304, 1998.
[20] U. Manber. Finding similar files in a large file system. In USENIX Winter 1994 Tech. Conf., pages 1-10, 1994.
[21] D. Mimno, G. Crane, and A. Jones. Hierarchical catalog records: Implementing a FRBR catalog. D-Lib Magazine, volume 11, Oct 2005. http://www.dlib.org/dlib/october05/crane/10crane.html
[23] J. Seo and W. B. Croft. Local text reuse detection. In ACM SIGIR, pages 571-578, 2008.
[26] G. Stewart, G. Crane, and A. Babeu. A new generation of textual corpora: mining corpora from very large collections. In JCDL, pages 356-365, 2007.