[3] URL Normalization. http://en.wikipedia.org/wiki/URL_normalization.
[4] A. Agarwal, H. S. Koppula, K. P. Leela, K. P. Chitrapura, S. Garg, and P. K. GM. URL normalization for de-duplication of web pages. In Proc. CIKM, pages 1987-1990, 2009.
[5] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117-122, 2008.
[6] D. Angluin. Finding patterns common to a set of strings. In Proc. STOC, pages 130-141, 1979.
[7] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the dust: different URLs with similar text. In Proc. WWW, pages 111-120, 2007.
[8] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1-7):107-117, 1998.
[9] A. Broder, S. C. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the Web. Computer Networks, 29(8-13):1157-1166, 1997.
[10] A. C. Carvalho, E. S. Moura, A. S. Silva, K. Berlt, and A. Bezerra. A cost-effective method for detecting web site replicas on search engine databases. Data Knowl. Eng., 62(3):421-437, 2007.
[11] M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. STOC, pages 380-388, 2002.
[12] A. Chowdhury, O. Frieder, D. A. Grossman, and M. C. McCabe. Collection statistics for fast duplicate document detection. TOIS, 20(2):171-191, 2002.
[13] A. Dasgupta, R. Kumar, and A. Sasturkar. De-duping URLs via rewrite rules. In Proc. KDD, pages 186-194, 2008.
[14] M. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proc. SIGIR, pages 284-291, 2006.
[15] G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proc. WWW, pages 141-150, 2007.
[16] M. Najork. Systems and methods for inferring uniform resource locator (URL) normalization rules. US Patent Application Publication, 2006/0218143, Microsoft Corporation, 2006.
[17] M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
[18] C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In Proc. WWW, pages 131-140, 2008.