SCOPUS Information Retrieval Platform

Studies in Computational Intelligence

Volume 375, 2011, Pages 385-412

Data de-duplication: A review

(4) Costa, Gianni ᵃ; Cuzzocrea, Alfredo ᵃ; Manco, Giuseppe ᵃ; Ortale, Riccardo ᵃ

ᵃ ICAR CNR (Italy)

Author keywords

[No Author keywords available]

Indexed keywords

EID: 80455127017 PISSN: 1860949X EISSN: None Source Type: Book Series
DOI: 10.1007/978-3-642-22913-8_18 Document Type: Review

Times cited : (7)

References (99)

1
- 12244298488
- In: Proc. of ACM SIGKDD Int. Conf. On Knowledge Discovery and Data Mining Seattle Washington USA
- Agichtein, E., Ganti, V.: Mining Reference Tables for Automatic Text Segmentation. Proc. of ACM SIGKDD Int. Conf. On Knowledge Discovery and Data Mining, Seattle, Washington, USA, pp. 20-29 (2004)
- (2004) Mining Reference Tables for Automatic Text Segmentation , pp. 20-29
- Agichtein, E.¹ Ganti, V.²

2
- 2342576574
- Proc. of Int. Conf. on Very Large Databases, Hong Kong China
- Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating Fuzzy Duplicates in Data Warehouses. Proc. of Int. Conf. on Very Large Databases, Hong Kong, China, pp. 586-597 (2002)
- (2002) Eliminating Fuzzy Duplicates in Data Warehouses , pp. 586-597
- Ananthakrishna, R.¹ Chaudhuri, S.² Ganti, V.³

3
- 38749118638
- Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
- Las Vegas Nevada USA
- Andoni, A., Indyk, P.: Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. Proc. of IEEE Symposium on Foundations of Computer Science, Las Vegas, Nevada, USA, pp. 459-468 (2006)
- (2006) Proc. of IEEE Symposium on Foundations of Computer Science , pp. 459-468
- Andoni, A.¹ Indyk, P.²

4
- 37549058056
- Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
- Andoni, A., Indyk, P.: Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. Communications of the ACM 51(1), 117-122 (2008)
- (2008) Communications of the ACM , vol.51 , Issue.1 , pp. 117-122
- Andoni, A.¹ Indyk, P.²

5
- 85104914015
- Seoul Korea
- Arasu, A., Ganti, V., Kaushik, R.: Efficient Exact Set-Similarity Joins. Proc. of Int. Conf. on Very Large Databases, Seoul, Korea, pp. 918-929 (2006)
- (2006) Efficient Exact Set-Similarity Joins. Proc. of Int. Conf. on Very Large Databases , pp. 918-929
- Arasu, A.¹ Ganti, V.² Kaushik, R.³

6
- 0001592068
- Automatic Linkage of Vital Records
- Axford, S.J., Newcombe, H.B., Kennedy, J.M., James, A.P.: Automatic Linkage of Vital Records. Science 130, 954-959 (1959)
- (1959) Science , vol.130 , pp. 954-959
- Axford, S.J.¹ Newcombe, H.B.² Kennedy, J.M.³ James, A.P.⁴

7
- 27944439775
- Modern information retrieval
- Addison-Wesley
- Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
- (1999) Reading
- Baeza-Yates, R.¹ Ribeiro-Neto, B.²

8
- 3142665421
- Correlation clustering
- Bansal, N., Blum, A., Chawla, S.: Correlation Clustering. Machine Learning 56(1-3), 89-113 (2004)
- (2004) Machine Learning , vol.56 , Issue.1-3 , pp. 89-113
- Bansal, N.¹ Blum, A.² Chawla, S.³

9
- 80455177968
- Chiba Japan
- Bawa, M., Tyson Condie, S., Ganesan, P.: LSH Forest: Self-Tuning Indexes for Similarity Search. Proc. of Int. Conf. on World Wide Web, Chiba, Japan, pp. 651-660 (2005)
- (2005) LSH Forest: Self-Tuning Indexes for Similarity Search. Proc. of Int. Conf. on World Wide Web , pp. 651-660
- Bawa, M.¹ Tyson Condie, S.² Ganesan, P.³

10
- 35348849154
- Banff Alberta Canada
- Bayardo, R.J., Srikant, R., Ma, Y.: Scaling Up All Pairs Similarity Search. Proc. of Int. Conf. on World Wide Web, Banff, Alberta, Canada, pp. 131-140 (2007)
- (2007) Scaling Up All Pairs Similarity Search. Proc. of Int. Conf. on World Wide Web , pp. 131-140
- Bayardo, R.J.¹ Srikant, R.² Ma, Y.³

11
- 58149472338
- Swoosh: A generic approach to entity resolution
- Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB Journal 18(1), 255-276 (2009)
- (2009) VLDB Journal , vol.18 , Issue.1 , pp. 255-276
- Benjelloun, O.¹ Garcia-Molina, H.² Menestrina, D.³ Su, Q.⁴ Whang, S.E.⁵ Widom, J.⁶

12
- 26444482529
- Differential cryptanalysis mod 2^32 with applications to MD5
- Berson, T.A.: Differential Cryptanalysis Mod 2^32 with Applications to MD5. Proc. of Ann. Conf. on Theory and Applications of Cryptographic Techniques, pp. 71-80 (1992)
- (1992) Proc. of Ann. Conf. on Theory and Applications of Cryptographic Techniques , pp. 71-80
- Berson, T.A.¹

13
- 34248229658
- ACM Trans. Knowl. Discovery from Data
- Bhattacharya, I., Getoor, L.: Collective Entity Resolution in Relational Data. ACM Trans. Knowl. Discovery from Data 1(1), 1-35 (2007)
- (2007) Collective Entity Resolution in Relational Data , vol.1 , Issue.1 , pp. 1-35
- Bhattacharya, I.¹ Getoor, L.²

14
- 33749549918
- Philadelphia, Pennsylvania USA
- Bhattacharya, I., Getoor, L., Licamele, L.: Query-Time Entity Resolution. Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Philadelphia, Pennsylvania, USA, pp. 529-534 (2006)
- (2006) Query-Time Entity Resolution. Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining , pp. 529-534
- Bhattacharya, I.¹ Getoor, L.² Licamele, L.³

15
- 77952372966
- Adaptive duplicate detection using learnable string similarity measures
- Washington DC USA
- Bilenko, M., Mooney, R.J.: Adaptive Duplicate Detection Using Learnable String Similarity Measures. Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Washington, DC, USA, pp. 39-48 (2003)
- (2003) ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining , pp. 39-48
- Bilenko, M.¹ Mooney, R.J.²

16
- 65449165865
- Tech. Rep. TR-CS-07-03 Australian National University Canberra Australia
- Christen, P.: Towards Parameter-free Blocking for Scalable Record Linkage. Tech. Rep. TR-CS-07-03, Australian National University, Canberra, Australia (2007)
- (2007) Towards Parameter-free Blocking for Scalable Record Linkage
- Christen, P.¹

17
- 65449178105
- In: Proc. of ACM Int. Conf. on Knowledge Discovery and Data Mining
- Christen, P.: Febrl - An Open Source Data Cleaning, Deduplication and Record Linkage System with a Graphical User Interface. Proc. of ACM Int. Conf. on Knowledge Discovery and Data Mining, pp. 1065-1068 (2008)
- (2008) Febrl - An Open Source Data Cleaning Deduplication and Record Linkage System with a Graphical User Interface , pp. 1065-1068
- Christen, P.¹

18
- 0034832365
- In: Proc. of ACM SIGMOD Int. Conf. on Management of Data Santa Barbara California USA
- Borkar, V.R., Deshmukh, K., Sarawagi, S.: Automatic Segmentation of Text into Structured Records. Proc. of ACM SIGMOD Int. Conf. on Management of Data, Santa Barbara, California, USA, pp. 175-186 (2001)
- (2001) Automatic Segmentation of Text into Structured Records , pp. 175-186
- Borkar, V.R.¹ Deshmukh, K.² Sarawagi, S.³

19
- 0031620041
- Minwise independent permutations
- USA
- Broder, A., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Minwise Independent Permutations. Proc. of ACM Symposium on Theory of Computing, Dallas, Texas, USA, pp. 327-336 (1998)
- (1998) Proc. of ACM Symposium on Theory of Computing Dallas Texas , pp. 327-336
- Broder, A.¹ Charikar, M.² Frieze, A.M.³ Mitzenmacher, M.⁴

20
- 0010362121
- Santa Clara California USA
- Broder, A., Glassman, S., Manasse, M., Zweig, G.: Syntactic Clustering on the Web. Proc. of Int. Conf. on World Wide Web, Santa Clara, California, USA, pp. 1157-1166 (1997)
- (1997) Syntactic Clustering on the Web. Proc. of Int. Conf. on World Wide Web , pp. 1157-1166
- Broder, A.¹ Glassman, S.² Manasse, M.³ Zweig, G.⁴

21
- 44649181012
- Cesario, E., Folino, F., Locane, A., Manco, G., Ortale, R.: Boosting Text Segmentation Via Progressive Classification. Knowl. and Inf. Syst. 15(3), 285-320 (2008)
- (2008) Boosting Text Segmentation Via Progressive Classification. Knowl. and Inf. Syst. , vol.15 , Issue.3 , pp. 285-320
- Cesario, E.¹ Folino, F.² Locane, A.³ Manco, G.⁴ Ortale, R.⁵

22
- 1142279457
- Proc. of ACM SIGMOD Conf. on Management of Data, San Diego California USA
- Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and Efficient Fuzzy Match for Online Data Cleaning. Proc. of ACM SIGMOD Conf. on Management of Data, San Diego, California, USA, pp. 313-324 (2003)
- (2003) Robust and Efficient Fuzzy Match for Online Data Cleaning , pp. 313-324
- Chaudhuri, S.¹ Ganjam, K.² Ganti, V.³ Motwani, R.⁴

23
- 26444550791
- Tokyo Japan
- Chaudhuri, S., Ganti, V., Motwani, R.: Robust Identification of Fuzzy Duplicates. Proc. of Int. Conf. on Data Engineering, Tokyo, Japan, pp. 865-876 (2005)
- (2005) Robust Identification of Fuzzy Duplicates. Proc. of Int. Conf. on Data Engineering , pp. 865-876
- Chaudhuri, S.¹ Ganti, V.² Motwani, R.³

24
- 0345043999
- Chavez, E., Navarro, G., Baeza-Yates, R., Marroquin, J.L.: Searching in Metric Spaces. ACM Comput. Surv. 33(3), 273-321 (2001)
- (2001) Searching in Metric Spaces. ACM Comput. Surv , vol.33 , Issue.3 , pp. 273-321
- Chavez, E.¹ Navarro, G.² Baeza-Yates, R.³ Marroquin, J.L.⁴

25
- 84993661659
- Athens Greece
- Ciaccia, P., Patella, M., Zezula, P.: M-Tree: An Efficient Access Method for Similarity Search in Metric Spaces. Proc. of Int. Conf. on Very Large Databases, Athens, Greece, pp. 426-435 (1997)
- (1997) M-Tree: An Efficient Access Method for Similarity Search in Metric Spaces. Proc. of Int. Conf. on Very Large Databases , pp. 426-435
- Ciaccia, P.¹ Patella, M.² Zezula, P.³

26
- 80455148347
- Department of Computer Sciences Purdue University
- Cochinwala, M., Dalal, S., Elmagarmid, A.K., Verykios, V.S.: Record Matching: Past, Present and Future. Technical Report, number CSD-TR #01-013. Department of Computer Sciences, Purdue University (2001)
- (2001) Record Matching: Past, Present and Future Technical Report number CSD-TR #01-013
- Cochinwala, M.¹ Dalal, S.² Elmagarmid, A.K.³ Verykios, V.S.⁴

27
- 0035452641
- Efficient data reconciliation
- Cochinwala, M., Kurien, V., Lalk, G., Shasha, D.: Efficient Data Reconciliation. Information Sciences 137(1-4), 1-15 (2001)
- (2001) Information Sciences , vol.137 , Issue.1-4 , pp. 1-15
- Cochinwala, M.¹ Kurien, V.² Lalk, G.³ Shasha, D.⁴

28
- 0000666461
- Data Integration using Similarity Joins and a Word-based Information Representation Language
- Cohen, W.W.: Data Integration using Similarity Joins and a Word-based Information Representation Language. ACM Trans. on Inf. Syst. 18(3), 228-321 (2000)
- (2000) ACM Trans. on Inf. Syst. , vol.18 , Issue.3 , pp. 228-321
- Cohen, W.W.¹

29
- 11144240583
- A comparison of string distance metrics for name-matching tasks
- Acapulco Mexico
- Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A Comparison of String Distance Metrics for Name-Matching Tasks. Proc. of IJCAI Workshop on Information Integration on the Web, Acapulco, Mexico, pp. 73-78 (2003)
- (2003) Proc. of IJCAI Workshop on Information Integration on the Web , pp. 73-78
- Cohen, W.W.¹ Ravikumar, P.² Fienberg, S.E.³

30
- 0242540438
- Learning to match and cluster large high-dimensional data sets for data integration
- Edmonton Alberta Canada
- Cohen, W.W., Richman, J.: Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration. Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, pp. 475-480 (2002)
- (2002) Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining , pp. 475-480
- Cohen, W.W.¹ Richman, J.²

31
- 0028424239
- Improving generalization with active learning
- Cohn, D.A., Atlas, L., Ladner, R.E.: Improving Generalization with Active Learning. Machine Learning 15(2), 201-221 (1994)
- (1994) Machine Learning , vol.15 , Issue.2 , pp. 201-221
- Cohn, D.A.¹ Atlas, L.² Ladner, R.E.³

32
- 76749114248
- An incremental clustering scheme for data deduplication
- Costa, G., Manco, G., Ortale, R.: An Incremental Clustering Scheme for Data Deduplication. Data Min. and Knowl. Discovery 20(1), 152-187 (2010)
- (2010) Data Min. and Knowl. Discovery , vol.20 , Issue.1 , pp. 152-187
- Costa, G.¹ Manco, G.² Ortale, R.³

33
- 80455148350
- Database Group Leipzig
- Database Group Leipzig. Benchmark datasets for entity resolution, http://dbs.uni-leipzig.de/en/research/projects/objectmatching/fever/benchmark datasets for entity resolution
- Benchmark datasets for entity resolution

34
- 80455177960
- Journal of the Royal Statistical Society Series B 39
- Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B 39(1), 1-28 (2001)
- (2001) Maximum Likelihood from Incomplete Data via the EM Algorithm , vol.1 , pp. 1-28
- Dempster, A.P.¹ Laird, N.M.² Rubin, D.B.³

35
- 0033872455
- Data redundancy and duplicate detection in spatial join processing
- Dittrich, J.-P., Seeger, B.: Data Redundancy and Duplicate Detection in Spatial Join Processing. Proc. of IEEE Int. Conf. on Data Engineering, pp. 535-546 (2000)
- (2000) Proc. of IEEE Int. Conf. on Data Engineering , pp. 535-546
- Dittrich, J.-P.¹ Seeger, B.²

36
- 33845667955
- Duplicate record detection: A survey
- Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering 19(1), 1-16 (2007)
- (2007) IEEE Transactions on Knowledge and Data Engineering , vol.19 , Issue.1 , pp. 1-16
- Elmagarmid, A.K.¹ Ipeirotis, P.G.² Verykios, V.S.³

37
- 85170282443
- Portland, Oregon USA
- Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. of Int. Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, USA, pp. 226-231 (1996)
- (1996) A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. of Int. Conf. on Knowledge Discovery and Data Mining , pp. 226-231
- Ester, M.¹ Kriegel, H.P.² Sander, J.³ Xu, X.⁴

38
- 0030285403
- The KDD process for extracting useful knowledge from volumes of data
- Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Widener, T.: The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM 39(11), 27-34 (1996)
- (1996) Communications of the ACM , vol.39 , Issue.11 , pp. 27-34
- Fayyad, U.¹ Piatetsky-Shapiro, G.² Smyth, P.³ Widener, T.⁴

39
- 84947399464
- Fellegi, I.P., Sunter, A.B.: A Theory for Record Linkage. Am. Stat. Assoc. 64, 1183-1210 (1969)
- (1969) A Theory for Record Linkage. Am. Stat. Assoc. , vol.64 , pp. 1183-1210
- Fellegi, I.P.¹ Sunter, A.B.²

40
- 0032665257
- Sydney Australia
- Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A.: Clustering Large Datasets in Arbitrary Metric Spaces. Proc. of Int. Conf. on Data Engineering, Sydney, Australia, pp. 502-511 (1999)
- (1999) Clustering Large Datasets in Arbitrary Metric Spaces Proc. of Int. Conf. on Data Engineering , pp. 502-511
- Ganti, V.¹ Ramakrishnan, R.² Gehrke, J.³ Powell, A.⁴

41
- 84947935707
- Entity resolution: Overview and challenges
- Atzeni, P., Chu, W., Lu, H., Zhou, S., Ling, T.-W. (eds.) Springer, Heidelberg
- Garcia-Molina, H.: Entity resolution: Overview and challenges. In: Atzeni, P., Chu, W., Lu, H., Zhou, S., Ling, T.-W. (eds.) ER 2004. LNCS, vol. 3288, pp. 1-2. Springer, Heidelberg (2004)
- (2004) ER 2004 LNCS , vol.3288 , pp. 1-2
- Garcia-Molina, H.¹

42
- 0003860037
- Markov chain Monte Carlo in practice
- Chapman and Hall
- Gilks, W.R., Richardson, S., Spiegelhalter, D.J.: Markov Chain Monte Carlo in Practice. Chapman and Hall, Boca Raton (1996)
- (1996) Boca Raton
- Gilks, W.R.¹ Richardson, S.² Spiegelhalter, D.J.³

43
- 0001944742
- Proc. of Int. Conf. on Very Large Databases, Edinburgh Scotland
- Gionis, A., Indyk, P., Motwani, R.: Similarity Search in High Dimensions via Hashing. Proc. of Int. Conf. on Very Large Databases, Edinburgh, Scotland, pp. 518-529 (1999)
- (1999) Similarity Search in High Dimensions via Hashing , pp. 518-529
- Gionis, A.¹ Indyk, P.² Motwani, R.³

44
- 65449179112
- Proc. of Australasian Data Mining Conf.
- Goiser, K., Christen, P.: Towards Automated Record Linkage. Proc. of Australasian Data Mining Conf., pp. 23-31 (2006)
- (2006) Towards Automated Record Linkage , pp. 23-31
- Goiser, K.¹ Christen, P.²

45
- 0038119396
- Techniques of cluster algorithms in data mining
- Grabmeier, J., Rudolph, A.: Techniques of Cluster Algorithms in Data Mining. Data Min. and Knowl. Discovery 6(4), 303-360 (2002)
- (2002) Data Min. and Knowl. Discovery , vol.6 , Issue.4 , pp. 303-360
- Grabmeier, J.¹ Rudolph, A.²

46
- 84944318804
- Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate String Joins in a Database (Almost) for Free. In: Proc. of Int. Conf. on Very Large Databases, Rome, Italy, pp. 491-500 (2001)
- (2001) Approximate String Joins in a Database (Almost) for Free. Proc. of Int. Conf. on Very Large Databases, Rome, Italy , pp. 491-500
- Gravano, L.¹ Ipeirotis, P.G.² Jagadish, H.V.³ Koudas, N.⁴ Muthukrishnan, S.⁵ Srivastava, D.⁶

47
- 80455148345
- Record linkage: Current practice and future directions
- Gu, L., Baxter, R.A., Vickers, D., Rainsford, C.: Record Linkage: Current Practice and Future Directions. Technical Report, number 03/83. CSIRO Mathematical and Information Sciences (2001)
- (2001) Technical Report number 03/83 CSIRO Mathematical and Information Sciences
- Gu, L.¹ Baxter, R.A.² Vickers, D.³ Rainsford, C.⁴

48
- 0032091595
- Proc. of ACM SIGMOD Int. Conf. on Management of Data Seattle Washington USA
- Guha, S., Rastogi, R., Shim, K.: CURE: An Efficient Clustering Algorithm for Large Databases. Proc. of ACM SIGMOD Int. Conf. on Management of Data, Seattle, Washington, USA, pp. 73-84 (1998)
- (1998) CURE: An Efficient Clustering Algorithm for Large Databases , pp. 73-84
- Guha, S.¹ Rastogi, R.² Shim, K.³

49
- 0034228041
- Guha, S., Rastogi, R., Shim, K.: ROCK: A Robust Clustering Algorithm for Categorical Attributes. Inf. Syst. 25(5), 345-366 (2001)
- (2001) ROCK: A Robust Clustering Algorithm for Categorical Attributes. Inf. Syst. , vol.25 , Issue.5 , pp. 345-366
- Guha, S.¹ Rastogi, R.² Shim, K.³

50
- 0004137004
- Cambridge University Press Davis
- Gusfield, D.: Algorithms on Strings, Trees and Sequences. Cambridge University Press, Davis (1997)
- (1997) Algorithms on Strings, Trees and Sequences
- Gusfield, D.¹

51
- 72649086387
- Proceedings of VLDB
- Hassanzadeh, O., Chiang, F., Lee, H.C., Miller, R.J.: Framework for Evaluating Clustering Algorithms in Duplicate Detection. Proceedings of VLDB 2(1), 1282-1293 (2009)
- (2009) Framework for Evaluating Clustering Algorithms in Duplicate Detection , vol.2 , Issue.1 , pp. 1282-1293
- Hassanzadeh, O.¹ Chiang, F.² Lee, H.C.³ Miller, R.J.⁴

52
- 70349826301
- Creating probabilistic databases from duplicated data
- Hassanzadeh, O., Miller, R.J.: Creating Probabilistic Databases from Duplicated Data. The VLDB Journal 18(5), 1141-1166 (2009)
- (2009) VLDB Journal , vol.18 , Issue.5 , pp. 1141-1166
- Hassanzadeh, O.¹ Miller, R.J.²

53
- 84976856849
- Proc. of ACM SIGMOD Int. Conf. on Management of Data San Jose California USA
- Hernández, M.A., Stolfo, S.J.: The Merge/Purge Problem for Large Databases. Proc. of ACM SIGMOD Int. Conf. on Management of Data, San Jose, California, USA, pp. 127-138 (1995)
- (1995) The Merge/Purge Problem for Large Databases , pp. 127-138
- Hernández, M.A.¹ Stolfo, S.J.²

54
- 0013331361
- Real-world data is dirty: Data cleansing and the merge/purge problem
- Hernández, M.A., Stolfo, S.J.: Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem. Data Min. and Knowl. Discovery 2(1), 9-37 (1998)
- (1998) Data Min. and Knowl. Discovery , vol.2 , Issue.1 , pp. 9-37
- Hernández, M.A.¹ Stolfo, S.J.²

55
- 70349248411
- Proc. of ACM Int. Conf. on Information and Knowledge Management
- Herschel, M., Naumann, N.: Scaling up Duplicate Detection in Graph Data. Proc. of ACM Int. Conf. on Information and Knowledge Management, pp. 1325-1326 (2008)
- (2008) Scaling up Duplicate Detection in Graph Data , pp. 1325-1326
- Herschel, M.¹ Naumann, N.²

56
- 0041664272
- Index-driven similarity search in metric spaces
- Hjaltason, G.R., Samet, H.: Index-Driven Similarity Search in Metric Spaces. ACM Trans. on Database Syst. 28(4), 517-518 (2003)
- (2003) ACM Trans. on Database Syst. , vol.28 , Issue.4 , pp. 517-518
- Hjaltason, G.R.¹ Samet, H.²

57
- 0031644241
- Proc. of Symposium on Theory of Computing Dallas Texas USA
- Indyk, P., Motwani, R.: Approximate Nearest Neighbor - Towards Removing the Curse of Dimensionality. Proc. of Symposium on Theory of Computing, Dallas, Texas, USA, pp. 604-613 (1998)
- (1998) Approximate Nearest Neighbor - Towards Removing the Curse of Dimensionality , pp. 604-613
- Indyk, P.¹ Motwani, R.²

58
- 33845667955
- Duplicate Record Detection: A Survey
- Ipeirotis, P.G., Verykios, V.S., Elmagarmid, A.K.: Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng. 19(1), 1-16 (2007)
- (2007) IEEE Trans. Knowl. Data Eng. , vol.19 , Issue.1 , pp. 1-16
- Ipeirotis, P.G.¹ Verykios, V.S.² Elmagarmid, A.K.³

59
- 0004161991
- Prentice-Hall Englewood Cliffs
- Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs (1998)
- (1998) Algorithms for Clustering Data
- Jain, A.K.¹ Dubes, R.C.²

60
- 84893405732
- Data clustering: A review
- Jain, A.K., Murty, M.N., Flynn, P.J.: Data Clustering: A Review. ACM Comput. Surv. 31(3), 264-323 (1999)
- (1999) ACM Comput. Surv. , vol.31 , Issue.3 , pp. 264-323
- Jain, A.K.¹ Murty, M.N.² Flynn, P.J.³

61
- 84950419860
- Advances in Record Linkage Methodology as Applied to Matching the 1985 Census of Tampa Florida
- Jaro, M.A.: Advances in Record Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Society 84, 420-424 (1989)
- (1989) Journal of the American Statistical Society , vol.84 , pp. 420-424
- Jaro, M.A.¹

62
- 0005609506
- U.S. General Accounting Office
- Kingsbury, N.R., et al.: Record Linkage and Privacy: Issues in Creating New Federal Research and Statistical Information. U.S. General Accounting Office (2001)
- (2001) Record Linkage and Privacy: Issues in Creating New Federal Research and Statistical Information
- Kingsbury, N.R.¹

63
- 80455148340
- Evaluation of entity resolution approaches on real-world match problems
- Kopcke, H., Rahm, E.: Frameworks for Entity Matching: A Comparison. Data and Know. Engineering 69(2), 197-210 (2010)
- Kopcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. Proc. of the VLDB Endowment 3(1), 484-493 (2010)
- (2010) Proc. of the VLDB Endowment , vol.3 , Issue.1 , pp. 484-493
- Kopcke, H.¹ Thor, A.² Rahm, E.³

64
- 77954338155
- Evaluation of learning-based approaches for matching web data entities
- Kopcke, H., Thor, A., Rahm, E.: Evaluation of Learning-Based Approaches for Matching Web Data Entities. IEEE Internet Computing 14(4), 23-31 (2010)
- (2010) IEEE Internet Computing , vol.14 , Issue.4 , pp. 23-31
- Kopcke, H.¹ Thor, A.² Rahm, E.³

65
- 1542370010
- McCallum, A.: MALLET: A Machine Learning for Language Toolkit, http://mallet.cs.umass.edu
- MALLET: A Machine Learning for Language Toolkit
- McCallum, A.¹

66
- 0031166031
- Size separation spatial join
- Koudas, N., Sevcik, K.C.: Size Separation Spatial Join. Proc. of ACM Int. Conf. On Management of Data, pp. 324-335 (1997)
- (1997) Proc. of ACM Int. Conf. On Management of Data , pp. 324-335
- Koudas, N.¹ Sevcik, K.C.²

67
- 0032652968
- Autonomous citation matching
- Lawrence, S., Bollacker, K., Giles, C.L.: Autonomous Citation Matching. Proc. of ACM Int. Conf. on Autonomous Agents, pp. 392-393 (1999)
- (1999) Proc. of ACM Int. Conf. on Autonomous Agents , pp. 392-393
- Lawrence, S.¹ Bollacker, K.² Giles, C.L.³

68
- 0035545906
- A knowledge-based approach for duplicate elimination in data cleaning
- Low, W.L., Lee, M.L., Ling, T.W.: A Knowledge-Based Approach for Duplicate Elimination in Data Cleaning. Information Systems 26(8), 585-606 (2001)
- (2001) Information Systems , vol.26 , Issue.8 , pp. 585-606
- Low, W.L.¹ Lee, M.L.² Ling, T.W.³

69
- 77952768253
- An efficient duplicate detection system for xml documents
- Lwin, T., Nyunt, T.T.S.: An Efficient Duplicate Detection System for XML Documents. Proc. of IEEE Int. Conf. on Computer Engineering and Applications, pp. 178-182 (2010)
- (2010) Proc. of IEEE Int. Conf. on Computer Engineering and Applications , pp. 178-182
- Lwin, T.¹ Nyunt, T.T.S.²

70
- 0000747663
- Proc. of Int. Conf. on Machine Learning Stanford California USA
- McCallum, A., Freitag, D., Pereira, F.: Maximum Entropy Markov Models for Information Extraction and Segmentation. Proc. of Int. Conf. on Machine Learning, Stanford, California, USA, pp. 591-598 (2000)
- (2000) Maximum Entropy Markov Models for Information Extraction and Segmentation , pp. 591-598
- McCallum, A.¹ Freitag, D.² Pereira, F.³

71
- 0034592784
- In: Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining Boston Massachusetts USA
- McCallum, A., Nigam, K., Ungar, L.: Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA, pp. 169-178 (2000)
- (2000) Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching , pp. 169-178
- McCallum, A.¹ Nigam, K.² Ungar, L.³

72
- 80455148342
- In: Int. VLDB Workshop on Clean Databases Seoul, Korea
- Menestrina, D., Benjelloun, O., Garcia-Molina, H.: Generic Entity Resolution with Data Confidences. In: Int. VLDB Workshop on Clean Databases, Seoul, Korea (2006)
- (2006) Generic Entity Resolution with Data Confidences
- Menestrina, D.¹ Benjelloun, O.² Garcia-Molina, H.³

73
- 0004255908
- McGraw-Hill New York
- Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
- (1997) Machine Learning
- Mitchell, T.M.¹

74
- 0004043396
- Proc. of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery Tucson Arizona USA
- Monge, A.E., Elkan, C.P.: An Efficient Domain-Independent Algorithm For Detecting Approximately Duplicate Database Records. Proc. of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, Arizona, USA, pp. 23-29 (1997)
- (1997) An Efficient Domain-Independent Algorithm For Detecting Approximately Duplicate Database Records , pp. 23-29
- Monge, A.E.¹ Elkan, C.P.²

75
- 85018108837
- Portland Oregon USA
- Monge, A.E., Elkan, C.P.: The Field Matching Problem: Algorithms and Applications. Proc. of Int. Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, USA, pp. 267-270 (1996)
- (1996) The Field Matching Problem: Algorithms and Applications. Proc. of Int. Conf. on Knowledge Discovery and Data Mining , pp. 267-270
- Monge, A.E.¹ Elkan, C.P.²

76
- 35048865820
- Proc. of CoopIS/DOA/ODBASE Int. Conf., Agia Napa Cyprus
- Mukherjee, S., Ramakrishnan, I.V.: Taming the Unstructured: Creating Structured Content from Partially Labeled Schematic Text Sequences. Proc. of CoopIS/DOA/ODBASE Int. Conf., Agia Napa, Cyprus, pp. 909-926 (2004)
- (2004) Taming the Unstructured: Creating Structured Content from Partially Labeled Schematic Text Sequences , pp. 909-926
- Mukherjee, S.¹ Ramakrishnan, I.V.²

77
- 0028959905
- Evaluating the quality of anonymous record linkage using deterministic procedures with the New York State AIDS registry and a hospital discharge file
- Muse, A.G., Mikl, J., Smith, P.F.: Evaluating the quality of anonymous record linkage using deterministic procedures with the New York State AIDS registry and a hospital discharge file. Statistics in Medicine 14, 499-509 (1995)
- (1995) Statistics in Medicine , vol.14 , pp. 499-509
- Muse, A.G.¹ Mikl, J.² Smith, P.F.³

78
- 80455138856
- Proc. KDD Workshop on Data Cleaning Record Linkage and Object Consolidation Washington DC USA
- Neiling, M., Jurk, S.: The Object Identification Framework. In: Proc. KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, Washington, DC, USA, pp. 37-39 (2003)
- (2003) Object Identification Framework , pp. 37-39
- Neiling, M.¹ Jurk, S.²

79
- 33750548434
- Privacy issues in research using record linkage
- Neutel, C.I.: Privacy Issues in Research Using Record Linkage. Pharmacoepidemiology and Drug Safety 6, 367-369 (1997)
- (1997) Pharmacoepidemiology and Drug Safety , vol.6 , pp. 367-369
- Neutel, C.I.¹

80
- 0014087577
- Record linking: The design of efficient systems for linking records into individual and family histories
- Newcombe, H.B.: Record Linking: The Design of Efficient Systems for Linking Records into Individual and Family Histories. American Journal of Human Genetics 19, 335-359 (1967)
- (1967) American Journal of Human Genetics , vol.19 , pp. 335-359
- Newcombe, H.B.¹

81
- 0001592068
- Automatic linkage of vital records
- Newcombe, H.B., Kennedy, J.M., Axford, S.J., James, A.P.: Automatic Linkage of Vital Records. Science 130, 954-959 (1959)
- (1959) Science , vol.130 , pp. 954-959
- Newcombe, H.B.¹ Kennedy, J.M.² Axford, S.J.³ James, A.P.⁴

82
- 0030157411
- In: Proc. of ACM Int. Conf. on Management of Data
- Patel, J., DeWitt, D.J.: Partition Based Spatial-Merge Join. Proc. of ACM Int. Conf. on Management of Data, pp. 259-270 (1996)
- (1996) Partition Based Spatial-Merge Join , pp. 259-270
- Patel, J.¹ DeWitt, D.J.²

83
- 85156206690
- In: Proc. of Ann. Conf. on Neural Information Processing Systems
- Pasula, H., Marthi, B., Milch, B., Russell, S.J., Shpitser, I.: Identity Uncertainty and Citation Matching. Proc. of Ann. Conf. on Neural Information Processing Systems, pp. 1401-1408 (2002)
- (2002) Identity Uncertainty and Citation Matching , pp. 1401-1408
- Pasula, H.¹ Marthi, B.² Milch, B.³ Russell, S.J.⁴ Shpitser, I.⁵

84
- 0242456811
- In: Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining Edmonton Alberta Canada
- Sarawagi, S., Bhamidipaty, A.: Interactive Deduplication using Active Learning. Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, pp. 269-278 (2002)
- (2002) Interactive Deduplication using Active Learning , pp. 269-278
- Sarawagi, S.¹ Bhamidipaty, A.²

85
- 3142777876
- Proc. of SIGMOD Int. Conf. on Management of Data Paris France
- Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. Proc. of SIGMOD Int. Conf. on Management of Data, Paris, France, pp. 743-754 (2004)
- (2004) Efficient Set Joins on Similarity Predicates , pp. 743-754
- Sarawagi, S.¹ Kirpal, A.²

86
- 57049103006
- Improved approximate detection of duplicates for data streams over sliding windows
- Shen, H., Zhang, Y.: Improved Approximate Detection of Duplicates for Data Streams over Sliding Windows. Journal of Computer Science and Technology 23(6), 973-987 (2008)
- (2008) Journal of Computer Science and Technology , vol.23 , Issue.6 , pp. 973-987
- Shen, H.¹ Zhang, Y.²

87
- 80455136873
- In: Proc. of ACM Int. Ws. on Multi-Relational Data Mining
- Singla, P., Domingos, P.: Multi-Relational Record Linkage. Proc. of ACM Int. Ws. on Multi-Relational Data Mining, pp. 31-38 (2004)
- (2004) Multi-Relational Record Linkage , pp. 31-38
- Singla, P.¹ Domingos, P.²

88
- 0019887799
- Identification of common molecular subsequences
- Smith, S., Waterman, M.S.: Identification of Common Molecular Subsequences. Journal of Molecular Biology 147(1), 195-197 (1981)
- (1981) Journal of Molecular Biology , vol.147 , Issue.1 , pp. 195-197
- Smith, S.¹ Waterman, M.S.²

89
- 80455136872
- Statistical Linkage Key Working Group.
- Statistical Linkage Key Working Group. Statistical Data Linkage in Community Services Data Collections (2002)
- (2002) Statistical Data Linkage in Community Services Data Collections

90
- 0242456803
- In: Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining Edmonton Alberta Canada
- Tejada, S., Knoblock, C.A., Minton, S.: Learning Domain-Independent String Transformation Weights for High Accuracy Object Identification. Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, pp. 350-359 (2002)
- (2002) Learning Domain-Independent String Transformation Weights for High Accuracy Object Identification , pp. 350-359
- Tejada, S.¹ Knoblock, C.A.² Minton, S.³

91
- 0009018963
- A model for optimum linkage of records
- Tepping, J.B.: A Model for Optimum Linkage of Records. Journal of the American Statistical Association 63, 1321-1332 (1968)
- (1968) Journal of the American Statistical Association , vol.63 , pp. 1321-1332
- Tepping, J.B.¹

92
- 0034228352
- Verykios, V.S., Elmagarmid, A.K., Houstis, E.N.: Automating the approximate record matching process. Inf. Sci. 126(1-4), 83-98 (2000)
- (2000) Automating the approximate record matching process. Inf. Sci. , vol.126 , Issue.1-4 , pp. 83-98
- Verykios, V.S.¹ Elmagarmid, A.K.² Houstis, E.N.³

93
- 0000681228
- Proc. of Int. Conf. on Very Large Databases, New York City USA
- Weber, R., Schek, H.J., Blott, S.: A Quantitative Analysis and Performance Study for Similarity Search in High-Dimensional Spaces. Proc. of Int. Conf. on Very Large Databases, New York City, USA, pp. 194-205 (1998)
- (1998) A Quantitative Analysis and Performance Study for Similarity Search in High-Dimensional Spaces , pp. 194-205
- Weber, R.¹ Schek, H.J.² Blott, S.³

94
- 33749618105
- In: Proc. of IEEE Int. Conf. on Data Engineering
- Weis, M., Naumann, N.: Detecting Duplicates in Complex XML Data. Proc. of IEEE Int. Conf. on Data Engineering, p. 109 (2006)
- (2006) Detecting Duplicates in Complex XML Data , p. 109
- Weis, M.¹ Naumann, N.²

95
- 80455138853
- Potsdam, Germany
- Weis, M., Naumann, N.: Space and Time Scalability of Duplicate Detection in Graph Data. Tech. Rep. 25, Hasso-Plattner Institut, Potsdam, Germany (2007)
- (2007) Space and Time Scalability of Duplicate Detection in Graph Data. Tech. Rep. 25 Hasso-Plattner Institut
- Weis, M.¹ Naumann, N.²

96
- 0008976521
- In: Proc. Section on Survey Research Methods American Statistical Association
- Winkler, W.E.: String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. In: Proc. Section on Survey Research Methods, American Statistical Association, pp. 354-359 (1990)
- (1990) String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage , pp. 354-359
- Winkler, W.E.¹

97
- 0012866045
- U.S. Census Bureau
- Winkler, W.E.: Overview of Record Linkage and Current Research Directions. Technical Report. Statistical Research Division, U.S. Census Bureau (1999)
- (1999) Overview of Record Linkage and Current Research Directions. Technical Report. Statistical Research Division
- Winkler, W.E.¹

98
- 2942741943
- Tech. Rep. RRS2002/05, U.S. Bureau of the Census Washington D.C. USA
- Winkler, W.E.: Methods for Record Linkage and Bayesian Networks. Tech. Rep. RRS2002/05, U.S. Bureau of the Census, Washington, D.C., USA (2002)
- (2002) Methods for Record Linkage and Bayesian Networks
- Winkler, W.E.¹

99
- 77649244160
- Duplicate-insensitive order statistics computation over data streams
- Zhang, Y., Lin, X., Yuan, Y., Kitsuregawa, M., Zhou, X., Yu, J.X.: Duplicate-insensitive Order Statistics Computation over Data Streams. IEEE Transactions on Knowledge and Data Engineering 22(4), 493-507 (2010)
- (2010) IEEE Transactions on Knowledge and Data Engineering , vol.22 , Issue.4 , pp. 493-507
- Zhang, Y.¹ Lin, X.² Yuan, Y.³ Kitsuregawa, M.⁴ Zhou, X.⁵ Yu, J.X.⁶

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.