SCOPUS Information Search Platform

DTMBIO 2015 - Proceedings of the ACM 9th International Workshop on Data and Text Mining in Biomedical Informatics, co-located with CIKM 2015

2015, Pages 4-12

Evaluation of a machine learning duplicate detection method for bioinformatics databases

(3) Chen, Qingyu (a); Zobel, Justin (a); Verspoor, Karin (a)

(a) UNIVERSITY OF MELBOURNE (Australia)

Author keywords

[No Author keywords available]

Indexed keywords

ARTIFICIAL INTELLIGENCE; BIOINFORMATICS; DATA HANDLING; DATA MINING; DATABASE SYSTEMS; INFORMATION SCIENCE;

BIOINFORMATICS DATABASE; BIOLOGICAL DOMAIN; BIOLOGICAL ENTITIES; DATABASE RECORDS; DUPLICATE DETECTION; IMBALANCED DATA; LARGE DATASETS; MACHINE LEARNING TECHNIQUES;

LEARNING SYSTEMS;

EID: 84960860880 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1145/2811163.2811175 Document Type: Conference Paper

Times cited : (22)

References (33)

1
- 0028289467
- Issues in searching molecular sequence databases
- S. F. Altschul, M. S. Boguski, W. Gish, J. C. Wootton, et al. Issues in searching molecular sequence databases. Nature genetics, 6(2):119-129, 1994.
- (1994) Nature Genetics , vol.6 , Issue.2 , pp. 119-129
- Altschul, S.F.¹ Boguski, M.S.² Gish, W.³ Wootton, J.C.⁴

2
- 13444273448
- A. Bairoch, R. Apweiler, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, et al. The universal protein resource (uniprot). Nucleic acids research, 33(suppl 1):D154-D159, 2005.
- (2005) The Universal Protein Resource (Uniprot). Nucleic Acids Research , vol.33 , pp. D154-D159
- Bairoch, A.¹ Apweiler, R.² Wu, C.H.³ Barker, W.C.⁴ Boeckmann, B.⁵ Ferro, S.⁶ Gasteiger, E.⁷ Huang, H.⁸ Lopez, R.⁹ Magrane, M.¹⁰

3
- 0028354228
- Blood pressure measurement error: Its effect on cross-sectional and trend analyses
- S. Bennett. Blood pressure measurement error: its effect on cross-sectional and trend analyses. Journal of clinical epidemiology, 47(3):293-301, 1994.
- (1994) Journal of Clinical Epidemiology , vol.47 , Issue.3 , pp. 293-301
- Bennett, S.¹

4
- 85053815327
- D. A. Benson, M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. Genbank. Nucleic acids research, page gks1195, 2012.
- (2012) Genbank. Nucleic Acids Research , pp. gks1195
- Benson, D.A.¹ Cavanaugh, M.² Clark, K.³ Karsch-Mizrachi, I.⁴ Lipman, D.J.⁵ Ostell, J.⁶ Sayers, E.W.⁷

5
- 85027084922
- D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler. Genbank. Nucleic acids research, page gks1195, 2006.
- (2006) Genbank. Nucleic Acids Research , pp. gks1195
- Benson, D.A.¹ Karsch-Mizrachi, I.² Lipman, D.J.³ Ostell, J.⁴ Wheeler, D.L.⁵

6
- 0030271920
- Go hunting in sequence databases but watch out for the traps
- P. Bork and A. Bairoch. Go hunting in sequence databases but watch out for the traps. Trends in Genetics, 12(10):425-427, 1996.
- (1996) Trends in Genetics , vol.12 , Issue.10 , pp. 425-427
- Bork, P.¹ Bairoch, A.²

7
- 34447248788
- Clustered sequence representation for fast homology search
- M. Cameron, Y. Bernstein, and H. E. Williams. Clustered sequence representation for fast homology search. Journal of Computational Biology, 14(5):594-614, 2007.
- (2007) Journal of Computational Biology , vol.14 , Issue.5 , pp. 594-614
- Cameron, M.¹ Bernstein, Y.² Williams, H.E.³

8
- 84949153227
- Detecting redundancy in biological databases? An efficient approach
- S. Chellamuthu and D. M. Punithavalli. Detecting redundancy in biological databases? an efficient approach. Global Journal of Computer Science and Technology, 9(4), 2009.
- (2009) Global Journal of Computer Science and Technology , vol.9 , Issue.4
- Chellamuthu, S.¹ Punithavalli, D.M.²

9
- 0035452641
- Efficient data reconciliation
- M. Cochinwala, V. Kurien, G. Lalk, and D. Shasha. Efficient data reconciliation. Information Sciences, 137(1):1-15, 2001.
- (2001) Information Sciences , vol.137 , Issue.1 , pp. 1-15
- Cochinwala, M.¹ Kurien, V.² Lalk, G.³ Shasha, D.⁴

10
- 0035424599
- Intrinsic errors in genome annotation
- D. Devos and A. Valencia. Intrinsic errors in genome annotation. TRENDS in Genetics, 17(8):429-431, 2001.
- (2001) TRENDS in Genetics , vol.17 , Issue.8 , pp. 429-431
- Devos, D.¹ Valencia, A.²

11
- 33845667955
- Duplicate record detection: A survey
- A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. Knowledge and Data Engineering, IEEE Transactions on, 19(1):1-16, 2007.
- (2007) IEEE Transactions on Knowledge and Data Engineering , vol.19 , Issue.1 , pp. 1-16
- Elmagarmid, A.K.¹ Ipeirotis, P.G.² Verykios, V.S.³

12
- 84865627282
- Data quality: Theory and practice
- Springer
- W. Fan. Data quality: Theory and practice. In Web-Age Information Management, pages 1-16. Springer, 2012.
- (2012) Web-Age Information Management , pp. 1-16
- Fan, W.¹

13
- 0344756845
- Declarative data cleaning: Language, model, and algorithms
- H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative data cleaning: Language, model, and algorithms. Proc. 27th Int'l Conf. Very Large Databases, 2001.
- (2001) Proc. 27th Int'l Conf. Very Large Databases
- Galhardas, H.¹ Florescu, D.² Shasha, D.³ Simon, E.⁴ Saita, C.⁵

14
- 0029868410
- Cleanup: A fast computer program for removing redundancies from nucleotide sequence databases
- G. Grillo, M. Attimonelli, S. Liuni, and G. Pesole. Cleanup: a fast computer program for removing redundancies from nucleotide sequence databases. Computer applications in the biosciences: CABIOS, 12(1):1-8, 1996.
- (1996) Computer Applications in the Biosciences: CABIOS , vol.12 , Issue.1 , pp. 1-8
- Grillo, G.¹ Attimonelli, M.² Liuni, S.³ Pesole, G.⁴

15
- 84958071711
- Introduction to arules-A computational environment for mining association rules and frequent item sets
- M. Hahsler, B. Grün, K. Hornik, and C. Buchta. Introduction to arules-a computational environment for mining association rules and frequent item sets. The Comprehensive R Archive Network, 2009.
- (2009) The Comprehensive R Archive Network
- Hahsler, M.¹ Grün, B.² Hornik, K.³ Buchta, C.⁴

16
- 0031829372
- Removing near-neighbour redundancy from large protein sequence collections
- L. Holm and C. Sander. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14(5):423-429, 1998.
- (1998) Bioinformatics , vol.14 , Issue.5 , pp. 423-429
- Holm, L.¹ Sander, C.²

17
- 84960859412
- Duplicate detection in biological data using association rule mining
- P34180
- J. L. Koh, M. L. Lee, A. M. Khan, P. T. Tan, and V. Brusic. Duplicate detection in biological data using association rule mining. Locus, 501(P34180):S22388, 2004.
- (2004) Locus , vol.501 , pp. S22388
- Koh, J.L.¹ Lee, M.L.² Khan, A.M.³ Tan, P.T.⁴ Brusic, V.⁵

18
- 0035239194
- Turkey and chicken interferon-γ, which share high sequence identity, are biologically cross-reactive
- S. Lawson, L. Rothwell, B. Lambrecht, K. Howes, K. Venugopal, and P. Kaiser. Turkey and chicken interferon-γ, which share high sequence identity, are biologically cross-reactive. Developmental & Comparative Immunology, 25(1):69-82, 2001.
- (2001) Developmental & Comparative Immunology , vol.25 , Issue.1 , pp. 69-82
- Lawson, S.¹ Rothwell, L.² Lambrecht, B.³ Howes, K.⁴ Venugopal, K.⁵ Kaiser, P.⁶

19
- 33745634395
- Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences
- W. Li and A. Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658-1659, 2006.
- (2006) Bioinformatics , vol.22 , Issue.13 , pp. 1658-1659
- Li, W.¹ Godzik, A.²

20
- 0036699189
- Sequence clustering strategies improve remote homology recognitions while reducing search times
- W. Li, L. Jaroszewski, and A. Godzik. Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein engineering, 15(8):643-649, 2002.
- (2002) Protein Engineering , vol.15 , Issue.8 , pp. 643-649
- Li, W.¹ Jaroszewski, L.² Godzik, A.³

21
- 26444439362
- Data quality in genome databases
- H. Müller, F. Naumann, and J.-C. Freytag. Data quality in genome databases. Eighth International Conference on Information Quality (IQ 2003), 2003.
- (2003) Eighth International Conference on Information Quality (IQ 2003)
- Müller, H.¹ Naumann, F.² Freytag, J.-C.³

22
- 77954714476
- Detecting duplicate biological entities using shortest path edit distance
- A. Rudniy, M. Song, and J. Geller. Detecting duplicate biological entities using shortest path edit distance. International journal of data mining and bioinformatics, 4(4):395-410, 2010.
- (2010) International Journal of Data Mining and Bioinformatics , vol.4 , Issue.4 , pp. 395-410
- Rudniy, A.¹ Song, M.² Geller, J.³

23
- 84867684415
- Protein sequence redundancy reduction: Comparison of various method
- K. Sikic and O. Carugo. Protein sequence redundancy reduction: comparison of various method. Bioinformation, 5(6):234, 2010.
- (2010) Bioinformation , vol.5 , Issue.6 , pp. 234
- Sikic, K.¹ Carugo, O.²

24
- 78049440735
- Detecting duplicate biological entities using markov random field-based edit distance
- M. Song and A. Rudniy. Detecting duplicate biological entities using markov random field-based edit distance. Knowledge and information systems, 25(2):371-387, 2010.
- (2010) Knowledge and Information Systems , vol.25 , Issue.2 , pp. 371-387
- Song, M.¹ Rudniy, A.²

25
- 34347388470
- Uniref: Comprehensive and non-redundant uniprot reference clusters
- B. E. Suzek, H. Huang, P. McGarvey, R. Mazumder, and C. H. Wu. Uniref: comprehensive and non-redundant uniprot reference clusters. Bioinformatics, 23(10):1282-1288, 2007.
- (2007) Bioinformatics , vol.23 , Issue.10 , pp. 1282-1288
- Suzek, B.E.¹ Huang, H.² McGarvey, P.³ Mazumder, R.⁴ Wu, C.H.⁵

26
- 0032964297
- Blast 2 sequences, a new tool for comparing protein and nucleotide sequences
- T. A. Tatusova and T. L. Madden. Blast 2 sequences, a new tool for comparing protein and nucleotide sequences. FEMS microbiology letters, 174(2):247-250, 1999.
- (1999) FEMS Microbiology Letters , vol.174 , Issue.2 , pp. 247-250
- Tatusova, T.A.¹ Madden, T.L.²

27
- 84946564866
- R. C. Team. R language definition, 2000.
- (2000) R Language Definition

28
- 0035545848
- Learning object identification rules for information integration
- S. Tejada, C. A. Knoblock, and S. Minton. Learning object identification rules for information integration. Information Systems, 26(8):607-633, 2001.
- (2001) Information Systems , vol.26 , Issue.8 , pp. 607-633
- Tejada, S.¹ Knoblock, C.A.² Minton, S.³

29
- 33846897439
- Using duplicate genotyped data in genetic analyses: Testing association and estimating error rates
- N. L. Tintle, D. Gordon, F. J. McMahon, and S. J. Finch. Using duplicate genotyped data in genetic analyses: testing association and estimating error rates. Statistical applications in genetics and molecular biology, 6(1), 2007.
- (2007) Statistical Applications in Genetics and Molecular Biology , vol.6 , Issue.1
- Tintle, N.L.¹ Gordon, D.² McMahon, F.J.³ Finch, S.J.⁴

30
- 0034228352
- Automating the approximate record-matching process
- V. S. Verykios, A. K. Elmagarmid, and E. N. Houstis. Automating the approximate record-matching process. Information sciences, 126(1):83-98, 2000.
- (2000) Information Sciences , vol.126 , Issue.1 , pp. 83-98
- Verykios, V.S.¹ Elmagarmid, A.K.² Houstis, E.N.³

31
- 34748898358
- The current state of business intelligence
- H. J. Watson and B. H. Wixom. The current state of business intelligence. Computer, 40(9):96-99, 2007.
- (2007) Computer , vol.40 , Issue.9 , pp. 96-99
- Watson, H.J.¹ Wixom, B.H.²

32
- 84870254132
- Molecular phylogeny of north American branchiobdellida (annelida: Clitellata)
- B. W. Williams, S. R. Gelder, H. C. Proctor, and D. W. Coltman. Molecular phylogeny of north american branchiobdellida (annelida: Clitellata). Molecular phylogenetics and evolution, 66(1):30-42, 2013.
- (2013) Molecular Phylogenetics and Evolution , vol.66 , Issue.1 , pp. 30-42
- Williams, B.W.¹ Gelder, S.R.² Proctor, H.C.³ Coltman, D.W.⁴

33
- 84931084134
- Starcode: Sequence clustering based on all-pairs search
- E. V. Zorita, P. Cuscó, and G. Filion. Starcode: sequence clustering based on all-pairs search. Bioinformatics, page btv053, 2015.
- (2015) Bioinformatics , pp. btv053
- Zorita, E.V.¹ Cuscó, P.² Filion, G.³

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.