Volume 287, Issue 5461, 2000, Pages 2196-2204

A whole-genome assembly of Drosophila

(29) Myers, Eugene Wᵃ; Sutton, Granger Gᵃ; Delcher, Art Lᵃ; Dew, Ian Mᵃ; Fasulo, Dan Pᵃ; Flanigan, Michael Jᵃ; Kravitz, Saul Aᵃ; Mobarry, Clark Mᵃ; Reinert, Knut H Jᵃ; Remington, Karin Aᵃ; Anson, Eric Lᵃ; Bolanos, Randall Aᵃ; Chou, Hui Hsienᵃ; Jordan, Catherine Mᵃ; Halpern, Aaron Lᵃ; Lonardi, Stefanoᵃ; Beasley, Ellen Mᵃ; Brandon, Rhonda Cᵃ; Chen, Linᵃ; Dunn, Patrick Jᵃ; et al.

a CELERA GENOMICS (United States)

b UNIVERSITY OF CALIFORNIA (United States)

Author keywords

[No Author keywords available]

Indexed keywords

CENTROMERE; DEVELOPMENTAL GENETICS; DROSOPHILA MELANOGASTER; EUCHROMATIN; GENOME; NONHUMAN; NUCLEOTIDE SEQUENCE; PRIORITY JOURNAL; PROTEIN ASSEMBLY; REVIEW; SEQUENCE ANALYSIS; UNINDEXED SEQUENCE;

ALGORITHMS; ANIMALS; CHROMATIN; COMPUTATIONAL BIOLOGY; CONTIG MAPPING; DROSOPHILA MELANOGASTER; EUCHROMATIN; GENES, INSECT; GENOME; HETEROCHROMATIN; MOLECULAR SEQUENCE DATA; PHYSICAL CHROMOSOME MAPPING; REPETITIVE SEQUENCES, NUCLEIC ACID; SEQUENCE ANALYSIS, DNA; SEQUENCE TAGGED SITES;

DROSOPHILA MELANOGASTER; MELANOGASTER;

EID: 0034708758 PISSN: 00368075 EISSN: None Source Type: Journal
DOI: 10.1126/science.287.5461.2196 Document Type: Review

Times cited : (1243)

References (42)

1
- 0040342928
- F. Sanger, S. Nicklen, A. R. Coulson, Proc. Natl. Acad. Sci. U.S.A. 74, 12 (1977).
- (1977) Proc. Natl. Acad. Sci. U.S.A. , vol.74 , pp. 12
- Sanger, F.¹ Nicklen, S.² Coulson, A.R.³

2
- 15444350252
- F. R. Blattner et al., Science 277, 1453 (1997).
- (1997) Science , vol.277 , pp. 1453
- Blattner, F.R.¹

3
- 8544240102
- H. W. Mewes et al., Nature 387 (6632 suppl.), 7 (1997).
- (1997) Nature , vol.387 , Issue.6632 SUPPL. , pp. 7
- Mewes, H.W.¹

4
- 0032509302
- The C. elegans Sequencing Consortium, Science 282, 2012 (1998).
- (1998) Science , vol.282 , pp. 2012

5
- 0029653518
- R. D. Fleischmann et al., Science 269, 496 (1995).
- (1995) Science , vol.269 , pp. 496
- Fleischmann, R.D.¹

6
- 0039750838
- F. Sanger, A. R. Coulson, G. F. Hong, D. F. Hill, G. B. Petersen, J. Mol. Biol. 162, 4 (1982).
- (1982) J. Mol. Biol. , vol.162 , pp. 4
- Sanger, F.¹ Coulson, A.R.² Hong, G.F.³ Hill, D.F.⁴ Petersen, G.B.⁵

7
- 0029955205
- J. C. Venter, H. O. Smith, L. Hood, Nature 381, 364 (1996).
- (1996) Nature , vol.381 , pp. 364
- Venter, J.C.¹ Smith, H.O.² Hood, L.³

8
- 0030950735
- J. Weber and E. W. Myers, Genome Res. 7, 401 (1997).
- (1997) Genome Res. , vol.7 , pp. 401
- Weber, J.¹ Myers, E.W.²

9
- 0030972827
- P. Green, Genome Res. 7, 410 (1997).
- (1997) Genome Res. , vol.7 , pp. 410
- Green, P.¹

10
- 0032486191
- J. C. Venter et al., Science 280, 1540 (1998).
- (1998) Science , vol.280 , pp. 1540
- Venter, J.C.¹

11
- 0034708480
- M. D. Adams et al., Science 287, 2185 (2000).
- (2000) Science , vol.287 , pp. 2185
- Adams, M.D.¹

12
- 0034708444
- G. Rubin et al., Science 287, 2204 (2000).
- (2000) Science , vol.287 , pp. 2204
- Rubin, G.¹

13
- 0343688260
- note
- In addition to judicious algorithmic application of mate pairs, we are fortunate because with the introduction of capillary gel sequencers, the primary source of false-pairing information for end reads disappears, as a sample is now forced to flow down a tube as opposed to meandering over a slab gel. With careful robotics and library construction the false-pairing rate on mate pairs can be kept to less than 1%.

14
- 0343688261
- The ES40 utilizes a 667-MHz Alpha 21264a processor running Tru64 UNIX. Each CPU receives a score of 413 and 500, respectively, for the integer and floating point SPEC 2000 benchmarks (see www.spec.org).

15
- 0040342927
- During the first 9 months of development, when no significant amount of real data was available, we used a simulator called celsim [E. W. Myers, in Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, Heidelberg, Germany, August 1999; T. Lengauer et al., Eds. (American Association for Artificial Intelligence, Menlo Park, CA, 1999), pp. 202-210] that could either take a mosaic of known sequence, for example, Caenorhabditis elegans or Saccharomyces cerevisiae, and simulate the shotgun process on it or generate synthetic DNA with controllable repeat characteristics and simulate a shotgun of it.
- (1999) Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology , pp. 202-210
- Myers, E.W.¹

16
- 0343252597
- note
- Starting with the raw sequencing data, one generally trims off a prefix and suffix of a read that is too inaccurate. In the early days, when assembly software was not very robust, one trimmed aggressively to be certain of having a highly accurate remainder. But longer reads imply less oversampling and hence greater efficiency, so as software improved, the cutoff was relaxed to the point where today many use and report read lengths for trimming at the 90 to 95% accuracy level. One must return to a tight 98% stringency for whole-genome shotgun sequencing in order to avoid false overlaps.

17
- 0031955518
- We used a software package developed by Paracel, Inc., building on the ideas originally published by B. Ewing, L. Hillier, M. C. Wendl, and P. Green [Genome Res. 8, 175 (1998)].
- (1998) Genome Res. , vol.8 , pp. 175
- Ewing, B.¹ Hillier, L.² Wendl, M.C.³ Green, P.⁴

18
- 0342817981
- note
- The accuracy of reads was evaluated by finding all reads that were sampled from the 29 Mbp of finished Drosophila sequence produced by the BDGP and EDGP. Comparison of the trimmed portions of such reads against the finished sequence was used to determine the accuracy of the read.

19
- 0342817979
- note
- The largest source of error in pairing, lane tracking error, disappears with capillary gel sequencing. The remaining sources of error are chimerism in the insert library and sample tracking in the lab. One of the significant advantages of a whole genome approach is that in concept only three libraries are needed, so one can very carefully craft and assure the quality of these libraries. We estimated the chimerism rate of the library at 0.01% and required the laboratory protocols to be of sufficient quality that a plate would correctly track through the entire sequencing pipeline at least 99.5% of the time. The actual false-pairing rate was measured by examining all pairs wherein one read could uniquely be localized to the interior portion of a finished BAC sequence of Drosophila.

20
- 0034708791
- R. A. Hoskins et al., Science 287, 2271 (2000).
- (2000) Science , vol.287 , pp. 2271
- Hoskins, R.A.¹

21
- 0343252578
- note
- A perfect shredding of a sequence is a set of reads covering the sequence that (i) all have the same length; and (ii) overlap the read that immediately follows them by exactly the same amount.
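The definition in note 21 can be illustrated with a minimal Python sketch. The function name `perfect_shred` and its parameters are hypothetical, chosen for illustration only; this is not the tooling used by the authors. It assumes the reads must tile the sequence exactly, so it rejects inputs where equal-length reads with a constant overlap cannot end precisely at the last base.

```python
def perfect_shred(sequence: str, read_length: int, overlap: int) -> list[str]:
    """Cut `sequence` into equal-length reads, each overlapping the next by `overlap` bases.

    Hypothetical helper illustrating the two conditions of a perfect shredding:
    (i) all reads have the same length, (ii) consecutive reads overlap by the same amount.
    """
    if not 0 <= overlap < read_length:
        raise ValueError("overlap must satisfy 0 <= overlap < read_length")
    if len(sequence) < read_length:
        raise ValueError("sequence is shorter than one read")
    step = read_length - overlap  # distance between consecutive read start positions
    if (len(sequence) - read_length) % step != 0:
        raise ValueError("reads of this length and overlap cannot cover the sequence exactly")
    return [
        sequence[start:start + read_length]
        for start in range(0, len(sequence) - read_length + 1, step)
    ]
```

For example, `perfect_shred("ABCDEFGHIJ", 4, 2)` yields four reads of length 4, each sharing 2 characters with the read that follows it.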

22
- 0025183708
- S. Altschul, W. Gish, W. Miller, E. W. Myers, D. Lipman, J. Mol. Biol. 215, 403 (1990).
- (1990) J. Mol. Biol. , vol.215 , pp. 403
- Altschul, S.¹ Gish, W.² Miller, W.³ Myers, E.W.⁴ Lipman, D.⁵

23
- 0039158407
- E. W. Myers, J. Comp. Biol. 2, 2 (1995).
- (1995) J. Comp. Biol. , vol.2 , pp. 2
- Myers, E.W.¹

24
- 0342817952
- note
- [...] k/k!] exp(-2pF/G). The log of the ratio of these two probabilities, (log e)pF/G - (log 2)k, is our A statistic.
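The closed form quoted in note 24 can be evaluated directly. The sketch below is illustrative only: the surviving text of the note does not define p, F, G, or k, so they are treated here as plain numeric inputs, and the function name `a_statistic` is hypothetical. Natural logarithms are assumed, so the factor (log e) equals 1.

```python
import math


def a_statistic(p: float, F: float, G: float, k: int) -> float:
    """Evaluate (log e) * p * F / G - (log 2) * k using natural logarithms.

    The meanings of p, F, G and k are not given in the surviving note text;
    this function only computes the stated expression.
    """
    if G <= 0:
        raise ValueError("G must be positive")
    return p * F / G - math.log(2) * k
```

With natural logs the first term reduces to pF/G, so the value falls by ln 2 (about 0.693) for each unit increase in k.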

25
- 0040342948
- G. Churchill and M. Waterman, Genomics 14, 89 (1992).
- (1992) Genomics , vol.14 , pp. 89
- Churchill, G.¹ Waterman, M.²

26
- 0342383050
- note
- The computer program continues to be refined. The computer algorithms used and the current version described here are the subject matter of a pending patent application. We are open to collaborations involving this software under terms beneficial to all parties.

27
- 0342817948
- S. Salzberg, The Institute for Genomic Research, personal communication. The data were taken from 10 randomly sampled BACs from Arabidopsis thaliana.
- Salzberg, S.¹

28
- 0343252574
- Because of the variable status of the STS map across chromosomes, several protocols were required to find these sequences and to compare them against the WGS assembly. For the X chromosome, we used public sequences from cosmid fragments wherever there was complete sequence available. Sequence tags were identified by searching for markers within the cytogenetic regions included on the X chromosome in FlyBase, and BAC end sequences mapped to the X and STS-content map contigs from Berkeley, ordered only by their reported cytological range (see www.fruitfly.org and http://flybase.bio.indiana.edu). For chromosomes 2 and 3, we used the 40-bp overgo sequences from the Berkeley STS-content map, which were ordered by BDGP against panels of BACs (20). For chromosomes 4 and Y, we used publicly available sequences ordered based on their cytology, including genes and one set of cosmid ends (see www.fruitfly.org and flybase.bio.indiana.edu). BLAST (38) was used to locate the sequence tags in the assembly, using 95% identity and length of 99 bp as our cutoff for tags on chromosomes X, 4, and Y; the length cutoff was reduced to 38 bp for the 40-bp overgo sequences on chromosomes 2 and 3. We regarded STSs hitting the assembly many times as unreliable for sequence localization for the purposes of this study, and such sequences were eliminated from consideration. When a scaffold showed an inconsistent map association, the location of the STS sequence was then checked against the sequence in the clone-tiling path map.
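The cutoffs described in note 28 can be expressed as a small filter. The Python sketch below is a hedged illustration, not the authors' pipeline: the `Hit` record, its field names, and the function `localize_tags` are hypothetical, and the note does not say how many hits count as "many times", so the `max_hits=1` default is an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit of a sequence tag against the assembly (hypothetical record layout)."""
    tag_id: str
    chromosome: str
    percent_identity: float
    aligned_length: int


# Chromosomes whose tags are 40-bp overgo sequences, per the note.
OVERGO_CHROMOSOMES = {"2", "3"}


def min_length(chromosome: str) -> int:
    """Length cutoff from the note: 38 bp for chromosomes 2 and 3, 99 bp otherwise."""
    return 38 if chromosome in OVERGO_CHROMOSOMES else 99


def localize_tags(hits: list[Hit], max_hits: int = 1) -> list[Hit]:
    """Keep hits passing the identity and length cutoffs, then drop multiply-hitting tags.

    `max_hits` is an assumed threshold; the note only says tags hitting
    "many times" were eliminated.
    """
    passing = [
        h for h in hits
        if h.percent_identity >= 95.0 and h.aligned_length >= min_length(h.chromosome)
    ]
    counts = Counter(h.tag_id for h in passing)
    return [h for h in passing if counts[h.tag_id] <= max_hits]
```

Counting hits only after the identity and length filter means a tag with one strong hit and several weak ones is still kept; counting before filtering would be a stricter alternative.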

29
- 0342383047
- Because of the variable status of the STS map across chromosomes, several protocols were required to find these sequences and to compare them against the WGS assembly. For the X chromosome, we used public sequences from cosmid fragments wherever there was complete sequence available. Sequence tags were identified by searching for markers within the cytogenetic regions included on the X chromosome in FlyBase, and BAC end sequences mapped to the X and STS-content map contigs from Berkeley, ordered only by their reported cytological range (see www.fruitfly.org and http://flybase.bio.indiana.edu). For chromosomes 2 and 3, we used the 40-bp overgo sequences from the Berkeley STS-content map, which were ordered by BDGP against panels of BACs (20). For chromosomes 4 and Y, we used publicly available sequences ordered based on their cytology, including genes and one set of cosmid ends (see www.fruitfly.org and flybase.bio.indiana.edu). BLAST (38) was used to locate the sequence tags in the assembly, using 95% identity and length of 99 bp as our cutoff for tags on chromosomes X, 4, and Y; the length cutoff was reduced to 38 bp for the 40-bp overgo sequences on chromosomes 2 and 3. We regarded STSs hitting the assembly many times as unreliable for sequence localization for the purposes of this study, and such sequences were eliminated from consideration. When a scaffold showed an inconsistent map association, the location of the STS sequence was then checked against the sequence in the clone-tiling path map.

30
- 0343252567
- note
- A total of 199 STSs was not located on the assembly, 57% of which are on the X chromosome.

31
- 0343688242
- -30 and identity cutoff values of 99% for finished data and 95% for light-shotgun data were used in the BLAST comparisons.

32
- 0342817940
- note
- To resolve discrepancies, the validity of both orders was examined by sequence comparison between the discrepant clone and assembly region and all other clone sequences in the tiling path. Further, the clone and assembly regions in question were compared with the STS-content map. If the weight of evidence data supported the order of either the assembly or the tiling path, we concluded that the supported order was correct.

33
- 0343252573
- note
- The first 30 kbp of tiling clone DS08493 does not match the Celera contig covering the entire region of the tiling path. The adjacent tiling clone BACR11E09 does not match DS08493 across the chimeric junction; instead it agrees with the Celera contig. The GenBank accession number of the clone in question is AC004422.

34
- 0342817938
- note
- 6. For the WGS assembly, we identified 1380 blocks that were hit by less than 500 bp of clone sequence and 794 blocks that were completely missed by the clone sequence. The total number of missed blocks is 2174, which represents a total of 15.2 Mbp.

35
- 0032857480
- M. Ashburner et al., Genetics 153, 179 (1999).
- (1999) Genetics , vol.153 , pp. 179
- Ashburner, M.¹

36
- 0342817937
- note
- Seven conflicts were identified in this study, six of which appear to be owing to transposable elements. The remaining conflict represents a 30-kbp insert within a Celera contig that does not match the corresponding clone. This discrepancy is still under investigation.

37
- 0342817936
- www.sciencemag.org/feature/data/1049666.shl

38
- 0030801002
- S. Altschul et al., Nucleic Acids Res. 25, 3389 (1997).
- (1997) Nucleic Acids Res. , vol.25 , pp. 3389
- Altschul, S.¹

39
- 0342383040
- personal communication
- R. A. Hoskins, personal communication.
- Hoskins, R.A.¹

40
- 0342383044
- note
- In order to align the Celera sequences unambiguously to the external data, all significant HSPs at the parameters given in (27) were screened to identify "mutually unique regions" where the clone and contig sequences have a unique, reciprocal match relation.

41
- 0342383041
- note
- Most negative gaps arise because of inaccuracies in the distances implied by bundles - the bundle implies a small amount of overlap between two contigs because it is actually short, whereas the reality is that there is a small gap at that location. In a very small number of cases, there is an overlap, but it is because the distance estimate is too long by 3 standard deviations, or because there is a small bit of foreign DNA at the tip of a contig because of untrimmed vector or a chimeric read. None of these negative gaps has yet been found to imply incorrect assembly.

42
- 0342383043
- note
- We wish to thank H. Smith and S. Salzberg for the many collegial exchanges, M. Peterson and his team for keeping the machines humming, R. Thompson and his staff for providing us with an environment conducive to such an intense effort, and A. Glodek, C. Kraft, A. Deslattes Mays, and their staff for getting the data to us.

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.