



Volume 287, Issue 5461, 2000, Pages 2196-2204

A whole-genome assembly of Drosophila

Author keywords

[No Author keywords available]

Indexed keywords

CENTROMERE; DEVELOPMENTAL GENETICS; DROSOPHILA MELANOGASTER; EUCHROMATIN; GENOME; NONHUMAN; NUCLEOTIDE SEQUENCE; PRIORITY JOURNAL; PROTEIN ASSEMBLY; REVIEW; SEQUENCE ANALYSIS; UNINDEXED SEQUENCE;

EID: 0034708758     PISSN: 00368075     EISSN: None     Source Type: Journal    
DOI: 10.1126/science.287.5461.2196     Document Type: Review
Times cited : (1243)

References (42)
  • 2
    • 15444350252
    • F. R. Blattner et al., Science 277, 1453 (1997).
    • (1997) Science , vol.277 , pp. 1453
    • Blattner, F.R.1
  • 3
    • 8544240102
    • H. W. Mewes et al., Nature 387 (6632 suppl.), 7 (1997).
    • (1997) Nature , vol.387 , Issue.6632 SUPPL. , pp. 7
    • Mewes, H.W.1
  • 4
    • 0032509302
    • The C. elegans Sequencing Consortium, Science 282, 2012 (1998).
    • (1998) Science , vol.282 , pp. 2012
  • 10
    • 0032486191
    • J. C. Venter et al., Science 280, 1540 (1998).
    • (1998) Science , vol.280 , pp. 1540
    • Venter, J.C.1
  • 11
    • 0034708480
    • M. D. Adams et al., Science 287, 2185 (2000).
    • (2000) Science , vol.287 , pp. 2185
    • Adams, M.D.1
  • 12
    • 0034708444
    • G. Rubin et al., Science 287, 2204 (2000).
    • (2000) Science , vol.287 , pp. 2204
    • Rubin, G.1
  • 13
    • 0343688260
    • note
    • In addition to judicious algorithmic application of mate pairs, we are fortunate because with the introduction of capillary gel sequencers, the primary source of false-pairing information for end reads disappears, as a sample is now forced to flow down a tube as opposed to meandering over a slab gel. With careful robotics and library construction the false-pairing rate on mate pairs can be kept to less than 1%.
  • 14
    • 0343688261
    • The ES40 utilizes a 667-MHz Alpha 21264a processor running Tru64 UNIX. Each CPU receives a score of 413 and 500, respectively, for the integer and floating point SPEC 2000 benchmarks (see www.spec.org).
  • 15
    • 0040342927
    • Heidelberg, Germany, August 1999; T. Lengauer et al., Eds. (American Association for Artificial Intelligence, Menlo Park, CA, 1999)
    • During the first 9 months of development, when no significant amount of real data was available, we used a simulator called celsim [E. W. Myers, in Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, Heidelberg, Germany, August 1999; T. Lengauer et al., Eds. (American Association for Artificial Intelligence, Menlo Park, CA, 1999), pp. 202-210] that could either take a mosaic of known sequence, for example, Caenorhabditis elegans or Saccharomyces cerevisiae, and simulate the shotgun process on it or generate synthetic DNA with controllable repeat characteristics and simulate a shotgun of it.
    • (1999) Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology , pp. 202-210
    • Myers, E.W.1
  • 16
    • 0343252597
    • note
    • Starting with the raw sequencing data, one generally trims off a prefix and suffix of a read that is too inaccurate. In the early days, when assembly software was not very robust, one trimmed aggressively to be certain of having a highly accurate remainder. But longer reads imply less oversampling and hence greater efficiency, so as software improved, the cutoff was relaxed to the point where today many use and report read lengths for trimming at the 90 to 95% accuracy level. One must return to a tight 98% stringency for whole-genome shotgun sequencing in order to avoid false overlaps.
  • 17
    • 0031955518
    • We used a software package developed by Paracel, Inc., building on the ideas originally published by B. Ewing, L. Hillier, M. C. Wendl, and P. Green [Genome Res. 8, 175 (1998)].
    • (1998) Genome Res. , vol.8 , pp. 175
    • Ewing, B.1    Hillier, L.2    Wendl, M.C.3    Green, P.4
  • 18
    • 0342817981
    • note
    • The accuracy of reads was evaluated by finding all reads that were sampled from the 29 Mbp of finished Drosophila sequence produced by the BDGP and EDGP. Comparison of the trimmed portions of such reads against the finished sequence was used to determine the accuracy of the read.
  • 19
    • 0342817979
    • note
    • The largest source of error in pairing, lane tracking error, disappears with capillary gel sequencing. The remaining sources of error are chimerism in the insert library and sample tracking in the lab. One of the significant advantages of a whole genome approach is that in concept only three libraries are needed, so one can very carefully craft and assure the quality of these libraries. We estimated the chimerism rate of the library at 0.01% and required the laboratory protocols to be of sufficient quality that a plate would correctly track through the entire sequencing pipeline at least 99.5% of the time. The actual false-pairing rate was measured by examining all pairs wherein one read could uniquely be localized to the interior portion of a finished BAC sequence of Drosophila.
  • 20
    • 0034708791
    • R. A. Hoskins et al., Science 287, 2271 (2000).
    • (2000) Science , vol.287 , pp. 2271
    • Hoskins, R.A.1
  • 21
    • 0343252578
    • note
    • A perfect shredding of a sequence is a set of reads covering the sequence that (i) all have the same length; and (ii) overlap the read that immediately follows them by exactly the same amount.
  • 24
    • 0342817952
    • note
    • … k/k!] exp(-2pF/G). The log of the ratio of these two probabilities, (log e)pF/G - (log 2)k, is our A statistic.
  • 26
    • 0342383050
    • note
    • The computer program continues to be refined. The computer algorithms used and the current version described here are the subject matter of a pending patent application. We are open to collaborations involving this software under terms beneficial to all parties.
  • 27
    • 0342817948
    • The Institute for Genomic Research, personal communication. The data were taken from 10 randomly sampled BACs from Arabidopsis thaliana
    • S. Salzberg, The Institute for Genomic Research, personal communication. The data were taken from 10 randomly sampled BACs from Arabidopsis thaliana.
    • Salzberg, S.1
  • 28
    • 0343252574
    • Because of the variable status of the STS map across chromosomes, several protocols were required to find these sequences and to compare them against the WGS assembly. For the X chromosome, we used public sequences from cosmid fragments wherever there was complete sequence available. Sequence tags were identified by searching for markers within the cytogenetic regions included on the X chromosome in FlyBase, and BAC end sequences mapped to the X and STS-content map contigs from Berkeley, ordered only by their reported cytological range (see www.fruitfly.org and http://flybase.bio.indiana.edu). For chromosomes 2 and 3, we used the 40-bp overgo sequences from the Berkeley STS-content map, which were ordered by BDGP against panels of BACs (20). For chromosomes 4 and Y, we used publicly available sequences ordered based on their cytology, including genes and one set of cosmid ends (see www.fruitfly.org and flybase.bio.indiana.edu). BLAST (38) was used to locate the sequence tags in the assembly, using 95% identity and length of 99 bp as our cutoff for tags on chromosomes X, 4, and Y; the length cutoff was reduced to 38 bp for the 40-bp overgo sequences on chromosomes 2 and 3. We regarded STSs hitting the assembly many times as unreliable for sequence localization for the purposes of this study, and such sequences were eliminated from consideration. When a scaffold showed an inconsistent map association, the location of the STS sequence was then checked against the sequence in the done-tiling path map.
  • 29
    • 0342383047
    • Because of the variable status of the STS map across chromosomes, several protocols were required to find these sequences and to compare them against the WGS assembly. For the X chromosome, we used public sequences from cosmid fragments wherever there was complete sequence available. Sequence tags were identified by searching for markers within the cytogenetic regions included on the X chromosome in FlyBase, and BAC end sequences mapped to the X and STS-content map contigs from Berkeley, ordered only by their reported cytological range (see www.fruitfly.org and http://flybase.bio.indiana.edu). For chromosomes 2 and 3, we used the 40-bp overgo sequences from the Berkeley STS-content map, which were ordered by BDGP against panels of BACs (20). For chromosomes 4 and Y, we used publicly available sequences ordered based on their cytology, including genes and one set of cosmid ends (see www.fruitfly.org and flybase.bio.indiana.edu). BLAST (38) was used to locate the sequence tags in the assembly, using 95% identity and length of 99 bp as our cutoff for tags on chromosomes X, 4, and Y; the length cutoff was reduced to 38 bp for the 40-bp overgo sequences on chromosomes 2 and 3. We regarded STSs hitting the assembly many times as unreliable for sequence localization for the purposes of this study, and such sequences were eliminated from consideration. When a scaffold showed an inconsistent map association, the location of the STS sequence was then checked against the sequence in the done-tiling path map.
  • 30
    • 0343252567
    • note
    • A total of 199 STSs was not located on the assembly, 57% of which are on the X chromosome.
  • 31
    • 0343688242
    • -30 and identity cutoff values of 99% for finished data and 95% for light-shotgun data were used in the BLAST comparisons.
  • 32
    • 0342817940
    • note
    • To resolve discrepancies, the validity of both orders was examined by sequence comparison between the discrepant clone and assembly region and all other clone sequences in the tiling path. Further, the clone and assembly regions in question were compared with the STS-content map. If the weight of evidence data supported the order of either the assembly or the tiling path, we concluded that the supported order was correct.
  • 33
    • 0343252573
    • note
    • The first 30 kbp of tiling clone DS08493 does not match the Celera contig covering the entire region of the tiling path. The adjacent tiling clone BACR11E09 does not match DS08493 across the chimeric junction; instead it agrees with the Celera contig. The GenBank accession number of the clone in question is AC004422.
  • 34
    • 0342817938
    • note
    • 6. For the WGS assembly, we identified 1380 blocks that were hit by less than 500 bp of clone sequence and 794 blocks that were completely missed by the clone sequence. The total number of missed blocks is 2174, which represents a total of 15.2 Mbp.
  • 36
    • 0342817937
    • note
    • Seven conflicts were identified in this study, six of which appear to be owing to transposable elements. The remaining one represents a 30-kbp insert within a Celera contig that does not match the corresponding clone. This discrepancy is still under investigation.
  • 37
    • 0342817936
    • www.sciencemag.org/feature/data/1049666.shl
  • 39
    • 0342383040
    • personal communication
    • R. A. Hoskins, personal communication.
    • Hoskins, R.A.1
  • 40
    • 0342383044
    • note
    • In order to align the Celera sequences unambiguously to the external data, all significant HSPs at the parameters given in (27) were screened to identify "mutually unique regions" where the clone and contig sequences have a unique, reciprocal match relation.
  • 41
    • 0342383041
    • note
    • Most negative gaps arise because of inaccuracies in the distances implied by bundles - the bundle implies a small amount of overlap between two contigs because it is actually short, whereas the reality is that there is a small gap at that location. In a very small number of cases, there is an overlap, but it is because the distance estimate is too long by 3 standard deviations, or because there is a small bit of foreign DNA at the tip of a contig because of untrimmed vector or a chimeric read. None of these negative gaps has yet been found to imply incorrect assembly.
  • 42
    • 0342383043
    • note
    • We wish to thank H. Smith and S. Salzberg for the many collegial exchanges, M. Peterson and his team for keeping the machines humming, R. Thompson and his staff for providing us with an environment conducive to such an intense effort, and A. Glodek, C. Kraft, and A. Deslattes Mays, and their staff for getting the data to us.
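
Note 24 in the list above gives the A statistic as (log e)pF/G - (log 2)k, but the start of that note is truncated, so the surviving text does not define the symbols. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes natural logarithms, and it assumes that p is an interval length in bases, F the number of reads, G the genome size, and k the number of reads observed in the interval. The function names are hypothetical. As a cross-check only, it also assumes the two probabilities being compared are Poisson with means pF/G and 2pF/G, which the text here does not confirm.

```python
import math


def a_statistic(p: float, F: float, G: float, k: int) -> float:
    """Evaluate (log e) * p*F/G - (log 2) * k with natural logs (log e = 1).

    Symbol meanings are assumptions; the source note is truncated.
    """
    if G <= 0:
        raise ValueError("G must be positive")
    return p * F / G - math.log(2) * k


def log_poisson(k: int, mean: float) -> float:
    """Natural log of the Poisson probability of k events (assumed model)."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


if __name__ == "__main__":
    # Hypothetical numbers chosen only to exercise the formula.
    p, F, G, k = 1000, 3_000_000, 120_000_000, 25
    expected_reads = p * F / G
    direct = a_statistic(p, F, G, k)
    from_ratio = log_poisson(k, expected_reads) - log_poisson(k, 2 * expected_reads)
    print(direct, from_ratio)
```

Under the assumed Poisson model, the k! terms cancel in the ratio, which is why the closed form depends only on pF/G and k.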

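
Note 21 in the list above defines a perfect shredding as a set of reads covering the sequence that all have the same length and each overlap the following read by exactly the same amount. A minimal Python sketch of that definition follows. The function name and parameters are hypothetical, and rejecting sequences whose length does not fit the read length and overlap exactly is a choice of this sketch rather than something the note states.

```python
def perfect_shredding(sequence: str, read_len: int, overlap: int) -> list[str]:
    """Return equal-length reads covering `sequence`, each overlapping the
    next by exactly `overlap` bases.

    Raises ValueError when no such set of reads exists for these parameters.
    """
    if not 0 <= overlap < read_len <= len(sequence):
        raise ValueError("need 0 <= overlap < read_len <= len(sequence)")
    step = read_len - overlap  # distance between consecutive read starts
    if (len(sequence) - read_len) % step != 0:
        raise ValueError("sequence length does not fit read_len and overlap")
    last_start = len(sequence) - read_len
    return [sequence[i:i + read_len] for i in range(0, last_start + 1, step)]


if __name__ == "__main__":
    for read in perfect_shredding("ACGTACGTAC", read_len=4, overlap=2):
        print(read)
```

With a 10-base sequence, 4-base reads, and a 2-base overlap, the reads start at positions 0, 2, 4, and 6, so the final read ends exactly at the last base.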

* This information was extracted by KISTI through analysis of Elsevier's SCOPUS database.