-
1
-
-
0004279283
-
-
Oxford Univ. Press, Oxford
-
B. Lewin, Genes VI (Oxford Univ. Press, Oxford, 1997)
-
(1997)
Genes VI
-
-
Lewin, B.1
-
6
-
-
0026647310
-
-
R. Guigo, S. Knudsen, N. Drake, and T. F. Smith, J. Mol. Biol. 226, 141 (1992)
-
(1992)
J. Mol. Biol.
, vol.226
, pp. 141
-
-
Guigo, R.1
Knudsen, S.2
Drake, N.3
Smith, T.F.4
-
14
-
-
0031010292
-
-
S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhattacharya, and R. Ramaswamy, Comput. Appl. Biosci. 13, 263 (1997)
-
(1997)
Comput. Appl. Biosci.
, vol.13
, pp. 263
-
-
Tiwari, S.1
Ramachandran, S.2
Bhattacharya, A.3
Bhattacharya, S.4
Ramaswamy, R.5
-
21
-
-
0030768930
-
-
J.-M. Claverie, Hum. Mol. Genet. 6, 1735 (1997); R. Guigo, DNA Composition, Codon Usage, and Exon Prediction, in Bishop (ed.) “Genetics Databases” (Academic Press, New York, 1999), pp. 53–79.
-
(1997)
Hum. Mol. Genet.
, vol.6
, pp. 1735
-
-
Claverie, J.-M.1
-
22
-
-
85036390359
-
-
A coding measure is a function f that maps a statistical pattern (Formula presented) to a real number (Formula presented) such that the probability distribution functions of y are different in coding and noncoding DNA. Typically, (Formula presented) is high dimensional, and f depends on many empirical parameters. Typically, these parameters vary significantly from species to species. Hence, these parameters must be fitted by empirical analyses of species-specific data sets. The process of fitting the parameters is called training of the coding measure.
-
-
-
-
23
-
-
85036362624
-
-
The mutual information function is similar to, but different from, autocorrelation functions (Ref. 13). Its main advantage over correlation functions is that it does not require any mapping of symbols to numbers, which affects the analysis of symbolic sequences by correlation functions, because correlation functions are not invariant under changes of the map. Moreover, the mutual information function is capable of detecting any deviation from statistical independence, whereas—by definition—correlation functions measure only linear dependences. Hence, we use the mutual information function in our analysis of DNA sequences.
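As an illustration of the point about symbol-to-number mappings, the following is a minimal sketch of a plug-in estimate of the mutual information between symbols k positions apart. It assumes the standard definition I(k) = Σ p_ab(k) log2[p_ab(k) / (p_a p_b)]; the function name `mutual_information` and the absence of any finite-size correction are choices made for this example, not details taken from the paper.

```python
from collections import Counter
from math import log2


def mutual_information(seq, k):
    """Plug-in estimate (in bits) of the mutual information between
    symbols that are k positions apart in seq."""
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    n = len(pairs)
    joint = Counter(pairs)                  # counts of (a, b) at distance k
    left = Counter(a for a, _ in pairs)     # marginal counts of the first symbol
    right = Counter(b for _, b in pairs)    # marginal counts of the second symbol
    # p_ab / (p_a * p_b) = c * n / (left[a] * right[b])
    return sum(
        (c / n) * log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )


# A strictly periodic sequence: symbols four positions apart are identical.
periodic = "ACGT" * 50
# The same sequence with the symbols relabeled.
relabeled = periodic.translate(str.maketrans("ACGT", "TGCA"))
```

Because only symbol frequencies enter the estimate, `mutual_information(periodic, 4)` and `mutual_information(relabeled, 4)` coincide, whereas a correlation function would first require assigning numbers to A, C, G, and T.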
-
-
-
-
25
-
-
0032900994
-
-
We use all eukaryotic DNA sequences from GenBank release 111 (D. A. Benson, M. S. Boguski, D. J. Lipman, J. Ostell, B. F. Ouellette, B. A. Rapp, and D. L. Wheeler, Nucleic Acids Res. 27, 12 (1999), ftp://ncbi.nlm.nih.gov/genbank/).
-
(1999)
Nucleic Acids Res.
, vol.27
, pp. 12
-
-
Benson, D.A.1
Boguski, M.S.2
Lipman, D.J.3
Ostell, J.4
Ouellette, B.F.5
Rapp, B.A.6
Wheeler, D.L.7
-
26
-
-
85036318758
-
-
There are $4^3 = 64$ codons, 3 of which are stop codons, and 61 of which encode 20 amino acids. Hence, the genetic code is degenerate, i.e., there are (many) amino acids that are encoded by more than one codon. All codons that encode the same amino acid are called synonymous codons.
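The codon counts can be checked with a short sketch. It assumes the standard genetic code, written as the usual 64-letter amino acid string in TCAG order with `*` marking stop codons; the names `CODON_TABLE` and `synonymous` are introduced only for this example.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (first base varies slowest); '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]

# Number of synonymous codons per amino acid (stop codons excluded).
synonymous = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
degenerate = [aa for aa, n in synonymous.items() if n > 1]
```

Under this table, only methionine (M) and tryptophan (W) have a single codon; the remaining 18 amino acids have between two and six synonymous codons.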
-
-
-
-
35
-
-
85036142814
-
-
Mathematically, (Formula presented) can be defined in terms of (Formula presented) as follows: (Formula presented) (Formula presented) and (Formula presented)
-
-
-
-
36
-
-
85036242720
-
-
Since the genetic code is a nonoverlapping triplet code, there are three frames in which a DNA sequence can be translated into an amino acid sequence. In the cell, only one of the three reading frames encodes the proper amino acid, but in our statistical analysis the choice of the reading frame is arbitrary in the sense that (Formula presented) is invariant under shifts of the reading frame.
-
-
-
-
37
-
-
85036145528
-
-
In terms of the mutual information function (Formula presented) for the pseudo-exon model, the average mutual information (Formula presented) can be expressed as (Formula presented)
-
-
-
-
38
-
-
85036395928
-
-
We choose the length to be 54 bp in order to allow a comparison with the standard data set created in Ref. 5, which consists of sequences of length 54 bp.
-
-
-
-
39
-
-
85036231461
-
-
Here, true positives (true negatives) refer to correctly predicted coding (noncoding) sequences, and positives (negatives) refer to all coding (noncoding) sequences. Hence, (Formula presented) (Formula presented) denotes the fraction of correctly predicted coding (noncoding) sequences over all coding (noncoding) sequences. Mathematically, (Formula presented) and (Formula presented) are defined by (Formula presented) and (Formula presented) where θ denotes the Heaviside function, i.e., (Formula presented) for (Formula presented) and (Formula presented) for (Formula presented).
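The two fractions can be sketched as averages of a step function over the scores of each class. The sketch below assumes that coding sequences tend to score above the decision threshold and that θ(x) = 1 for x > 0 and 0 otherwise; since the original formulas are not reproduced here, both the orientation and the value at x = 0 are assumptions, and the function names are hypothetical.

```python
def heaviside(x):
    """Step function; the value at x == 0 is an assumed convention."""
    return 1.0 if x > 0 else 0.0


def true_positive_fraction(coding_scores, threshold):
    """Fraction of coding sequences whose score exceeds the threshold."""
    return sum(heaviside(y - threshold) for y in coding_scores) / len(coding_scores)


def true_negative_fraction(noncoding_scores, threshold):
    """Fraction of noncoding sequences whose score lies below the threshold."""
    return sum(heaviside(threshold - y) for y in noncoding_scores) / len(noncoding_scores)


# Made-up scores for illustration only (not data from the paper).
coding = [0.9, 0.8, 0.4, 0.7]
noncoding = [0.1, 0.3, 0.6, 0.2]
```

With these made-up scores and a threshold of 0.5, three of the four sequences in each class fall on the correct side, so both fractions equal 0.75.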
-
-
-
-
40
-
-
85036266045
-
-
If (Formula presented) and (Formula presented) were identical, (Formula presented) would be equal to 1. If (Formula presented) and (Formula presented) were completely disjoint (non-overlapping), (Formula presented) would be equal to 2.
-
-
-
-
41
-
-
85036294751
-
-
It is clear that (Formula presented) can be computed from sequences of any length N (which does not need to be a multiple of 54 bp). We present the accuracy of (Formula presented) for (Formula presented) (Formula presented) and (Formula presented) because these are the three length scales on which all of the 21 coding measures in Ref. 5 are evaluated.
-
-
-
-
42
-
-
85036213535
-
-
In Figs. 2 and 3 and in Table I we take the logarithm of (Formula presented) because (i) the (Formula presented) distributions have a broad tail (ranging over several orders of magnitude), and (ii) they are sharply peaked at (Formula presented). Consequently, the moments of (Formula presented) are dominated by large values of (Formula presented) and not by the bulk of the distribution. Hence, we display the density and compute the moments of (Formula presented) rather than those of (Formula presented).
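The effect of a broad tail on the moments can be seen in a small sketch with made-up numbers (they are not data from the paper): 97 values sit at the peak and 3 values lie one to four orders of magnitude above it.

```python
from math import log10

# Made-up sample: sharply peaked at 1e-4, with a tail up to 1.0.
values = [1e-4] * 97 + [1e-1, 1e-1, 1.0]

# Mean of the raw values, and the share of the total carried by the 3 tail values.
mean_raw = sum(values) / len(values)
tail_share = sum(sorted(values)[-3:]) / sum(values)

# Mean of the logarithm, which reflects the bulk of the sample.
mean_log = sum(log10(v) for v in values) / len(values)
```

Here the three tail values account for more than 99% of the sum, so the raw mean lies two orders of magnitude above the peak, while the mean of the logarithm stays close to the peak position at log10 = -4.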
-
-
-
-
43
-
-
85036289813
-
-
The mathematical proof can be found in H. Cramér, Mathematical Methods of Statistics (Princeton University Press, Princeton, 1946). An intuitive heuristic argument for why the number of degrees of freedom is equal to 6 is that there are (Formula presented) independent linear constraints that the (Formula presented) numbers (Formula presented) must satisfy. Hence, the number of degrees of freedom is (Formula presented).
-
-
-
-
44
-
-
85036275758
-
-
For the probabilities (Formula presented) we choose the total number of nucleotides (Formula presented) in position m of the biological reading frame divided by the total number of nucleotides from exactly the same set of coding human sequences to which the model sequences are compared.
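The counting procedure can be sketched as follows. The sketch assumes that every input sequence starts at the first position of the biological reading frame and, following the wording above literally, normalizes by the total number of nucleotides, so that the 12 values sum to 1. The function name `positional_frequencies` and the toy input are hypothetical.

```python
from collections import Counter


def positional_frequencies(sequences):
    """Relative frequency of each (nucleotide, frame position) pair,
    with frame positions m = 0, 1, 2 and normalization by the total
    number of nucleotides."""
    counts = Counter()
    total = 0
    for seq in sequences:
        for idx, base in enumerate(seq):
            counts[(base, idx % 3)] += 1  # idx % 3 is the codon position
            total += 1
    return {key: c / total for key, c in counts.items()}


# Toy in-frame input for illustration only.
freqs = positional_frequencies(["ATGGCC", "ATGAAA"])
```

Dividing instead by the number of nucleotides in each frame position would give values three times larger, each position then summing to 1 separately; which of the two conventions the hidden formula uses is not recoverable from the text.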
-
-
-
-
45
-
-
85036290423
-
-
By correlations or inhomogeneities we mean that the probability distributions (Formula presented) are not constant, but vary along the DNA sequence from gene to gene and also within a gene. These variations of the probability distributions (Formula presented) seem to be a typical feature of coding DNA of any living organism.
-
-
-