GTR class documentation
- class treetime.GTR(alphabet='nuc', prof_map=None, logger=None)[source]
Defines General-Time-Reversible model of character evolution.
- __init__(alphabet='nuc', prof_map=None, logger=None)[source]
Initialize empty evolutionary model.
- Parameters:
alphabet (str, numpy.array) –
Alphabet of the sequence. If a string is passed, it is understood as an alphabet name. In this case, the alphabet and its profile map are pulled from
treetime.seq_utils
.If a numpy array of characters is passed, a new alphabet is constructed, and the default profile map is atached to it.
prof_map (dict) – Dictionary linking characters in the sequence to the likelihood of observing characters in the alphabet. This is used to implement ambiguous characters like ‘N’=[1,1,1,1] which are equally likely to be any of the 4 nucleotides. Standard profile_maps are defined in file seq_utils.py.
logger (callable) – Custom logging function that should take arguments (msg, level, warn=False), where msg is a string and level an integer to be compared against verbose.
- static GTR.standard(model, **kwargs)[source]
Create standard model of molecular evolution.
- Parameters:
model (str) – Model to create. See list of available models below
**kwargs – Key word arguments to be passed to the model
Available models
JC69:
Jukes-Cantor 1969 model. This model assumes equal frequencies of the nucleotides and equal transition rates between nucleotide states. For more info, see: Jukes and Cantor (1969). Evolution of Protein Molecules. New York: Academic Press. pp. 21-132. To create this model, use:
mygtr = GTR.standard(model='jc69', mu=<my_mu>, alphabet=<my_alph>)
my_mu
- substitution rate (float)my_alph
- alphabet (str:'nuc'
or'nuc_nogap'
)K80:
Kimura 1980 model. Assumes equal concentrations across nucleotides, but allows different rates between transitions and transversions. The ratio of the transversion/transition rates is given by kappa parameter. For more info, see Kimura (1980), J. Mol. Evol. 16 (2): 111-120. doi:10.1007/BF01731581. Current implementation of the model does not account for the gaps.
mygtr = GTR.standard(model='k80', mu=<my_mu>, kappa=<my_kappa>)
mu
- overall substitution rate (float)kappa
- ratio of transversion/transition rates (float)F81:
Felsenstein 1981 model. Assumes non-equal concentrations across nucleotides, but the transition rate between all states is assumed to be equal. See Felsenstein (1981), J. Mol. Evol. 17 (6): 368-376. doi:10.1007/BF01734359 for details.
mygtr = GTR.standard(model='F81', mu=<mu>, pi=<pi>, alphabet=<alph>)
mu
- substitution rate (float)pi
- : nucleotide concentrations (numpy.array)alphabet' - alphabet to use. (:code:
’nuc’` or'nuc_nogap'
)HKY85:
Hasegawa, Kishino and Yano 1985 model. Allows different concentrations of the nucleotides (as in F81) + distinguishes between transition/transversion substitutions (similar to K80). Link: Hasegawa, Kishino, Yano (1985), J. Mol. Evol. 22 (2): 160-174. doi:10.1007/BF02101694
Current implementation of the model does not account for the gaps
mygtr = GTR.standard(model='HKY85', mu=<mu>, pi=<pi>, kappa=<kappa>)
mu
- substitution rate (float)pi
- : nucleotide concentrations (numpy.array)kappa
- ratio of transversion/transition rates (float)T92:
Tamura 1992 model. Extending Kimura (1980) model for the case where a G+C-content bias exists. Link: Tamura K (1992), Mol. Biol. Evol. 9 (4): 678-687. DOI: 10.1093/oxfordjournals.molbev.a040752
Current implementation of the model does not account for the gaps
mygtr = GTR.standard(model='T92', mu=<mu>, pi_GC=<pi_gc>, kappa=<kappa>)
mu
- substitution rate (float)pi_GC
- : relative GC contentkappa
- ratio of transversion/transition rates (float)TN93:
Tamura and Nei 1993. The model distinguishes between the two different types of transition: (A <-> G) is allowed to have a different rate to (C<->T). Transversions have the same rate. The frequencies of the nucleotides are allowed to be different. Link: Tamura, Nei (1993), MolBiol Evol. 10 (3): 512-526. DOI:10.1093/oxfordjournals.molbev.a040023
mygtr = GTR.standard(model='TN93', mu=<mu>, kappa1=<k1>, kappa2=<k2>)
mu
- substitution rate (float)kappa1
- relative A<–>C, A<–>T, T<–>G and G<–>C rates (float)kappa
- relative C<–>T rate (float)Note
Rate of A<–>G substitution is set to one. All other rates (kappa1, kappa2) are specified relative to this rate
- classmethod GTR.custom(mu=1.0, pi=None, W=None, **kwargs)[source]
Create a GTR model by specifying the matrix explicitly
- Parameters:
mu (float) – Substitution rate
W (nxn matrix) – Substitution matrix
pi (n vector) – Equilibrium frequencies
**kwargs – Key word arguments to be passed
- Keyword Arguments:
alphabet (str) – Specify alphabet when applicable. If the alphabet specification is required, but no alphabet is specified, the nucleotide alphabet will be used as default.
- classmethod GTR.random(mu=1.0, alphabet='nuc', rng=None)[source]
Creates a random GTR model
- Parameters:
mu (float) – Substitution rate
alphabet (str) – Alphabet name (should be standard: ‘nuc’, ‘nuc_gap’, ‘aa’, ‘aa_gap’)
- classmethod GTR.infer(nij, Ti, root_state, fixed_pi=None, pc=1.0, gap_limit=0.01, **kwargs)[source]
Infer a GTR model by specifying the number of transitions and time spent in each character. The basic equation that is being solved is
\(n_{ij} = pi_i W_{ij} T_j\)
where \(n_{ij}\) are the transitions, \(pi_i\) are the equilibrium state frequencies, \(W_{ij}\) is the “substitution attempt matrix”, while \(T_i\) is the time on the tree spent in character state \(i\). To regularize the process, we add pseudocounts and also need to account for the fact that the root of the tree is in a particular state. the modified equation is
\(n_{ij} + pc = pi_i W_{ij} (T_j+pc+root\_state)\)
- Parameters:
nij (nxn matrix) – The number of times a change in character state is observed between state j and i
Ti (n vector) – The time spent in each character state
root_state (n vector) – The number of characters in state i in the sequence of the root node.
pc (float) – Pseudocounts, this determines the lower cutoff on the rate when no substitutions are observed
**kwargs – Key word arguments to be passed
- Keyword Arguments:
alphabet (str) – Specify alphabet when applicable. If the alphabet specification is required, but no alphabet is specified, the nucleotide alphabet will be used as default.
- GTR.assign_rates(mu=1.0, pi=None, W=None)[source]
Overwrite the GTR model given the provided data
- Parameters:
mu (float) – Substitution rate
W (nxn matrix) – Substitution matrix
pi (n vector) – Equilibrium frequencies
Note
GTR object can be modified in-place by calling treetime.GTR.assign_rates()
Sequence manipulation
- GTR.state_pair(seq_p, seq_ch, pattern_multiplicity=None, ignore_gaps=False)[source]
Make a compressed representation of a pair of sequences, only counting the number of times a particular pair of states (e.g. (A,T)) is observed in the aligned sequences of parent and child.
- Parameters:
seq_p (numpy array) – Parent sequence as numpy array of chars
seq_ch (numpy array) – Child sequence as numpy array of chars
pattern_multiplicity (numpy array) – If sequences are reduced by combining identical alignment patterns, these multplicities need to be accounted for when counting the number of mutations across a branch. If None, all pattern are assumed to occur exactly once.
ignore_gap (bool) – Whether or not to include gapped positions of the alignment in the multiplicity count
- Returns:
seq_pair (list) –
[(0,1), (2,2), (3,4)]
list of parent_child state pairs as indices in the alphabetmultiplicity (numpy array) – Number of times a particular pair is observed
Distance and probability computations
- GTR.optimal_t(seq_p, seq_ch, pattern_multiplicity=None, ignore_gaps=False)[source]
Find the optimal distance between the two sequences
- Parameters:
seq_p (character array) – Parent sequence
seq_c (character array) – Child sequence
pattern_multiplicity (numpy array) – If sequences are reduced by combining identical alignment patterns, these multplicities need to be accounted for when counting the number of mutations across a branch. If None, all pattern are assumed to occur exactly once.
ignore_gaps (bool) – If True, ignore gaps in distance calculations
- GTR.optimal_t_compressed(seq_pair, multiplicity, profiles=False, tol=1e-10)[source]
Find the optimal distance between the two sequences represented as state_pairs or as pair of profiles
- Parameters:
seq_pair (state_pair, tuple) – Compressed representation of sequences along a branch, either as tuple of state pairs or as tuple of profiles.
multiplicity (array) –
Number of times each state pair in seq_pair appears (if profile==False)
Number of times an alignment pattern is observed (if profiles==True)
profiles (bool, default False) – The standard branch length optimization assumes fixed sequences at either end of the branch. With profiles==True, optimization is performed while summing over all possible states of the nodes at either end of the branch. Note that the meaning/format of seq_pair and multiplicity depend on the value of :profiles:.
- GTR.prob_t(seq_p, seq_ch, t, pattern_multiplicity=None, return_log=False, ignore_gaps=True)[source]
Compute the probability to observe seq_ch (child sequence) after time t starting from seq_p (parent sequence).
- Parameters:
seq_p (character array) – Parent sequence
seq_c (character array) – Child sequence
t (double) – Time (branch len) separating the profiles.
pattern_multiplicity (numpy array) – If sequences are reduced by combining identical alignment patterns, these multplicities need to be accounted for when counting the number of mutations across a branch. If None, all pattern are assumed to occur exactly once.
return_log (bool) – It True, return log-probability.
- Returns:
prob – Resulting probability
- Return type:
np.array
- GTR.prob_t_compressed(seq_pair, multiplicity, t, return_log=False)[source]
Calculate the probability of observing a sequence pair at a distance t, for compressed sequences
- Parameters:
seq_pair (numpy array) –
np.array([(0,1), (2,2), ()..])
as indicies of pairs of aligned positions. (e.g. ‘A’==0, ‘C’==1 etc). This only lists all occuring parent-child state pairs, order is irrelevantmultiplicity (numpy array) – The number of times a parent-child state pair is observed. This allows compression of the sequence representation
t (float) – Length of the branch separating parent and child
return_log (bool) – Whether or not to exponentiate the result
- GTR.prob_t_profiles(profile_pair, multiplicity, t, return_log=False, ignore_gaps=True)[source]
Calculate the probability of observing a node pair at a distance t
- Parameters:
profile_pair (numpy arrays) – Probability distributions of the nucleotides at either end of the branch. pp[0] = parent, pp[1] = child
multiplicity (numpy array) – The number of times an alignment pattern is observed
t (float) – Length of the branch separating parent and child
ignore_gaps (bool) – If True, ignore mutations to and from gaps in distance calculations
return_log (bool) – Whether or not to exponentiate the result
- GTR.propagate_profile(profile, t, return_log=False)[source]
Compute the probability of the sequence state of the parent at time (t+t0, backwards), given the sequence state of the child (profile) at time t0.
- Parameters:
profile (numpy.array) – Sequence profile. Shape = (L, a), where L - sequence length, a - alphabet size.
t (double) – Time to propagate
return_log (bool) – If True, return log-probability
- Returns:
res – Profile of the sequence after time t in the past. Shape = (L, a), where L - sequence length, a - alphabet size.
- Return type:
np.array
- GTR.sequence_logLH(seq, pattern_multiplicity=None)[source]
Returns the log-likelihood of sampling a sequence from equilibrium frequency. Expects a sequence as numpy array
- Parameters:
seq (numpy array) – Compressed sequence as an array of chars
pattern_multiplicity (numpy_array) – The number of times each position in sequence is observed in the initial alignment. If None, sequence is assumed to be not compressed