# GTR class documentation

class treetime.GTR(alphabet='nuc', prof_map=None, logger=None)[source]

Defines General-Time-Reversible model of character evolution.

__init__(alphabet='nuc', prof_map=None, logger=None)[source]

Initialize empty evolutionary model.

Parameters
• alphabet (str, numpy.array) –

Alphabet of the sequence. If a string is passed, it is understood as an alphabet name. In this case, the alphabet and its profile map are pulled from treetime.seq_utils.

If a numpy array of characters is passed, a new alphabet is constructed, and the default profile map is atached to it.

• prof_map (dict) – Dictionary linking characters in the sequence to the likelihood of observing characters in the alphabet. This is used to implement ambiguous characters like ‘N’=[1,1,1,1] which are equally likely to be any of the 4 nucleotides. Standard profile_maps are defined in file seq_utils.py.

• logger (callable) – Custom logging function that should take arguments (msg, level, warn=False), where msg is a string and level an integer to be compared against verbose.

static GTR.standard(model, **kwargs)[source]

Create standard model of molecular evolution.

Parameters
• model (str) – Model to create. See list of available models below

• **kwargs – Key word arguments to be passed to the model

Available models

• JC69:

Jukes-Cantor 1969 model. This model assumes equal frequencies of the nucleotides and equal transition rates between nucleotide states. For more info, see: Jukes and Cantor (1969). Evolution of Protein Molecules. New York: Academic Press. pp. 21-132. To create this model, use:

mygtr = GTR.standard(model='jc69', mu=<my_mu>, alphabet=<my_alph>)

my_mu - substitution rate (float)

my_alph - alphabet (str: 'nuc' or 'nuc_nogap')

• K80:

Kimura 1980 model. Assumes equal concentrations across nucleotides, but allows different rates between transitions and transversions. The ratio of the transversion/transition rates is given by kappa parameter. For more info, see Kimura (1980), J. Mol. Evol. 16 (2): 111-120. doi:10.1007/BF01731581. Current implementation of the model does not account for the gaps.

mygtr = GTR.standard(model='k80', mu=<my_mu>, kappa=<my_kappa>)

mu - overall substitution rate (float)

kappa - ratio of transversion/transition rates (float)

• F81:

Felsenstein 1981 model. Assumes non-equal concentrations across nucleotides, but the transition rate between all states is assumed to be equal. See Felsenstein (1981), J. Mol. Evol. 17 (6): 368-376. doi:10.1007/BF01734359 for details.

mygtr = GTR.standard(model='F81', mu=<mu>, pi=<pi>, alphabet=<alph>)

mu - substitution rate (float)

pi - : nucleotide concentrations (numpy.array)

alphabet' -  alphabet to use. (:code:’nuc’ or 'nuc_nogap')

• HKY85:

Hasegawa, Kishino and Yano 1985 model. Allows different concentrations of the nucleotides (as in F81) + distinguishes between transition/transversion substitutions (similar to K80). Link: Hasegawa, Kishino, Yano (1985), J. Mol. Evol. 22 (2): 160-174. doi:10.1007/BF02101694

Current implementation of the model does not account for the gaps

mygtr = GTR.standard(model='HKY85', mu=<mu>, pi=<pi>, kappa=<kappa>)

mu - substitution rate (float)

pi - : nucleotide concentrations (numpy.array)

kappa - ratio of transversion/transition rates (float)

• T92:

Tamura 1992 model. Extending Kimura (1980) model for the case where a G+C-content bias exists. Link: Tamura K (1992), Mol. Biol. Evol. 9 (4): 678-687. DOI: 10.1093/oxfordjournals.molbev.a040752

Current implementation of the model does not account for the gaps

mygtr = GTR.standard(model='T92', mu=<mu>, pi_GC=<pi_gc>, kappa=<kappa>)

mu - substitution rate (float)

pi_GC - : relative GC content

kappa - ratio of transversion/transition rates (float)

• TN93:

Tamura and Nei 1993. The model distinguishes between the two different types of transition: (A <-> G) is allowed to have a different rate to (C<->T). Transversions have the same rate. The frequencies of the nucleotides are allowed to be different. Link: Tamura, Nei (1993), MolBiol Evol. 10 (3): 512-526. DOI:10.1093/oxfordjournals.molbev.a040023

mygtr = GTR.standard(model='TN93', mu=<mu>, kappa1=<k1>, kappa2=<k2>)

mu - substitution rate (float)

kappa1 - relative A<–>C, A<–>T, T<–>G and G<–>C rates (float)

kappa - relative C<–>T rate (float)

Note

Rate of A<–>G substitution is set to one. All other rates (kappa1, kappa2) are specified relative to this rate

classmethod GTR.custom(mu=1.0, pi=None, W=None, **kwargs)[source]

Create a GTR model by specifying the matrix explicitly

Parameters
• mu (float) – Substitution rate

• W (nxn matrix) – Substitution matrix

• pi (n vector) – Equilibrium frequencies

• **kwargs – Key word arguments to be passed

Keyword Arguments

alphabet (str) – Specify alphabet when applicable. If the alphabet specification is required, but no alphabet is specified, the nucleotide alphabet will be used as default.

classmethod GTR.random(mu=1.0, alphabet='nuc')[source]

Creates a random GTR model

Parameters
• mu (float) – Substitution rate

• alphabet (str) – Alphabet name (should be standard: ‘nuc’, ‘nuc_gap’, ‘aa’, ‘aa_gap’)

classmethod GTR.infer(nij, Ti, root_state, fixed_pi=None, pc=1.0, gap_limit=0.01, **kwargs)[source]

Infer a GTR model by specifying the number of transitions and time spent in each character. The basic equation that is being solved is

$$n_{ij} = pi_i W_{ij} T_j$$

where $$n_{ij}$$ are the transitions, $$pi_i$$ are the equilibrium state frequencies, $$W_{ij}$$ is the “substitution attempt matrix”, while $$T_i$$ is the time on the tree spent in character state $$i$$. To regularize the process, we add pseudocounts and also need to account for the fact that the root of the tree is in a particular state. the modified equation is

$$n_{ij} + pc = pi_i W_{ij} (T_j+pc+root\_state)$$

Parameters
• nij (nxn matrix) – The number of times a change in character state is observed between state j and i

• Ti (n vector) – The time spent in each character state

• root_state (n vector) – The number of characters in state i in the sequence of the root node.

• pc (float) – Pseudocounts, this determines the lower cutoff on the rate when no substitutions are observed

• **kwargs – Key word arguments to be passed

Keyword Arguments

alphabet (str) – Specify alphabet when applicable. If the alphabet specification is required, but no alphabet is specified, the nucleotide alphabet will be used as default.

GTR.assign_rates(mu=1.0, pi=None, W=None)[source]

Overwrite the GTR model given the provided data

Parameters
• mu (float) – Substitution rate

• W (nxn matrix) – Substitution matrix

• pi (n vector) – Equilibrium frequencies

Note

GTR object can be modified in-place by calling treetime.GTR.assign_rates()

## Sequence manipulation

GTR.state_pair(seq_p, seq_ch, pattern_multiplicity=None, ignore_gaps=False)[source]

Make a compressed representation of a pair of sequences, only counting the number of times a particular pair of states (e.g. (A,T)) is observed in the aligned sequences of parent and child.

Parameters
• seq_p (numpy array) – Parent sequence as numpy array of chars

• seq_ch (numpy array) – Child sequence as numpy array of chars

• pattern_multiplicity (numpy array) – If sequences are reduced by combining identical alignment patterns, these multplicities need to be accounted for when counting the number of mutations across a branch. If None, all pattern are assumed to occur exactly once.

• ignore_gap (bool) – Whether or not to include gapped positions of the alignment in the multiplicity count

Returns

• seq_pair (list) – [(0,1), (2,2), (3,4)] list of parent_child state pairs as indices in the alphabet

• multiplicity (numpy array) – Number of times a particular pair is observed

## Distance and probability computations

GTR.optimal_t(seq_p, seq_ch, pattern_multiplicity=None, ignore_gaps=False)[source]

Find the optimal distance between the two sequences

Parameters
• seq_p (character array) – Parent sequence

• seq_c (character array) – Child sequence

• pattern_multiplicity (numpy array) – If sequences are reduced by combining identical alignment patterns, these multplicities need to be accounted for when counting the number of mutations across a branch. If None, all pattern are assumed to occur exactly once.

• ignore_gaps (bool) – If True, ignore gaps in distance calculations

GTR.optimal_t_compressed(seq_pair, multiplicity, profiles=False, tol=1e-10)[source]

Find the optimal distance between the two sequences represented as state_pairs or as pair of profiles

Parameters
• seq_pair (state_pair, tuple) – Compressed representation of sequences along a branch, either as tuple of state pairs or as tuple of profiles.

• multiplicity (array) –

Number of times each state pair in seq_pair appears (if profile==False)

Number of times an alignment pattern is observed (if profiles==True)

• profiles (bool, default False) – The standard branch length optimization assumes fixed sequences at either end of the branch. With profiles==True, optimization is performed while summing over all possible states of the nodes at either end of the branch. Note that the meaning/format of seq_pair and multiplicity depend on the value of :profiles:.

GTR.prob_t(seq_p, seq_ch, t, pattern_multiplicity=None, return_log=False, ignore_gaps=True)[source]

Compute the probability to observe seq_ch (child sequence) after time t starting from seq_p (parent sequence).

Parameters
• seq_p (character array) – Parent sequence

• seq_c (character array) – Child sequence

• t (double) – Time (branch len) separating the profiles.

• pattern_multiplicity (numpy array) – If sequences are reduced by combining identical alignment patterns, these multplicities need to be accounted for when counting the number of mutations across a branch. If None, all pattern are assumed to occur exactly once.

• return_log (bool) – It True, return log-probability.

Returns

prob – Resulting probability

Return type

np.array

GTR.prob_t_compressed(seq_pair, multiplicity, t, return_log=False)[source]

Calculate the probability of observing a sequence pair at a distance t, for compressed sequences

Parameters
• seq_pair (numpy array) – np.array([(0,1), (2,2), ()..])` as indicies of pairs of aligned positions. (e.g. ‘A’==0, ‘C’==1 etc). This only lists all occuring parent-child state pairs, order is irrelevant

• multiplicity (numpy array) – The number of times a parent-child state pair is observed. This allows compression of the sequence representation

• t (float) – Length of the branch separating parent and child

• return_log (bool) – Whether or not to exponentiate the result

GTR.prob_t_profiles(profile_pair, multiplicity, t, return_log=False, ignore_gaps=True)[source]

Calculate the probability of observing a node pair at a distance t

Parameters
• profile_pair (numpy arrays) – Probability distributions of the nucleotides at either end of the branch. pp[0] = parent, pp[1] = child

• multiplicity (numpy array) – The number of times an alignment pattern is observed

• t (float) – Length of the branch separating parent and child

• ignore_gaps (bool) – If True, ignore mutations to and from gaps in distance calculations

• return_log (bool) – Whether or not to exponentiate the result

GTR.propagate_profile(profile, t, return_log=False)[source]

Compute the probability of the sequence state of the parent at time (t+t0, backwards), given the sequence state of the child (profile) at time t0.

Parameters
• profile (numpy.array) – Sequence profile. Shape = (L, a), where L - sequence length, a - alphabet size.

• t (double) – Time to propagate

• return_log (bool) – If True, return log-probability

Returns

res – Profile of the sequence after time t in the past. Shape = (L, a), where L - sequence length, a - alphabet size.

Return type

np.array

GTR.sequence_logLH(seq, pattern_multiplicity=None)[source]

Returns the log-likelihood of sampling a sequence from equilibrium frequency. Expects a sequence as numpy array

Parameters
• seq (numpy array) – Compressed sequence as an array of chars

• pattern_multiplicity (numpy_array) – The number of times each position in sequence is observed in the initial alignment. If None, sequence is assumed to be not compressed

GTR.expQt(t)[source]
Parameters

t (float) – Time to propagate

Returns

expQt – Matrix exponential of exo(Qt)

Return type

numpy.array