GTR class documentation

class treetime.GTR(alphabet='nuc', prof_map=None, logger=None)[source]

Defines General-Time-Reversible model of character evolution.

__init__(alphabet='nuc', prof_map=None, logger=None)[source]

Initialize empty evolutionary model.

Parameters:
  • alphabet (str, numpy.array) –

    Alphabet of the sequence. If a string is passed, it is understood as an alphabet name. In this case, the alphabet and its profile map are pulled from treetime.seq_utils.

    If a numpy array of characters is passed, a new alphabet is constructed, and the default profile map is atached to it.

  • prof_map (dict) – Dictionary linking characters in the sequence to the likelihood of observing characters in the alphabet. This is used to implement ambiguous characters like ‘N’=[1,1,1,1] which are equally likely to be any of the 4 nucleotides. Standard profile_maps are defined in file seq_utils.py.

  • logger (callable) – Custom logging function that should take arguments (msg, level, warn=False), where msg is a string and level an integer to be compared against verbose.

static GTR.standard(model, **kwargs)[source]

Create standard model of molecular evolution.

Parameters:
  • model (str) – Model to create. See list of available models below

  • **kwargs – Key word arguments to be passed to the model

Available models

  • JC69:

    Jukes-Cantor 1969 model. This model assumes equal frequencies of the nucleotides and equal transition rates between nucleotide states. For more info, see: Jukes and Cantor (1969). Evolution of Protein Molecules. New York: Academic Press. pp. 21-132. To create this model, use:

    mygtr = GTR.standard(model='jc69', mu=<my_mu>, alphabet=<my_alph>)

    my_mu - substitution rate (float)

    my_alph - alphabet (str: 'nuc' or 'nuc_nogap')

  • K80:

    Kimura 1980 model. Assumes equal concentrations across nucleotides, but allows different rates between transitions and transversions. The ratio of the transversion/transition rates is given by kappa parameter. For more info, see Kimura (1980), J. Mol. Evol. 16 (2): 111-120. doi:10.1007/BF01731581. Current implementation of the model does not account for the gaps.

    mygtr = GTR.standard(model='k80', mu=<my_mu>, kappa=<my_kappa>)

    mu - overall substitution rate (float)

    kappa - ratio of transversion/transition rates (float)

  • F81:

    Felsenstein 1981 model. Assumes non-equal concentrations across nucleotides, but the transition rate between all states is assumed to be equal. See Felsenstein (1981), J. Mol. Evol. 17 (6): 368-376. doi:10.1007/BF01734359 for details.

    mygtr = GTR.standard(model='F81', mu=<mu>, pi=<pi>, alphabet=<alph>)

    mu - substitution rate (float)

    pi - : nucleotide concentrations (numpy.array)

    alphabet' -  alphabet to use. (:code:’nuc’` or 'nuc_nogap')

  • HKY85:

    Hasegawa, Kishino and Yano 1985 model. Allows different concentrations of the nucleotides (as in F81) + distinguishes between transition/transversion substitutions (similar to K80). Link: Hasegawa, Kishino, Yano (1985), J. Mol. Evol. 22 (2): 160-174. doi:10.1007/BF02101694

    Current implementation of the model does not account for the gaps

    mygtr = GTR.standard(model='HKY85', mu=<mu>, pi=<pi>, kappa=<kappa>)

    mu - substitution rate (float)

    pi - : nucleotide concentrations (numpy.array)

    kappa - ratio of transversion/transition rates (float)

  • T92:

    Tamura 1992 model. Extending Kimura (1980) model for the case where a G+C-content bias exists. Link: Tamura K (1992), Mol. Biol. Evol. 9 (4): 678-687. DOI: 10.1093/oxfordjournals.molbev.a040752

    Current implementation of the model does not account for the gaps

    mygtr = GTR.standard(model='T92', mu=<mu>, pi_GC=<pi_gc>, kappa=<kappa>)

    mu - substitution rate (float)

    pi_GC - : relative GC content

    kappa - ratio of transversion/transition rates (float)

  • TN93:

    Tamura and Nei 1993. The model distinguishes between the two different types of transition: (A <-> G) is allowed to have a different rate to (C<->T). Transversions have the same rate. The frequencies of the nucleotides are allowed to be different. Link: Tamura, Nei (1993), MolBiol Evol. 10 (3): 512-526. DOI:10.1093/oxfordjournals.molbev.a040023

    mygtr = GTR.standard(model='TN93', mu=<mu>, kappa1=<k1>, kappa2=<k2>)

    mu - substitution rate (float)

    kappa1 - relative A<–>C, A<–>T, T<–>G and G<–>C rates (float)

    kappa - relative C<–>T rate (float)

    Note

    Rate of A<–>G substitution is set to one. All other rates (kappa1, kappa2) are specified relative to this rate

classmethod GTR.custom(mu=1.0, pi=None, W=None, **kwargs)[source]

Create a GTR model by specifying the matrix explicitly

Parameters:
  • mu (float) – Substitution rate

  • W (nxn matrix) – Substitution matrix

  • pi (n vector) – Equilibrium frequencies

  • **kwargs – Key word arguments to be passed

Keyword Arguments:

alphabet (str) – Specify alphabet when applicable. If the alphabet specification is required, but no alphabet is specified, the nucleotide alphabet will be used as default.

classmethod GTR.random(mu=1.0, alphabet='nuc', rng=None)[source]

Creates a random GTR model

Parameters:
  • mu (float) – Substitution rate

  • alphabet (str) – Alphabet name (should be standard: ‘nuc’, ‘nuc_gap’, ‘aa’, ‘aa_gap’)

classmethod GTR.infer(nij, Ti, root_state, fixed_pi=None, pc=1.0, gap_limit=0.01, **kwargs)[source]

Infer a GTR model by specifying the number of transitions and time spent in each character. The basic equation that is being solved is

\(n_{ij} = pi_i W_{ij} T_j\)

where \(n_{ij}\) are the transitions, \(pi_i\) are the equilibrium state frequencies, \(W_{ij}\) is the “substitution attempt matrix”, while \(T_i\) is the time on the tree spent in character state \(i\). To regularize the process, we add pseudocounts and also need to account for the fact that the root of the tree is in a particular state. the modified equation is

\(n_{ij} + pc = pi_i W_{ij} (T_j+pc+root\_state)\)

Parameters:
  • nij (nxn matrix) – The number of times a change in character state is observed between state j and i

  • Ti (n vector) – The time spent in each character state

  • root_state (n vector) – The number of characters in state i in the sequence of the root node.

  • pc (float) – Pseudocounts, this determines the lower cutoff on the rate when no substitutions are observed

  • **kwargs – Key word arguments to be passed

Keyword Arguments:

alphabet (str) – Specify alphabet when applicable. If the alphabet specification is required, but no alphabet is specified, the nucleotide alphabet will be used as default.

GTR.assign_rates(mu=1.0, pi=None, W=None)[source]

Overwrite the GTR model given the provided data

Parameters:
  • mu (float) – Substitution rate

  • W (nxn matrix) – Substitution matrix

  • pi (n vector) – Equilibrium frequencies

Note

GTR object can be modified in-place by calling treetime.GTR.assign_rates()

Sequence manipulation

GTR.state_pair(seq_p, seq_ch, pattern_multiplicity=None, ignore_gaps=False)[source]

Make a compressed representation of a pair of sequences, only counting the number of times a particular pair of states (e.g. (A,T)) is observed in the aligned sequences of parent and child.

Parameters:
  • seq_p (numpy array) – Parent sequence as numpy array of chars

  • seq_ch (numpy array) – Child sequence as numpy array of chars

  • pattern_multiplicity (numpy array) – If sequences are reduced by combining identical alignment patterns, these multplicities need to be accounted for when counting the number of mutations across a branch. If None, all pattern are assumed to occur exactly once.

  • ignore_gap (bool) – Whether or not to include gapped positions of the alignment in the multiplicity count

Returns:

  • seq_pair (list) – [(0,1), (2,2), (3,4)] list of parent_child state pairs as indices in the alphabet

  • multiplicity (numpy array) – Number of times a particular pair is observed

Distance and probability computations

GTR.optimal_t(seq_p, seq_ch, pattern_multiplicity=None, ignore_gaps=False)[source]

Find the optimal distance between the two sequences

Parameters:
  • seq_p (character array) – Parent sequence

  • seq_c (character array) – Child sequence

  • pattern_multiplicity (numpy array) – If sequences are reduced by combining identical alignment patterns, these multplicities need to be accounted for when counting the number of mutations across a branch. If None, all pattern are assumed to occur exactly once.

  • ignore_gaps (bool) – If True, ignore gaps in distance calculations

GTR.optimal_t_compressed(seq_pair, multiplicity, profiles=False, tol=1e-10)[source]

Find the optimal distance between the two sequences represented as state_pairs or as pair of profiles

Parameters:
  • seq_pair (state_pair, tuple) – Compressed representation of sequences along a branch, either as tuple of state pairs or as tuple of profiles.

  • multiplicity (array) –

    Number of times each state pair in seq_pair appears (if profile==False)

    Number of times an alignment pattern is observed (if profiles==True)

  • profiles (bool, default False) – The standard branch length optimization assumes fixed sequences at either end of the branch. With profiles==True, optimization is performed while summing over all possible states of the nodes at either end of the branch. Note that the meaning/format of seq_pair and multiplicity depend on the value of :profiles:.

GTR.prob_t(seq_p, seq_ch, t, pattern_multiplicity=None, return_log=False, ignore_gaps=True)[source]

Compute the probability to observe seq_ch (child sequence) after time t starting from seq_p (parent sequence).

Parameters:
  • seq_p (character array) – Parent sequence

  • seq_c (character array) – Child sequence

  • t (double) – Time (branch len) separating the profiles.

  • pattern_multiplicity (numpy array) – If sequences are reduced by combining identical alignment patterns, these multplicities need to be accounted for when counting the number of mutations across a branch. If None, all pattern are assumed to occur exactly once.

  • return_log (bool) – It True, return log-probability.

Returns:

prob – Resulting probability

Return type:

np.array

GTR.prob_t_compressed(seq_pair, multiplicity, t, return_log=False)[source]

Calculate the probability of observing a sequence pair at a distance t, for compressed sequences

Parameters:
  • seq_pair (numpy array) – np.array([(0,1), (2,2), ()..]) as indicies of pairs of aligned positions. (e.g. ‘A’==0, ‘C’==1 etc). This only lists all occuring parent-child state pairs, order is irrelevant

  • multiplicity (numpy array) – The number of times a parent-child state pair is observed. This allows compression of the sequence representation

  • t (float) – Length of the branch separating parent and child

  • return_log (bool) – Whether or not to exponentiate the result

GTR.prob_t_profiles(profile_pair, multiplicity, t, return_log=False, ignore_gaps=True)[source]

Calculate the probability of observing a node pair at a distance t

Parameters:
  • profile_pair (numpy arrays) – Probability distributions of the nucleotides at either end of the branch. pp[0] = parent, pp[1] = child

  • multiplicity (numpy array) – The number of times an alignment pattern is observed

  • t (float) – Length of the branch separating parent and child

  • ignore_gaps (bool) – If True, ignore mutations to and from gaps in distance calculations

  • return_log (bool) – Whether or not to exponentiate the result

GTR.propagate_profile(profile, t, return_log=False)[source]

Compute the probability of the sequence state of the parent at time (t+t0, backwards), given the sequence state of the child (profile) at time t0.

Parameters:
  • profile (numpy.array) – Sequence profile. Shape = (L, a), where L - sequence length, a - alphabet size.

  • t (double) – Time to propagate

  • return_log (bool) – If True, return log-probability

Returns:

res – Profile of the sequence after time t in the past. Shape = (L, a), where L - sequence length, a - alphabet size.

Return type:

np.array

GTR.sequence_logLH(seq, pattern_multiplicity=None)[source]

Returns the log-likelihood of sampling a sequence from equilibrium frequency. Expects a sequence as numpy array

Parameters:
  • seq (numpy array) – Compressed sequence as an array of chars

  • pattern_multiplicity (numpy_array) – The number of times each position in sequence is observed in the initial alignment. If None, sequence is assumed to be not compressed

GTR.expQt(t)[source]
Parameters:

t (float) – Time to propagate

Returns:

expQt – Matrix exponential of exo(Qt)

Return type:

numpy.array