TreeAnc class documentation

This is the core class of the TreeTime module. It stores the phylogenetic tree and implements the basic algorithms for sequence manipulation, sequence reconstruction, and branch length optimization.

The tree is stored as Bio.Phylo object. In order to facilitate the tree operations, each node of the tree is decorated with additional attributes which are set during the tree preparation. These attributes need to be updated after tree modifications. The sequences are also attached to the tree nodes. In order to save memory, the sequences are stored in the compressed form. The TreeAnc class implements methods to compress and decompress sequences.

The main purpose of the TreeAnc class is to implement standard algorithms for ancestral sequence reconstruction. Both marginal and joint maximum likelihood reconstructions are possible. The marginal reconstructions computes the entire distribution of the states at a given node after tracing out states at all other nodes.

The example scripts illustrate how to instantiate TreeAnc objects.

TreeAnc Constructor

class treetime.TreeAnc(tree=None, aln=None, gtr=None, fill_overhangs=True, ref=None, verbose=3, ignore_gaps=True, convert_upper=True, seq_multiplicity=None, log=None, compress=True, seq_len=None, ignore_missing_alns=False, keep_node_order=False, rng_seed=None, **kwargs)[source]

Class defines simple tree object with basic interface methods: reading and saving from/to files, initializing leaves with sequences from the alignment, making ancestral state inference

__init__(tree=None, aln=None, gtr=None, fill_overhangs=True, ref=None, verbose=3, ignore_gaps=True, convert_upper=True, seq_multiplicity=None, log=None, compress=True, seq_len=None, ignore_missing_alns=False, keep_node_order=False, rng_seed=None, **kwargs)[source]

TreeAnc constructor. It prepares the tree, attaches sequences to the leaf nodes, and sets some configuration parameters.

Parameters:
  • tree (str, Bio.Phylo.Tree) – Phylogenetic tree. String passed is interpreted as a filename with a tree in a standard format that can be parsed by the Biopython Phylo module. Branch length should be in units of average number of nucleotide or protein substitutions per site. Use on trees with longer branches (>4) is not recommended.

  • aln (str, Bio.Align.MultipleSequenceAlignment, dict) – Sequence alignment. If a string passed, it is interpreted as the filename to read Biopython alignment from. If a dict is given, this is assumed to be the output of vcf_utils.read_vcf which specifies for each sequence the differences from a reference

  • gtr (str, GTR) – GTR model object. If string passed, it is interpreted as the type of the GTR model. A new GTR instance will be created for this type.

  • fill_overhangs (bool, default True) – In some cases, the missing data on both ends of the alignment is filled with the gap sign(‘-‘). If set to True, the end-gaps are converted to “unknown” characters (‘N’ for nucleotides, ‘X’ for aminoacids). Otherwise, the alignment is treated as-is

  • ref (None, optional) – Reference sequence used in VCF mode

  • verbose (int, default 3) – Verbosity level as number from 0 (lowest) to 10 (highest).

  • ignore_gaps (bool, default True) – Ignore gaps in branch length calculations

  • convert_upper (bool, default True) – Convert all sequences to upper case

  • seq_multiplicity (dict) – If individual nodes in the tree correspond to multiple sampled sequences (i.e. read count in a deep sequencing experiment), these can be specified as a dictionary. This currently only affects rooting and can be used to weigh individual tips by abundance or important during root search.

  • compress (bool, default True) – reduce identical alignment columns to one (not useful when inferring site specific GTR models).

  • seq_len (int, optional) – length of the sequence. this is inferred from the input alignment or the reference sequence in most cases but can be specified for other applications.

  • ignore_missing_alns (bool, default False) –

  • **kwargs

    Keyword arguments to construct the GTR model

    Note

    Some GTR types require additional configuration parameters. If the new GTR is being instantiated, these parameters are expected to be passed as kwargs. If nothing is passed, the default values are used, which might cause unexpected results.

Raises:

AttributeError – If no tree is passed in

TreeAnc methods

Basic functions, utilities, properties

TreeAnc.prepare_tree()[source]

Set link to parent and calculate distance to root for all tree nodes. Should be run once the tree is read and after every rerooting, topology change or branch length optimizations.

TreeAnc.prune_short_branches()[source]

If the branch length is less than the minimal value, remove the branch from the tree. Requires ancestral sequence reconstruction

TreeAnc.set_gtr(in_gtr, **kwargs)[source]

Create new GTR model if needed, and set the model as an attribute of the TreeAnc class

Parameters:
  • in_gtr (str, GTR) – The gtr model to be assigned. If string is passed, it is taken as the name of a standard GTR model, and is attempted to be created through GTR.standard() interface. If a GTR instance is passed, it is set directly .

  • **kwargs – Keyword arguments to construct the GTR model. If none are passed, defaults are assumed.

TreeAnc.logger(msg, level, warn=False, only_once=False)[source]

Print log message msg to stdout.

Parameters:
  • msg (str) – String to print on the screen

  • level (int) – Log-level. Only the messages with a level higher than the current verbose level will be shown.

  • warn (bool) – Warning flag. If True, the message will be displayed regardless of its log-level.

TreeAnc.aln()
Setter:

Sets the alignment

Getter:

Returns the alignment

TreeAnc.gtr()
Setter:

Sets the GTR object passed in

Getter:

Returns the current GTR object

TreeAnc.tree()

The phylogenetic tree currently used by the TreeAnc.

Setter:

Sets the tree. Directly if passed as Phylo.Tree, or by reading from file if passed as a str.

Getter:

Returns the tree as a Phylo.Tree object

TreeAnc.leaves_lookup()

The {leaf-name:leaf-node} dictionary. It enables fast search of a tree leaf object by its name.

Ancestral reconstruction and tree optimization

TreeAnc.infer_ancestral_sequences(method='probabilistic', infer_gtr=False, marginal=False, reconstruct_tip_states=False, **kwargs)[source]

Reconstruct ancestral sequences

Parameters:
  • method (str) – Method to use. Supported values are “parsimony”, “fitch”, “probabilistic” and “ml”

  • infer_gtr (bool) – Infer a GTR model before reconstructing the sequences

  • marginal (bool) – Assign sequences that are most likely after averaging over all other nodes instead of the jointly most likely sequences.

  • reconstruct_tip_states (bool, optional) – Reconstruct sequences of terminal nodes/leaves, thereby replacing ambiguous characters with the inferred base/state. default: False

  • **kwargs – additional keyword arguments that are passed down to TreeAnc.infer_gtr() and TreeAnc._ml_anc()

Returns:

N_diff – Number of nucleotides different from the previous reconstruction. If there were no pre-set sequences, returns N*L

Return type:

int

TreeAnc.sequence_LH(pos=None, full_sequence=False)[source]

return the likelihood of the observed sequences given the tree

Parameters:
  • pos (int, optional) – position in the sequence, if none, the sum over all positions will be returned

  • full_sequence (bool, optional) – does the position refer to the full or compressed sequence, by default compressed sequence is assumed.

Returns:

likelihood

Return type:

float

TreeAnc.optimize_tree(prune_short=True, marginal_sequences=False, branch_length_mode='joint', max_iter=5, infer_gtr=False, pc=1.0, method_anc='probabilistic', **kwargs)[source]

Iteratively set branch lengths and reconstruct ancestral sequences until the values of either former or latter do not change. The algorithm assumes knowing only the topology of the tree, and requires that sequences are assigned to all leaves of the tree.

The first step is to pre-reconstruct ancestral states using Fitch reconstruction algorithm or ML using existing branch length estimates. Then, optimize branch lengths and re-do reconstruction until convergence using ML method.

Parameters:
  • prune_short (bool) – If True, the branches with zero optimal length will be pruned from the tree, creating polytomies. The polytomies could be further processed using treetime.TreeTime.resolve_polytomies() from the TreeTime class.

  • marginal_sequences (bool) – Assign sequences to their marginally most likely value, rather than the values that are jointly most likely across all nodes.

  • branch_length_mode (str) – ‘joint’, ‘marginal’, or ‘input’. Branch lengths are left unchanged in case of ‘input’. ‘joint’ and ‘marginal’ cause branch length optimization while setting sequences to the ML value or tracing over all possible internal sequence states.

  • max_iter (int) – Maximal number of times sequence and branch length iteration are optimized

  • infer_gtr (bool) – Infer a GTR model from the observed substitutions.

  • method_anc (str) – Which method should be used to reconstruct ancestral sequences. Supported values are “parsimony”, “fitch”, “probabilistic” and “ml”

TreeAnc.infer_gtr(marginal=False, site_specific=False, normalized_rate=True, fixed_pi=None, pc=5.0, **kwargs)[source]

Calculates a GTR model given the multiple sequence alignment and the tree. It performs ancestral sequence inferrence (joint or marginal), followed by the branch lengths optimization. Then, the numbers of mutations are counted in the optimal tree and related to the time within the mutation happened. From these statistics, the relative state transition probabilities are inferred, and the transition matrix is computed.

The result is used to construct the new GTR model of type ‘custom’. The model is assigned to the TreeAnc and is used in subsequent analysis.

Parameters:
  • print_raw (bool) – If True, print the inferred GTR model

  • marginal (bool) – If True, use marginal sequence reconstruction

  • normalized_rate (bool) – If True, sets the mutation rate prefactor to 1.0.

  • fixed_pi (np.array) – Provide the equilibrium character concentrations. If None is passed, the concentrations will be inferred from the alignment.

  • pc (float) – Number of pseudo counts to use in gtr inference

Returns:

gtr – The inferred GTR model

Return type:

GTR

TreeAnc.get_tree_dict(keep_var_ambigs=False)[source]