TreeAnc class documentation

This is the core class of the TreeTime module. It stores the phylogenetic tree and implements the basic algorithms for sequence manipulation, sequence reconstruction, and branch length optimization.

The tree is stored as Bio.Phylo object. In order to facilitate the tree operations, each node of the tree is decorated with additional attributes which are set during the tree preparation. These attributes need to be updated after tree modifications. The sequences are also attached to the tree nodes. In order to save memory, the sequences are stored in the compressed form. The TreeAnc class implements methods to compress and decompress sequences.

The main purpose of the TreeAnc class is to implement standard algorithms for ancestral sequence reconstruction. Both marginal and joint maximum likelihood reconstructions are possible. The marginal reconstructions computes the entire distribution of the states at a given node after tracing out states at all other nodes.

The example scripts illustrate how to instantiate TreeAnc objects.

TreeAnc Constructor

class treetime.TreeAnc(tree=None, aln=None, gtr=None, fill_overhangs=True, ref=None, verbose=3, ignore_gaps=True, convert_upper=True, seq_multiplicity=None, log=None, compress=True, seq_len=None, **kwargs)[source]

Class defines simple tree object with basic interface methods: reading and saving from/to files, initializing leaves with sequences from the alignment, making ancestral state inference

__init__(tree=None, aln=None, gtr=None, fill_overhangs=True, ref=None, verbose=3, ignore_gaps=True, convert_upper=True, seq_multiplicity=None, log=None, compress=True, seq_len=None, **kwargs)[source]

TreeAnc constructor. It prepares the tree, attaches sequences to the leaf nodes, and sets some configuration parameters.

Parameters
  • tree (str, Bio.Phylo.Tree) – Phylogenetic tree. String passed is interpreted as a filename with a tree in a standard format that can be parsed by the Biopython Phylo module.

  • aln (str, Bio.Align.MultipleSequenceAlignment, dict) – Sequence alignment. If a string passed, it is interpreted as the filename to read Biopython alignment from. If a dict is given, this is assumed to be the output of vcf_utils.read_vcf which specifies for each sequence the differences from a reference

  • gtr (str, GTR) – GTR model object. If string passed, it is interpreted as the type of the GTR model. A new GTR instance will be created for this type.

  • fill_overhangs (bool) – In some cases, the missing data on both ends of the alignment is filled with the gap sign(‘-‘). If set to True, the end-gaps are converted to “unknown” characters (‘N’ for nucleotides, ‘X’ for aminoacids). Otherwise, the alignment is treated as-is

  • ref (None, optional) – Reference sequence used in VCF mode

  • verbose (int) – Verbosity level as number from 0 (lowest) to 10 (highest).

  • ignore_gaps (bool) – Ignore gaps in branch length calculations

  • convert_upper (bool, optional) – Description

  • seq_multiplicity (dict) – If individual nodes in the tree correspond to multiple sampled sequences (i.e. read count in a deep sequencing experiment), these can be specified as a dictionary. This currently only affects rooting and can be used to weigh individual tips by abundance or important during root search.

  • compress (bool, optional) – reduce identical alignment columns to one (not useful when inferring site specific GTR models).

  • seq_len (int, optional) – length of the sequence. this is inferred from the input alignment or the reference sequence in most cases but can be specified for other applications.

  • **kwargs

    Keyword arguments to construct the GTR model

    Note

    Some GTR types require additional configuration parameters. If the new GTR is being instantiated, these parameters are expected to be passed as kwargs. If nothing is passed, the default values are used, which might cause unexpected results.

Raises

AttributeError – If no tree is passed in

TreeAnc methods

Basic functions, utilities, properties

TreeAnc.prepare_tree()[source]

Set link to parent and calculate distance to root for all tree nodes. Should be run once the tree is read and after every rerooting, topology change or branch length optimizations.

TreeAnc.prune_short_branches()[source]

If the branch length is less than the minimal value, remove the branch from the tree. Requires ancestral sequence reconstruction

TreeAnc.set_gtr(in_gtr, **kwargs)[source]

Create new GTR model if needed, and set the model as an attribute of the TreeAnc class

Parameters
  • in_gtr (str, GTR) – The gtr model to be assigned. If string is passed, it is taken as the name of a standard GTR model, and is attempted to be created through GTR.standard() interface. If a GTR instance is passed, it is set directly .

  • **kwargs – Keyword arguments to construct the GTR model. If none are passed, defaults are assumed.

TreeAnc.logger(msg, level, warn=False, only_once=False)[source]

Print log message msg to stdout.

Parameters
  • msg (str) – String to print on the screen

  • level (int) – Log-level. Only the messages with a level higher than the current verbose level will be shown.

  • warn (bool) – Warning flag. If True, the message will be displayed regardless of its log-level.

TreeAnc.aln()
Setter

Sets the alignment

Getter

Returns the alignment

TreeAnc.gtr()
Setter

Sets the GTR object passed in

Getter

Returns the current GTR object

TreeAnc.tree()

The phylogenetic tree currently used by the TreeAnc.

Setter

Sets the tree. Directly if passed as Phylo.Tree, or by reading from file if passed as a str.

Getter

Returns the tree as a Phylo.Tree object

TreeAnc.leaves_lookup()

The {leaf-name:leaf-node} dictionary. It enables fast search of a tree leaf object by its name.

Ancestral reconstruction and tree optimization

TreeAnc.infer_ancestral_sequences(method='probabilistic', infer_gtr=False, marginal=False, reconstruct_tip_states=False, **kwargs)[source]

Reconstruct ancestral sequences

Parameters
  • method (str) – Method to use. Supported values are “parsimony”, “fitch”, “probabilistic” and “ml”

  • infer_gtr (bool) – Infer a GTR model before reconstructing the sequences

  • marginal (bool) – Assign sequences that are most likely after averaging over all other nodes instead of the jointly most likely sequences.

  • reconstruct_tip_states (bool, optional) – Reconstruct sequences of terminal nodes/leaves, thereby replacing ambiguous characters with the inferred base/state. default: False

  • **kwargs – additional keyword arguments that are passed down to TreeAnc.infer_gtr() and TreeAnc._ml_anc()

Returns

N_diff – Number of nucleotides different from the previous reconstruction. If there were no pre-set sequences, returns N*L

Return type

int

TreeAnc.sequence_LH(pos=None, full_sequence=False)[source]

return the likelihood of the observed sequences given the tree

Parameters
  • pos (int, optional) – position in the sequence, if none, the sum over all positions will be returned

  • full_sequence (bool, optional) – does the position refer to the full or compressed sequence, by default compressed sequence is assumed.

Returns

likelihood

Return type

float

TreeAnc.optimize_tree(prune_short=True, marginal_sequences=False, branch_length_mode='joint', max_iter=5, infer_gtr=False, pc=1.0, **kwargs)[source]

Iteratively set branch lengths and reconstruct ancestral sequences until the values of either former or latter do not change. The algorithm assumes knowing only the topology of the tree, and requires that sequences are assigned to all leaves of the tree.

The first step is to pre-reconstruct ancestral states using Fitch reconstruction algorithm or ML using existing branch length estimates. Then, optimize branch lengths and re-do reconstruction until convergence using ML method.

Parameters
  • prune_short (bool) – If True, the branches with zero optimal length will be pruned from the tree, creating polytomies. The polytomies could be further processed using treetime.TreeTime.resolve_polytomies() from the TreeTime class.

  • marginal_sequences (bool) – Assign sequences to their marginally most likely value, rather than the values that are jointly most likely across all nodes.

  • branch_length_mode (str) – ‘joint’, ‘marginal’, or ‘input’. Branch lengths are left unchanged in case of ‘input’. ‘joint’ and ‘marginal’ cause branch length optimization while setting sequences to the ML value or tracing over all possible internal sequence states.

  • max_iter (int) – Maximal number of times sequence and branch length iteration are optimized

  • infer_gtr (bool) – Infer a GTR model from the observed substitutions.

TreeAnc.infer_gtr(marginal=False, site_specific=False, normalized_rate=True, fixed_pi=None, pc=5.0, **kwargs)[source]

Calculates a GTR model given the multiple sequence alignment and the tree. It performs ancestral sequence inferrence (joint or marginal), followed by the branch lengths optimization. Then, the numbers of mutations are counted in the optimal tree and related to the time within the mutation happened. From these statistics, the relative state transition probabilities are inferred, and the transition matrix is computed.

The result is used to construct the new GTR model of type ‘custom’. The model is assigned to the TreeAnc and is used in subsequent analysis.

Parameters
  • print_raw (bool) – If True, print the inferred GTR model

  • marginal (bool) – If True, use marginal sequence reconstruction

  • normalized_rate (bool) – If True, sets the mutation rate prefactor to 1.0.

  • fixed_pi (np.array) – Provide the equilibrium character concentrations. If None is passed, the concentrations will be inferred from the alignment.

  • pc (float) – Number of pseudo counts to use in gtr inference

Returns

gtr – The inferred GTR model

Return type

GTR

TreeAnc.get_tree_dict(keep_var_ambigs=False)[source]