VCF Utilities

treetime.vcf_utils.read_vcf(vcf_file, ref_file=None)

Reads in a vcf/vcf.gz file and associated reference sequence fasta (to which the VCF file is mapped).

Parses mutations, insertions, and deletions and stores them in a nested dict; see ‘Returns’ for the dict structure.

Heterozygous calls (0/1, 0/2, etc.) and no-calls (./.) are replaced with Ns at the associated sites.

Positions are stored so that they correspond to locations in the reference sequence as indexed in Python (numbering is transformed to start at 0).
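The two conventions above (ambiguous calls become N, positions become 0-based) can be illustrated with a minimal sketch. The helper names vcf_call_to_base and vcf_pos_to_python are hypothetical and are not part of treetime. Treating phased calls (‘|’) the same way as unphased ones is an assumption made here for illustration only.

```python
def vcf_call_to_base(genotype, ref, alts):
    """Map one VCF genotype string to a base, or 'N' when ambiguous.

    Hypothetical helper for illustration; not part of treetime.
    """
    # Assumption: phased ('|') and unphased ('/') separators are treated alike.
    alleles = genotype.replace("|", "/").split("/")
    # No-calls ('./.' or '.') and heterozygous calls (e.g. '0/1') become 'N'.
    if "." in alleles or len(set(alleles)) != 1:
        return "N"
    index = int(alleles[0])
    # Allele 0 is the reference base; allele k >= 1 is the k-th ALT entry.
    return ref if index == 0 else alts[index - 1]


def vcf_pos_to_python(pos):
    """Convert a 1-based VCF POS value to a 0-based Python index."""
    return pos - 1
```

Under these assumptions, a call of 1/1 with REF ‘A’ and ALT ‘G’ yields ‘G’, while 0/1 and ./. both yield ‘N’.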

Parameters:
  • vcf_file (string) – Path to the vcf or vcf.gz file to be read in

  • ref_file (string, optional) – Path to the fasta reference file to be read in

Returns:

compress_seq

In the format:

{
'reference':'AGCTCGA..A',
'sequences': { 'seq1':{4:'A', 7:'-'}, 'seq2':{100:'C'} },
'insertions': { 'seq1':{4:'ATT'}, 'seq3':{1:'TT', 10:'CAG'} },
'positions': [1,4,7,10,100...]
}
reference (string)

String of the reference sequence read from the Fasta, to which the variable sites are mapped

sequences (nested dict)

Dict with sequence names as keys, each mapping to a dict that has positions as keys and the single-base mutation (or deletion) at each position as values.

insertions (nested dict)

Dict in the same format as the above, which stores insertions and their locations. The first base of the insertion is the same as whatever is currently in that position (Ref if no mutation, mutation in ‘sequences’ otherwise), so the current base can be directly replaced by the bases held here.

positions (list)

Python list of all positions with a mutation, insertion, or deletion.

Return type:

nested dict
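As a rough illustration of how the returned structure can be consumed, the sketch below rebuilds one full sequence from the reference and the per-sequence dicts. The helper expand_sequence and the short example dict are hypothetical and are not part of treetime. In practice the dict would come from a call such as read_vcf(vcf_file, ref_file).

```python
def expand_sequence(compress_seq, name, with_insertions=False):
    """Rebuild one full sequence from the nested dict (hypothetical helper)."""
    bases = list(compress_seq["reference"])
    # Overwrite reference bases with this sequence's mutations and deletions ('-').
    for pos, base in compress_seq["sequences"].get(name, {}).items():
        bases[pos] = base
    if with_insertions:
        # Each insertion string starts with the base currently at that position,
        # so it can replace that single base directly.
        for pos, inserted in compress_seq["insertions"].get(name, {}).items():
            bases[pos] = inserted
    return "".join(bases)


# Made-up data in the documented format, using 0-based positions.
example = {
    "reference": "AGCTCGAT",
    "sequences": {"seq1": {4: "A", 7: "-"}},
    "insertions": {"seq1": {4: "ATT"}},
    "positions": [4, 7],
}

aligned = expand_sequence(example, "seq1")
with_ins = expand_sequence(example, "seq1", with_insertions=True)
print(aligned)   # AGCTAGA-
print(with_ins)  # AGCTATTGA-
```

Because insertions are written into a list of per-position strings, applying them does not shift the indices of later positions. A sequence name that is absent from ‘sequences’ is treated here as identical to the reference.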

treetime.vcf_utils.write_vcf(tree_dict, file_name)

Writes out a VCF-style file of the alignment (which appears to be readable, at least minimally, by vcftools and pyvcf). The file is created from a dict in a format similar to the one created by treetime.vcf_utils.read_vcf().

Positions of variable sites are transformed to start at 1 to match VCF convention.

Parameters:
  • tree_dict (nested dict) – A nested dict with keys ‘sequences’, ‘reference’, and ‘positions’, as created by treetime.TreeAnc.get_tree_dict()

  • file_name (str) – File to which the new VCF should be written out. File names ending with ‘.gz’ will result in the VCF automatically being gzipped.
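The two behaviours described for write_vcf (1-based positions and gzip output chosen by file extension) might look roughly like the sketch below. It is not treetime's implementation: open_vcf_for_writing and to_vcf_positions are hypothetical helpers, and the single header line written is only a stand-in for real VCF content.

```python
import gzip
import os
import tempfile


def open_vcf_for_writing(file_name):
    """Open file_name for text output, gzipped if it ends with '.gz'.

    Hypothetical helper for illustration; not part of treetime.
    """
    if file_name.endswith(".gz"):
        return gzip.open(file_name, "wt")
    return open(file_name, "w")


def to_vcf_positions(positions):
    """Shift 0-based Python positions to 1-based VCF POS values."""
    return [pos + 1 for pos in positions]


# Round trip through a temporary gzipped file to show the extension check.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.vcf.gz")
    with open_vcf_for_writing(path) as handle:
        handle.write("##fileformat=VCFv4.2\n")
    with gzip.open(path, "rt") as handle:
        first_line = handle.readline()

vcf_positions = to_vcf_positions([0, 3, 99])
```

Choosing the opener from the file name keeps the calling code identical for plain and gzipped output, which matches the behaviour described for file_name above.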