Question

Building gene trees

1

Entering edit mode

5.8 years ago

Colaptes ▴ 110

Hi, I have low-depth genome sequencing data from several species in a family and I would like to build gene trees as input for analyses with programs like BUCKy and Astral. (I want to estimate a species tree, and assess evidence for introgression between nonsister lineages). I have mapped the reads to the closest outgroup available and produced VCF files for each scaffold.

I am wondering whether "gene trees" need to be real genes, or whether it is acceptable to simply cut the VCF file into small pieces and build "gene" trees of these pieces? It seems like most papers use real genes however I do not have a transcriptome for my organisms unless I used a fairly distant outgroup as a reference to assemble the gene sequences.

Bucky Astral Gene tree Species Tree • 2.1k views

ADD COMMENT • link updated 5.8 years ago by Brice Sarver ★ 3.8k • written 5.8 years ago by Colaptes ▴ 110

score 2 · Accepted Answer · 2019-09-17

Gene trees don't need to be generated from 'genes' in the sense that they're absolutely restricted to coding loci, but you should be working at a fine enough scale where you'd expect a similar model of sequence evolution and linkage to apply. I've done this estimating gene trees for loci concatenated across chromosomes and by reducing the scale down to exons within a 100kbp window or so, looking for any discordance in support and results once thrown into species tree analyses.

Be careful, though, if you're using model-based phylogenetics to estimate your gene trees - you'll want to include confidently called reference alleles as well to flesh out your alignments and allow for proper estimation of a proportion of invariable sites and/or parameters in the gamma distribution describing among-site rate heterogeneity. Multisample VCFs, once converted to FASTA/Nexus to estimate your gene trees, will only include variable positions and may not be appropriate. You'll also want to be extremely careful with your filtering criteria, especially if you're working with low-coverage data that could introduce biases.

Another point of advice: it may be easier to manipulate FASTA files programmatially as opposed to manipulating the VCFs early on since you ultimately need to work with multispecies alignments per gene/locus/region/scaffold. Check out SeqIO in Biopython and Biostrings in R (part of Bioconductor).