Question: Building gene trees
gravatar for Colaptes
13 months ago by
Colaptes40 wrote:

Hi, I have low-depth genome sequencing data from several species in a family and I would like to build gene trees as input for analyses with programs like BUCKy and Astral. (I want to estimate a species tree, and assess evidence for introgression between nonsister lineages). I have mapped the reads to the closest outgroup available and produced VCF files for each scaffold.

I am wondering whether "gene trees" need to be real genes, or whether it is acceptable to simply cut the VCF file into small pieces and build "gene" trees of these pieces? It seems like most papers use real genes however I do not have a transcriptome for my organisms unless I used a fairly distant outgroup as a reference to assemble the gene sequences.

ADD COMMENTlink modified 13 months ago by Brice Sarver3.5k • written 13 months ago by Colaptes40
gravatar for Brice Sarver
13 months ago by
Brice Sarver3.5k
United States
Brice Sarver3.5k wrote:

Gene trees don't need to be generated from 'genes' in the sense that they're absolutely restricted to coding loci, but you should be working at a fine enough scale where you'd expect a similar model of sequence evolution and linkage to apply. I've done this estimating gene trees for loci concatenated across chromosomes and by reducing the scale down to exons within a 100kbp window or so, looking for any discordance in support and results once thrown into species tree analyses.

Be careful, though, if you're using model-based phylogenetics to estimate your gene trees - you'll want to include confidently called reference alleles as well to flesh out your alignments and allow for proper estimation of a proportion of invariable sites and/or parameters in the gamma distribution describing among-site rate heterogeneity. Multisample VCFs, once converted to FASTA/Nexus to estimate your gene trees, will only include variable positions and may not be appropriate. You'll also want to be extremely careful with your filtering criteria, especially if you're working with low-coverage data that could introduce biases.

Another point of advice: it may be easier to manipulate FASTA files programmatially as opposed to manipulating the VCFs early on since you ultimately need to work with multispecies alignments per gene/locus/region/scaffold. Check out SeqIO in Biopython and Biostrings in R (part of Bioconductor).

ADD COMMENTlink modified 13 months ago • written 13 months ago by Brice Sarver3.5k

Thank you very much for the advice! That is very helpful.

ADD REPLYlink written 13 months ago by Colaptes40
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1522 users visited in the last hour