Build Phylogenetic Trees From BAM Files
8.8 years ago
Luca Beltrame ▴ 240

Hello,

more and more publications show phylogenetic trees (often unrooted) to represent similarities between samples of the same origin (e.g. tumors). I set out to do the same, but my searches turned up nothing: neither a way to build a tree directly from BAM files, nor a way to do it via conversion. The only program I found is POPBAM, which supposedly generates trees, but it doesn't work for me (it segfaults immediately).

Given a series of BAM files, what would be the best way to use/convert them to a format that can be used to build trees?

sequencing bam • 5.1k views
8.6 years ago
Fwip ▴ 490

I've had more success with the POPBAM source from GitHub (I used commit 0cdbacc2fd869e0b65a64bf5ff38ca1c21f41657).

The important thing to note, though, is the header adjustment you need to do in order for POPBAM to recognize your samples. From http://popbam.sourceforge.net/:

To enable POPBAM to perform population-level analyses, it is first necessary to modify the input BAM file header. Users must add the "PO" tag to the header line for each read group. The "PO" tag can be any string, as long as the string is identical between samples from the same population. One example may be that a BAM file has three read groups (R21, R22, and R25). The R22 and R25 read groups are from two different lines of Drosophila melanogaster called "MEL01" and "MEL02", while the third read group, R21, is from a single line of D. simulans called "SIM01". Below is an example of the BAM header including the "PO" tag:

@RG  ID:R22  SM:MEL01  PO:MEL
@RG  ID:R25  SM:MEL02  PO:MEL
@RG  ID:R21  SM:SIM01  PO:SIM


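First, be sure to include read group information when merging. With samtools merge, -r attaches to each read an RG tag inferred from its source file name, and -h takes the header of the merged file from the given file: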

samtools merge -rh group1.header.txt group1.bam CD3674.bam CD3688.bam CD3692.bam CD3700.bam CD3719.bam

The header file (group1.header.txt here) carries one PO tag per read group:

@HD VN:1.3  SO:coordinate
@SQ SN:NC_009089  LN:4290252  AS:NC_009089
@RG ID:CD3674 SM:CD3674 PO:CD3674
@RG ID:CD3688 SM:CD3688 PO:CD3688
@RG ID:CD3692 SM:CD3692 PO:CD3692
@RG ID:CD3700 SM:CD3700 PO:CD3700
@RG ID:CD3719 SM:CD3719 PO:CD3719


Finally, run POPBAM like so:

popbam tree -f ref.fasta NC_009089:1-42000000 -o group1.txt > group1.tree


(As far as I can tell, the region is required, the -o output file is ignored, and output is written to stdout.)
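
If there are many read groups, the header adjustment described above can be scripted rather than done by hand. The following is only a minimal sketch, assuming the header has been dumped to text (for example with samtools view -H) and that its fields are tab-separated as the SAM format specifies. The function name add_po_tags and the sample-to-population mapping are made up for illustration and are not part of POPBAM or samtools.

```python
def add_po_tags(header_text, population_of):
    """Return header_text with a PO:<population> field on every @RG line.

    population_of is a hypothetical mapping from sample name (the SM field)
    to the population label that POPBAM should see.
    """
    out = []
    for line in header_text.splitlines():
        if line.startswith("@RG"):
            fields = line.split("\t")
            # Collect TAG:value pairs, splitting on the first colon only.
            tags = dict(f.split(":", 1) for f in fields[1:])
            # Drop any existing PO field so the tag is never duplicated.
            fields = [f for f in fields if not f.startswith("PO:")]
            fields.append("PO:" + population_of[tags["SM"]])
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


# Read groups taken from the POPBAM documentation example quoted above.
header = (
    "@HD\tVN:1.3\tSO:coordinate\n"
    "@RG\tID:R22\tSM:MEL01\n"
    "@RG\tID:R25\tSM:MEL02\n"
    "@RG\tID:R21\tSM:SIM01\n"
)
populations = {"MEL01": "MEL", "MEL02": "MEL", "SIM01": "SIM"}
new_header = add_po_tags(header, populations)
print(new_header, end="")
```

The edited header text can then be written to a file and applied to the BAM with samtools reheader, or passed to samtools merge via -h as in the command above.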

8.8 years ago
Fabio Marroni ★ 2.9k

You may:

1. Post a sample of your input file and the exact POPBAM error message, so that someone may be able to help you (I have never used POPBAM myself).
2. Use the BAM files to build a consensus sequence for each sample, then compare the consensus sequences (which will be in FASTA format) to build a phylogenetic tree. I don't know how phylogenetic software packages will behave if you give them sequences that may be gigabases in size.
3. Use the BAM files to call SNPs, use the SNPs to compute a genetic distance between each pair of samples, and then feed the resulting distance matrix to a phylogeny inference package.
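
Option 3 can be sketched roughly as follows. This is only an illustration under stated assumptions: the genotypes have already been called and reduced to one character per variable site ("N" for a missing call), the distance is a plain p-distance (fraction of differing sites), and the sample names and genotypes are made up. The matrix is printed in PHYLIP square format, which distance-based programs such as PHYLIP's neighbor read.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of sites that differ, skipping sites missing ('N') in either sample."""
    compared = [(x, y) for x, y in zip(a, b) if x != "N" and y != "N"]
    if not compared:
        raise ValueError("no sites called in both samples")
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(genotypes):
    """Return (sorted sample names, nested dict of pairwise p-distances)."""
    names = sorted(genotypes)
    dist = {n: {n: 0.0} for n in names}
    for a, b in combinations(names, 2):
        d = p_distance(genotypes[a], genotypes[b])
        dist[a][b] = dist[b][a] = d
    return names, dist


def to_phylip(names, dist):
    """Format a square distance matrix PHYLIP-style (names padded to 10 characters)."""
    lines = [f"{len(names):5d}"]
    for a in names:
        row = " ".join(f"{dist[a][b]:.4f}" for b in names)
        lines.append(f"{a[:10]:<10}{row}")
    return "\n".join(lines) + "\n"


# Made-up genotypes at five variable sites, one character per site.
genotypes = {
    "tumorA": "ACGTA",
    "tumorB": "ACGTT",
    "tumorC": "TCGNT",
}
names, dist = distance_matrix(genotypes)
print(to_phylip(names, dist), end="")
```

A p-distance ignores heterozygosity and allele frequencies, so for tumor samples a more suitable distance may be needed; the point here is only the shape of the pipeline from per-sample calls to a matrix that a tree-building program accepts.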

Thanks for the answer. I'm doing targeted resequencing of a small number of targets (30 genes), so I assume sequence size won't be a big problem.

When you mention a consensus sequence, do you mean using something like pileup to generate it?


Pileup might work. There are plenty of tools that go from pileup to consensus (VarScan, GATK...).