Question

Build Phylogenetic Trees From Bam Files

10.4 years ago

Luca Beltrame ▴ 240

Hello,

more and more publications show phylogenetic trees (often unrooted) to represent similarities between samples from the same origin (e.g. tumors). I set out to do the same, but my searches turned up nothing: neither a way to do this directly from BAM files nor a way to do it via conversion. The only program I found is POPBAM, which supposedly generates trees, but it doesn't work for me (it segfaults immediately).

Given a series of BAM files, what would be the best way to use/convert them to a format that can be used to build trees?

sequencing bam • 6.2k views

updated 10.2 years ago by Fwip ▴ 500 • written 10.4 years ago by Luca Beltrame ▴ 240

Ram · Answer 1 · 2014-02-19

I've had more success with the POPBAM source from GitHub (I used commit 0cdbacc2fd869e0b65a64bf5ff38ca1c21f41657).

The important thing to note, though, is the header adjustment you need to do in order for POPBAM to recognize your samples. From http://popbam.sourceforge.net/:

To enable POPBAM to perform population-level analyses, it is first necessary to modify the input BAM file header. Users must add the "PO" tag to the header line for each read group. The "PO" tag can be any string, as long as the string is identical between samples from the same population. One example may be that a BAM file has three read groups (R21, R22, and R25). The R22 and R25 read groups are from two different lines of Drosophila melanogaster called "MEL01" and "MEL02", while the third read group, R21, is from a single line of D. simulans called "SIM01". Below is an example of the BAM header including the "PO" tag:
@RG  ID:R22  SM:MEL01  PO:MEL
@RG  ID:R25  SM:MEL02  PO:MEL
@RG  ID:R21  SM:SIM01  PO:SIM
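Editing the header by hand gets tedious with many read groups. As a minimal sketch (not part of POPBAM or samtools), the function below appends a `PO` tag to each `@RG` line by looking up the read group's `SM` value in a mapping file. The function name, the `pops.tsv` file, and its two-column tab-separated layout (sample, population) are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: append a PO tag to each @RG line of a SAM header, looking up the
# population by the read group's SM (sample) value.
# Assumption: the mapping file is a two-column, tab-separated table
# (sample name, population label); its name and layout are hypothetical.

add_po_tags() {
    # $1 = mapping file; the SAM header is read from stdin
    awk -F'\t' -v OFS='\t' '
        NR == FNR { pop[$1] = $2; next }      # first input: sample -> population
        /^@RG/ {
            sm = ""
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SM:/) sm = substr($i, 4)
            # add the tag only for known samples that do not have one yet
            if ((sm in pop) && $0 !~ /\tPO:/)
                $0 = $0 OFS "PO:" pop[sm]
        }
        { print }
    ' "$1" -
}

# Demo with the read groups from the POPBAM documentation example
workdir=$(mktemp -d)
printf 'MEL01\tMEL\nMEL02\tMEL\nSIM01\tSIM\n' > "$workdir/pops.tsv"
printf '@HD\tVN:1.3\tSO:coordinate\n@RG\tID:R22\tSM:MEL01\n@RG\tID:R25\tSM:MEL02\n@RG\tID:R21\tSM:SIM01\n' |
    add_po_tags "$workdir/pops.tsv"
rm -r "$workdir"
```

The edited header could then be written back to an existing BAM with `samtools reheader`, or passed to `samtools merge -h` when merging.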
  

First, be sure to include read group information when merging:

samtools merge -rh group1.header.txt group1.bam CD3674.bam CD3688.bam CD3692.bam CD3700.bam CD3719.bam

group1.header.txt (SAM header fields must be separated by tabs):

 @HD VN:1.3  SO:coordinate                       
 @SQ SN:NC_009089  LN:4290252  AS:NC_009089      
 @RG ID:CD3674 SM:CD3674 PO:CD3674               
 @RG ID:CD3688 SM:CD3688 PO:CD3688               
 @RG ID:CD3692 SM:CD3692 PO:CD3692               
 @RG ID:CD3700 SM:CD3700 PO:CD3700               
 @RG ID:CD3719 SM:CD3719 PO:CD3719

Finally, run it like so:

popbam tree -f ref.fasta NC_009089:1-42000000 -o group1.txt > group1.tree

(As far as I can tell, the region is required, the -o output file is ignored, and output is written to stdout.)
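Since the region appears to be mandatory, one way to cover a whole reference sequence is to build the region string from the `@SQ` lines of the header. This is a minimal sketch, assuming POPBAM accepts samtools-style `name:start-end` regions as in the command above; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: print one "SN:1-LN" region per @SQ line of a SAM header.
# Reads the header from the files given as arguments, or from stdin.
regions_from_header() {
    awk -F'\t' '
        /^@SQ/ {
            sn = ""; ln = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) sn = substr($i, 4)   # sequence name
                if ($i ~ /^LN:/) ln = substr($i, 4)   # sequence length
            }
            if (sn != "" && ln != "") print sn ":1-" ln
        }
    ' "$@"
}

# Demo on the @SQ line from group1.header.txt
printf '@HD\tVN:1.3\tSO:coordinate\n@SQ\tSN:NC_009089\tLN:4290252\tAS:NC_009089\n' |
    regions_from_header
```

With a real BAM file, the header would come from `samtools view -H group1.bam` piped into the function.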

Ram · Answer 2 · 2013-12-12

10.4 years ago

Fabio Marroni ★ 3.0k

You could:

1. Post a sample of the input file and the exact POPBAM error message, so that someone may be able to help you (I have never used POPBAM).
2. Use the BAM files to build a consensus sequence per sample, then compare the consensus sequences (which will be in FASTA format) to build a phylogenetic tree. I don't know how phylogenetic software packages will behave if you give them sequences that may be gigabases in size.
3. Use the BAM files to call SNPs, use the SNPs to compute the genetic distance between every pair of samples, and then feed the resulting distance matrix to a phylogeny inference package.
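The third option can be sketched with nothing but awk. This is a toy illustration, not a replacement for a proper pipeline: it assumes the SNP calls have already been exported to a tab-separated table (a layout invented here) with one column per sample, one row per site, a header row of sample names, and `N` for a missing call. It prints the proportion of differing sites for each pair of samples as a square PHYLIP-style matrix, the kind of input that distance-based programs such as PHYLIP's `neighbor` read.

```shell
#!/bin/sh
# Sketch: pairwise p-distance (differing sites / compared sites) from a
# table of SNP calls. The table layout is an assumption for illustration:
# header row of sample names, one row per site, "N" = missing call.
snp_distance_matrix() {
    awk -F'\t' '
        NR == 1 { n = NF; for (i = 1; i <= n; i++) name[i] = $i; next }
        {
            for (i = 1; i <= n; i++)
                for (j = i + 1; j <= n; j++) {
                    if ($i == "N" || $j == "N") continue   # skip missing calls
                    cmp[i, j]++                            # sites compared
                    if ($i != $j) diff[i, j]++             # sites that differ
                }
        }
        END {
            printf "%d\n", n                               # number of samples
            for (i = 1; i <= n; i++) {
                printf "%-10s", name[i]                    # name padded to 10 chars
                for (j = 1; j <= n; j++) {
                    a = (i < j) ? i : j
                    b = (i < j) ? j : i
                    d = (i != j && cmp[a, b] > 0) ? diff[a, b] / cmp[a, b] : 0
                    printf " %.4f", d
                }
                printf "\n"
            }
        }
    ' "$@"
}

# Demo: three samples, four sites
printf 'S1\tS2\tS3\nA\tA\tG\nC\tC\tC\nT\tG\tG\nN\tA\tA\n' | snp_distance_matrix
```

Strict PHYLIP names are limited to 10 characters, so longer sample names would need shortening first. Pairs with no comparable sites are reported as 0 here, which a real analysis should treat as missing instead.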

updated 4.5 years ago by Ram 43k • written 10.4 years ago by Fabio Marroni ★ 3.0k

Thanks for the answer. I'm using targeted resequencing on a small number of targets (30 genes), so I assume sequence size won't be a big problem.

By consensus sequence, do you mean using something like pileup to generate it?

10.4 years ago by Luca Beltrame ▴ 240

Pileup might work. There are plenty of tools that go from pileup to consensus (VarScan, GATK, ...).

10.4 years ago by Fabio Marroni ★ 3.0k