Question

Multiple sequence alignment and editing following consensus sequence generation in Samtools

9.5 years ago

jal ▴ 10

Starting with pool-seq data, I'm aligning reads to the chloroplast genome and after filtering, using samtools to generate a consensus sequence for each of several populations. I would ultimately like to build a phylogeny from these sequences but am not sure of best practices between consensus sequence generation and input into standard phylogenetics tools.

I think I need to do the following:

1. Generate a multiple sequence alignment
2. Visually inspect the alignment and remove obvious errors (e.g. large indels with respect to the reference?)
3. Verify SNPs (e.g. go back to the original alignment of reads for each population)

I'm wondering if anyone can provide some guidance as to which programs are most useful for these steps and the data visualization? Also any suggestions as to what types of errors to look out for given this data and how to decide on reliable SNPs would be greatly appreciated as I'm very new to the genomics end of things.

sequence-alignment next-gen • 4.2k views

updated 2.3 years ago by Ram 43k • written 9.5 years ago by jal ▴ 10

Ram · Answer 1 · 2015-01-16

If you have not found a solution yet, this approach might help:

1. Get a BAM file from the genome alignment.
2. Filter the BAM file and sort it.
3. Perform indel realignment (very important) and call SNPs (for example with samtools, which will be faster).
4. Convert the BAM files to BED files, then merge the BED intervals.
5. Use multiIntersectBed from bedtools to extract the regions that are present in at least 'n' samples (column 4 shows the number of samples in which a region is present) and save the output.
6. Use FastaAlternateReferenceMaker from GATK with the option --useIUPAC, giving it each VCF file together with the BED file created in step 5 via the -L option, so that only the regions present in that BED file are extracted. This outputs the BED regions in FASTA format, with the variant placed wherever the VCF file records a SNP/indel.
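The filtering in the multiIntersectBed step can be sketched in a few lines of Python. This is only an illustration, assuming tab-separated output without a header line, with the sample count in column 4 as described above. The function name `regions_in_at_least` and the example intervals on `chrC` are made up for the sketch.

```python
def regions_in_at_least(lines, min_samples):
    """Keep intervals whose 4th column (number of samples) is >= min_samples.

    Assumes tab-separated lines: chrom, start, end, sample count, then
    any further columns, which are ignored here.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        chrom, start, end, count = fields[0], int(fields[1]), int(fields[2]), int(fields[3])
        if count >= min_samples:
            kept.append((chrom, start, end))
    return kept


# Hypothetical intervals for three populations (not real data).
example = [
    "chrC\t0\t100\t1\t1",
    "chrC\t100\t450\t3\t1,2,3",
    "chrC\t450\t500\t2\t1,3",
]

shared = regions_in_at_least(example, min_samples=2)
for chrom, start, end in shared:
    print(f"{chrom}\t{start}\t{end}")
```

The printed lines are plain three-column BED, which is the kind of file the -L option in the last step expects.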

Generate a FASTA file for each sample, then stitch each sample's region sequences together in the same order so that every sample is represented by a single sequence. Run a multiple sequence alignment on these stitched sequences and build the phylogeny from it. Let me know if anything in the method is unclear.
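As a rough sketch of the stitching step, the snippet below concatenates per-region sequences in one fixed order for every sample and writes them out as FASTA. It assumes the region sequences have already been read into a dictionary; the names `stitch` and `to_fasta`, the region labels and the sequences are all hypothetical.

```python
def stitch(regions_by_sample, region_order):
    """Concatenate each sample's region sequences in the same fixed order."""
    stitched = {}
    for sample, regions in regions_by_sample.items():
        missing = [r for r in region_order if r not in regions]
        if missing:
            # Every sample must contribute every region, otherwise the
            # stitched sequences would not be comparable.
            raise ValueError(f"{sample} is missing regions: {missing}")
        stitched[sample] = "".join(regions[r] for r in region_order)
    return stitched


def to_fasta(stitched, width=60):
    """Format stitched sequences as FASTA text, wrapped at `width` columns."""
    out = []
    for sample in sorted(stitched):
        seq = stitched[sample]
        out.append(f">{sample}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Hypothetical region sequences for two populations (not real data).
regions_by_sample = {
    "popA": {"chrC:100-450": "ACGTAC", "chrC:450-500": "GGR"},
    "popB": {"chrC:100-450": "ACGTTC", "chrC:450-500": "GGA"},
}
order = ["chrC:100-450", "chrC:450-500"]

stitched = stitch(regions_by_sample, order)
print(to_fasta(stitched), end="")
```

Raising an error on a missing region is a deliberate choice here: silently skipping it would shift the downstream positions for that sample. The resulting FASTA text can then be passed to whichever multiple sequence aligner you prefer.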