Converting samtools view <region> output (multiple reads) into single fasta sequence (consensus?)
0
0
Entering edit mode
12 weeks ago
a1ultima ▴ 760

Background

We have coordinates for putative promoter regions, of the form <chromosome number>:<start>-<end>, e.g. 1:1-1500, one set for each of four breeds of Potato (S. tuberosum).

We have also assembled genomes, whose output is in the form of sorted.bam files, again one for each of four potato breeds, which we have created by using bwa mem by mapping four sets of read-pairs against the soltub3.0 reference genome from Ensembl plants.

Goal

My immediate task is to obtain a single FASTA sequence for every putative promoter for just one of our assembled genome breeds, which we can call the "Georgina" breed.

Progress so far

Since we have both (a) a sorted.bam genome, and (b) a list of coordinates for our promoters for the Georgina breed, for each coordinate I can do this:

samtools view input.bam 1:18000-45500


Which outputs to the terminal all reads falling within that region.

Question

Is there a way to output/pipe the samtools view statement, above, either using a flag/other program so that instead of multiple reads falling within a region, the output is a single "consensus" sequence of the multiple reads falling within a region, whole output is perhaps as a FASTA file rather than a bam/sam/etc. output?

coordinate dna samtools view consensus • 320 views
1
Entering edit mode

After you slice the BAM file using samtools view use Generating consensus sequence from bam file or generate consensus on your entire file and then slice the sequence you want out.

0
Entering edit mode

Hi GenoMax , I will do that (really appreciate your quick reply thank you).

Before I go back to the Computer Lab tomorrow just to be sure so you mean:

bcftools mpileup -Ou -f ref.fa input.bam | bcftools call -Ou -mv | bcftools norm -f **ref.fa** Oz -o output.vcf.gz


It's slightly counter-intuitive why I'd need the ref.fa part, but I will treat it as magical and will give it a go, thanks again!

1
Entering edit mode

ref.fa is the reference your data is aligned against. Replace that with your own reference file. You are essentially calling SNP's in your aligned data and then creating a new consensus using the SNP VCF file.

0
Entering edit mode

Oh I see, you explained that really well thanks. (It's no longer magic as Arthur C Clarke would say).

Down the line, we might be doing some de novo re-assembly of the genomes... Im dreading this since we will then lack a ref.fa...

Eternally grateful GenoMax !