Background
We have coordinates for putative promoter regions, of the form <chromosome number>:<start>-<end>, e.g. 1:1-1500, one set for each of four breeds of Potato (S. tuberosum).
We have also assembled genomes, whose output is in the form of sorted.bam files, again one for each of four potato breeds, which we have created by using bwa mem by mapping four sets of read-pairs against the soltub3.0 reference genome from Ensembl plants.
Goal
My immediate task is to obtain a single FASTA sequence for every putative promoter for just one of our assembled genome breeds, which we can call the "Georgina" breed.
Progress so far
Since we have both (a) a sorted.bam genome, and (b) a list of coordinates for our promoters for the Georgina breed, for each coordinate I can do this:
samtools view input.bam 1:18000-45500
Which outputs to the terminal all reads falling within that region.
Question
Is there a way to output/pipe the samtools view statement, above, either using a flag/other program so that instead of multiple reads falling within a region, the output is a single "consensus" sequence of the multiple reads falling within a region, whole output is perhaps as a FASTA file rather than a bam/sam/etc. output?
Huge thanks in advance, and the your time in reading this.
After you slice the BAM file using
samtools view
use Generating consensus sequence from bam file or generate consensus on your entire file and then slice the sequence you want out.Hi GenoMax , I will do that (really appreciate your quick reply thank you).
Before I go back to the Computer Lab tomorrow just to be sure so you mean:
It's slightly counter-intuitive why I'd need the ref.fa part, but I will treat it as magical and will give it a go, thanks again!
ref.fa
is the reference your data is aligned against. Replace that with your own reference file. You are essentially calling SNP's in your aligned data and then creating a new consensus using the SNP VCF file.Oh I see, you explained that really well thanks. (It's no longer magic as Arthur C Clarke would say).
Down the line, we might be doing some de novo re-assembly of the genomes... Im dreading this since we will then lack a ref.fa...
Eternally grateful GenoMax !