Question

getting a consense one header .fasta file from a genic region, from .bam template

2

Entering edit mode

6.7 years ago

ricfoz ▴ 100

I started with a 9GB file of a human Xth chromosome with a .bam format, and I am trying to get a specific genic region in .fasta classic format (you know, a one header file, starting with ">", with description on header, and a single line, of continuous tandem single nucleotides).

I have been able to retrieve the reads of the genic region I want in bam format. I share the path I have followed:

samtools sort OriginalFile.bam -o OriginalFile.sort.bam
samtools inex OriginalFile.sort.bam
samtools view -b OriginalFile.sort.bam RegName:XX-XX+1 > GeneName.bam

after this I got the GeneName.bam file which contains all the reads on which I am interested in, but I can't run the phylogenetic tool I am trying to pipe the GeneName.bam sequence to ... in order to follow my workpath, I need to transform this GeneName.bam file, into this classic .fasta format I described lines ago.

I tried a couple of pipelines in order to achieve this, but what I have got as a retrieved .fasta file, is a man readable file, but with all the single fastq reads, with headers and everything, stacked on top of each other. I need to get the consensus of this, with one header and only contiguous nucleotides.

I got GeneName.fasta file running:

samtools bam2fq GeneName.bam > GeneName.fasta

or

samtools bam2fq GeneName.bam > GeneName.fastq
samtools seqtk seq -A GeneName.fastq > GeneName.fasta # (i installed bowtie2, boost, tophat2, seqtk, and bcf tools, as they seem to complement each other's works at some times)

and I got the same result on both, analogous file.

With this line of thinking, I thought I may have to run mpileup to the GeneName.bam file, before transforming to fasta, but mpileup gives as a result .bcf format... still, this last assumption is just a form of discussing my problem.

Have anyone out there found the solution for this, can a .bam file converted to classic .fasta format, or can help in any way?

Greetings

fasta bam • 3.2k views

ADD COMMENT • link updated 16 months ago by Ram 44k • written 6.7 years ago by ricfoz ▴ 100

0

Entering edit mode

See also Converting .bam file to .fasta file from a genic region

ADD REPLY • link 6.7 years ago by WouterDeCoster 47k

0

Entering edit mode

Yes, that post is mine... i have posted a couple of times on the matter as i have been working through the problem ... now i am at the point at which i can retrieve the genicregion.fasta file, but it gives only single fastq reads piled up upon each other... i want a .fasta format, which is completely different from .fastq single read.

ADD REPLY • link 6.7 years ago by ricfoz ▴ 100

0

Entering edit mode

now that you got fastq file, try converting to fasta file using tools such as: https://bioconda.github.io/recipes/ucsc-fastqtofa/README.html. This would give you each read as single fasta record. You can download the binaries from http://hgdownload.cse.ucsc.edu/admin/exe/ and select as per your platform.

ADD REPLY • link 6.7 years ago by cpad0112 21k

0

Entering edit mode

That's not what OP needs, rather a de novo assembly. OP wants a "reference" fasta for that region if I understood correctly, or create a consensus sequence of the bam file.

ADD REPLY • link 6.7 years ago by WouterDeCoster 47k

score 3 · Accepted Answer · 2017-11-23

Alright, i solved my problem, so i share what i found to the community so you can solve similar problems

after retrieving a .bam file with the desired length, through the command "samtools view -b ChrName:XX-XX+1 > Outputname.bam"

you need to pipe this output into samtools mpileup command, in order to get a .bcf file

"samtools mpileup -g -f refgenome.fa filename.bam > filename.bcf" (check samtools tutorials to change or check/confirm desired flag options)

then you need to transform .bcf to .vcf callig: "bcftools call -m filename.bcf > filename .vcf"

this last .vcf file needs to be compressed and indexed like following:

bgzip -c filename.vcf > filename.vcf.gz

tabix filename.vcf.gz

after having succesfully run these commands, you can get to the final step, using samtools again:

samtools faidx refgenome.fa ChrName:XX-XX+1 | bcftools consensus filename.vcf.gz > filename.fa

I hope these pipeline is useful for you people, and i thank all the community, and people who aided on the go. Greetings!