Question: bwa-mem, from .sam/.bam to fasta file
0
gravatar for ieie
8 months ago by
ieie0
ieie0 wrote:

Dear all, I am trying to do read mapping to construct cp genomes using a reference genome and performing the mapping with bwa-mem.

I am running this command bwa mem -M -B 4 $REF $R1 $R2 > output.bam

-M to mark shorter split as secondary and -B 4 for mismatch penalty. I have two questions:

1)after I have run bwa and my reference genome fasta file is 160kb and and my fastq files are 2.70 GB should I get a bam file of 5 GB or should it be smaller?

2) when I try to convert the bam file into a fasta using samtools is there a way to have the reference genome on the top and then the reads? how should the fasta file look like in order to then use it to annotate the genomes for example in Geneious?

I hope my question are not too vague or basic. thanks a lot

bwa samtools • 613 views
ADD COMMENTlink written 8 months ago by ieie0
1

The size of the bam depends on how many reads mapped to the reference.

Also, be aware that bwa mem outputs a SAM file by default, not a BAM.

ADD REPLYlink written 8 months ago by Corentin260

The size of the bam depends on how many reads mapped to the reference

That is not true. All reads will be at the SAM/BAM file, they simply will be flagged as unmapped. Still, default (and only) output of BWA mem is SAM not BAM. Simply count reads in one of the fastq files (wc -l fastq.fastq), divide this by 4, and compare this with samtools flagstat output.sam

ADD REPLYlink written 8 months ago by ATpoint14k

thanks for your answer! I guess my main issue is that after obtaining the output.sam files of my series of reads.fastq I would like to end with a fasta multi alignment file containing the constructed cp genomes and I have looked for a pipeline to do that but with no useful solutions. so I after performing the bwa-mem I can't picture the further steps.

ADD REPLYlink written 8 months ago by ieie0
1

Once you obtain the alignments (do the SAM/BAM conversion etc.) you will need to call variants and then generate the consensus. This process is described on this page.

That said if you have access to Geneious licenses then why don't you do this entire thing in Geneious. Normally people won't recommend commercial products but if you are new to command line it may be much easier to get this done in Geneious.

ADD REPLYlink modified 8 months ago • written 8 months ago by genomax64k

hi I am doing the sorting of the bam file through:

samtools sort 103308_1116_S42_L007_R2_001.bam > 103308_1116_S42_L007_R2_001sorted.bam

but I keep getting this output:

head 103308_1116_S42_L007_R2_001sorted.bam

?BC?EN? ?0???~b????i?????%?:f?R? A??O6{??p??8?&?>?S?Sl?r??ԭ??X?~?0?v????/?u?_96???BCB=?}?d?uW?xw]?K?jʗ?o???뭊g???RTO?]?3f?t??C??c'?(xE?'$??tHA???D?????? EB|#"!?#AP??E> 8?s????V??[]?w?^????w?s?_~?^??^?????Q?7??????hDe6??r???G?gg??:??uP?ʸLfQ??I?G_5????B?wF?_?׫?jr???|????5?,????x?o??V????z?Z-????w?????xy???͏?????uW??G9??q|OM[/_??>??n??.?????G???'?0???^???/??7??s?>y??W?x|?fI???>?Qi? N??a??A?????´ ??}??|? z??w4???=??Fa9 ?(???4L{o??? ??w?4?'g?&L??B??P??C?du??P?G[???D???:e????q?2u.5u?@?o????qףu???R?;I?G?%? ????f?+?)?S?qp?^:??8?ԡwj?}??|GM]?-:dS?)N?????#??@T)k?$|d~????O[?~?????u?5?gi??"?^b?? *?M?Dy@?fT???9>7G???=?pr?H?3?n ??? S?O?÷x?C?2???>A????F?<l r????5y?%??4zr??="" ?^??????,??t?gyth??????jЯe?="" ???="" anybody="" knows="" why?="" i="" though="" you="" need="" to="" sort="" before="" doing="" the="" variant="" calling="" and="" getting="" the="" consensus="" sequence="" but="" this="" look="" really="" weird.<="" p="">

ADD REPLYlink written 8 months ago by ieie0
1

Of course. BAM is a binary (compressed) version of the SAM format. The default output of BWA is SAM.

bwa mem (options...) > out.sam
samtools sort out.sam -o out_sorted.bam

Or more elegant:

bwa mem (...) | samtools sort -o out_sorted.bam
ADD REPLYlink modified 8 months ago • written 8 months ago by ATpoint14k

Right! I run the bwa mem and I got the SAM file, then I thought to convert it in BAM before sorting, thanks a lot!

ADD REPLYlink written 8 months ago by ieie0
1

Also, if you wish to visualise the content of a BAM file you can use "samtools view out.bam"

ADD REPLYlink written 8 months ago by Corentin260

about variant calling, after sorting I followed the pipeline of the page you gave me and I get a (consensus sequence with the reference genome header, don't know why) and when in the end I run :

cat ~/DNA-Mark/Guibourtia_leonensis_Cp.fas | bcftools consensus calls.vcf.gz > consensus.fa

I get this messages for different sites The site MG564755.1:122 overlaps with another variant, skipping...do you know whether is it an error and if yes what kind of error might be?

ADD REPLYlink modified 8 months ago • written 8 months ago by ieie0

Yes, you're right (I was, wrongly, thinking about STAR, which produces by default a sam which contains only the aligned read)

ADD REPLYlink written 8 months ago by Corentin260

Really, STAR does that? Ok good to know, that means I will never use aligned BAM files from STAR to get back the original reads in fastq format. Thanks for letting me know :-D

ADD REPLYlink written 8 months ago by ATpoint14k

With STAR you can use the "--outReadsUnmapped" option to get the unmapped reads in a separate bam. I guess that you can get the original fastq by using both the "Aligned.out.sam" and the "unmapped.bam"

ADD REPLYlink written 8 months ago by Corentin260
2
gravatar for genomax
8 months ago by
genomax64k
United States
genomax64k wrote:

For #1 there is no way to predict size of the BAM files (file sizes are never a good measure of anything other than a run failure).

when I try to convert the bam file into a fasta using samtools is there a way to have the reference genome on the top and then the reads? how should the fasta file look like in order to then use it to annotate the genomes for example in Geneious?

Fasta format is a single sequence by definition (for what it is worth). What you are referring to would be an aligned fasta format file, which is not something samtools is going to produce. @Pierre has a tool called SAM4WebLogo that will produce aligned fasta files. I am mentioning it here but it is probably not what you have in mind.

There are other ways of calling a consensus sequence using bcftools which will give you a single fasta sequence. That could be used for annotating in downstream analysis.

ADD COMMENTlink modified 8 months ago • written 8 months ago by genomax64k

thanks a lot for your answer, really useful!

ADD REPLYlink written 8 months ago by ieie0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 764 users visited in the last hour