Question

bwa-mem, from .sam/.bam to fasta file

0

Entering edit mode

5.8 years ago

ieie ▴ 10

Dear all, I am trying to do read mapping to construct cp genomes using a reference genome and performing the mapping with bwa-mem.

I am running this command bwa mem -M -B 4 $REF $R1 $R2 > output.bam

-M to mark shorter split as secondary and -B 4 for mismatch penalty. I have two questions:

1)after I have run bwa and my reference genome fasta file is 160kb and and my fastq files are 2.70 GB should I get a bam file of 5 GB or should it be smaller?

2) when I try to convert the bam file into a fasta using samtools is there a way to have the reference genome on the top and then the reads? how should the fasta file look like in order to then use it to annotate the genomes for example in Geneious?

I hope my question are not too vague or basic. thanks a lot

bwa samtools • 4.0k views

ADD COMMENT • link 5.8 years ago by ieie ▴ 10

1

Entering edit mode

The size of the bam depends on how many reads mapped to the reference.

Also, be aware that bwa mem outputs a SAM file by default, not a BAM.

ADD REPLY • link 5.8 years ago by Corentin ▴ 600

0

Entering edit mode

The size of the bam depends on how many reads mapped to the reference

That is not true. All reads will be at the SAM/BAM file, they simply will be flagged as unmapped. Still, default (and only) output of BWA mem is SAM not BAM. Simply count reads in one of the fastq files (wc -l fastq.fastq), divide this by 4, and compare this with samtools flagstat output.sam

ADD REPLY • link 5.8 years ago by ATpoint 81k

0

Entering edit mode

thanks for your answer! I guess my main issue is that after obtaining the output.sam files of my series of reads.fastq I would like to end with a fasta multi alignment file containing the constructed cp genomes and I have looked for a pipeline to do that but with no useful solutions. so I after performing the bwa-mem I can't picture the further steps.

ADD REPLY • link 5.8 years ago by ieie ▴ 10

1

Entering edit mode

Once you obtain the alignments (do the SAM/BAM conversion etc.) you will need to call variants and then generate the consensus. This process is described on this page.

That said if you have access to Geneious licenses then why don't you do this entire thing in Geneious. Normally people won't recommend commercial products but if you are new to command line it may be much easier to get this done in Geneious.

ADD REPLY • link 5.8 years ago by GenoMax 141k

0

Entering edit mode

hi I am doing the sorting of the bam file through:

samtools sort 103308_1116_S42_L007_R2_001.bam > 103308_1116_S42_L007_R2_001sorted.bam

but I keep getting this output:

head 103308_1116_S42_L007_R2_001sorted.bam

?BC?EN? ?0???~b????i?????%?:f?R? A??O6{??p??8?&?>?S?Sl?r??ԭ??X?~?0?v????/?u?_96???BCB=?}?d?uW?xw]?K?jʗ?o???뭊g???RTO?]?3f?t??C??c'?(xE?'$??tHA???D?????? EB|#"!?#AP??E> 8?s????V??[]?w?^????w?s?_~?^??^?????Q?7??????hDe6??r???G?gg??:??uP?ʸLfQ??I?G_5????B?wF?_?׫?jr???|????5?,????x?o??V????z?Z-????w?????xy???͏?????uW??G9??q|OM[/_??>??n??.?????G???'?0???^???/??7??s?>y??W?x|?fI???>?Qi? N??a??A?????´ ??}??|? z??w4???=??Fa9 ?(???4L{o??? ??w?4?'g?&L??B??P??C?du??P?G[???D???:e????q?2u.5u?@?o????qףu???R?;I?G?%? ????f?+?)?S?qp?^:??8?ԡwj?}??|GM]?-:dS?)N?????#??@T)k?$|d~????O[?~?????u?5?gi??"?^b?? *?M?Dy@?fT???9>7G???=?pr?H?3?n ??? S?O?÷x?C?2???>A????F?<l r????5y?%??4zr??="" ?^??????,??t?gyth??????jЯe?="" ???="" anybody="" knows="" why?="" i="" though="" you="" need="" to="" sort="" before="" doing="" the="" variant="" calling="" and="" getting="" the="" consensus="" sequence="" but="" this="" look="" really="" weird.<="" p="">

ADD REPLY • link 5.8 years ago by ieie ▴ 10

1

Entering edit mode

Of course. BAM is a binary (compressed) version of the SAM format. The default output of BWA is SAM.

bwa mem (options...) > out.sam
samtools sort out.sam -o out_sorted.bam

Or more elegant:

bwa mem (...) | samtools sort -o out_sorted.bam

ADD REPLY • link 5.8 years ago by ATpoint 81k

0

Entering edit mode

Right! I run the bwa mem and I got the SAM file, then I thought to convert it in BAM before sorting, thanks a lot!

ADD REPLY • link 5.8 years ago by ieie ▴ 10

1

Entering edit mode

Also, if you wish to visualise the content of a BAM file you can use "samtools view out.bam"

ADD REPLY • link 5.8 years ago by Corentin ▴ 600

0

Entering edit mode

about variant calling, after sorting I followed the pipeline of the page you gave me and I get a (consensus sequence with the reference genome header, don't know why) and when in the end I run :

cat ~/DNA-Mark/Guibourtia_leonensis_Cp.fas | bcftools consensus calls.vcf.gz > consensus.fa

I get this messages for different sites The site MG564755.1:122 overlaps with another variant, skipping...do you know whether is it an error and if yes what kind of error might be?

ADD REPLY • link 5.8 years ago by ieie ▴ 10

0

Entering edit mode

Yes, you're right (I was, wrongly, thinking about STAR, which produces by default a sam which contains only the aligned read)

ADD REPLY • link 5.8 years ago by Corentin ▴ 600

0

Entering edit mode

Really, STAR does that? Ok good to know, that means I will never use aligned BAM files from STAR to get back the original reads in fastq format. Thanks for letting me know :-D

ADD REPLY • link 5.8 years ago by ATpoint 81k

0

Entering edit mode

With STAR you can use the "--outReadsUnmapped" option to get the unmapped reads in a separate bam. I guess that you can get the original fastq by using both the "Aligned.out.sam" and the "unmapped.bam"

ADD REPLY • link 5.8 years ago by Corentin ▴ 600

score 2 · Answer 1 · 2018-07-18

For #1 there is no way to predict size of the BAM files (file sizes are never a good measure of anything other than a run failure).

when I try to convert the bam file into a fasta using samtools is there a way to have the reference genome on the top and then the reads? how should the fasta file look like in order to then use it to annotate the genomes for example in Geneious?

Fasta format is a single sequence by definition (for what it is worth). What you are referring to would be an aligned fasta format file, which is not something samtools is going to produce. @Pierre has a tool called SAM4WebLogo that will produce aligned fasta files. I am mentioning it here but it is probably not what you have in mind.

There are other ways of calling a consensus sequence using bcftools which will give you a single fasta sequence. That could be used for annotating in downstream analysis.