Alignment with BWA and SAMtools
0
0
Entering edit mode
4.6 years ago
Shelle ▴ 30

I am new to BWA and samtools and would appreciate any help regarding my question.

With BWA i did alignment for a fastq file along with the reference. The output of BWA was given to SAMtools to process with commands like view, sort, index and again view to see the sam file at the end. reviewing sam file at the end i noticed most of values in the "POS" field has zero value meaning so many matches was happening at zero location. I am wondering what would be the cause of this behavior? it seems very strange to me...

thanks in advance for any help

BWA SAM-file SAMtools • 12k views
1
Entering edit mode

Hello Shelle,

please post all commands you have used. Also the output of samtools flagstat your.bam will be useful.

Thanks.

fin swimmer

1
Entering edit mode

Hi finswimmer

bwa index ref.fa

samtools view -S -b .sam -o .bam
samtools sort  .bam  sortedbam
samtools index  sortedbam.bam
samtools view  sortedbam.bam  -o  sorted.sam


I looked at the "POS" field of "sorted.sam" file and noticed most of their value is zero. And i am still wondering why?

1
Entering edit mode

A filename with just an extension and no name is a little bit strange but might work. (Under linux this files would be hidden)

Yes, the output of samtools flagstat sorted.bam please.

BTW:

bwa mem ref.fa read1.fq read2.fq > .sam
samtools view -S -b .sam -o .bam


You can do it like this:

bwa mem ref.fa read1.fq read2.fq | samtools view -o output.bam -


This streams the output from bwa directly to samtools. Never version of samtools recognize the fileextension when using -o and convert to bam if necessary.

If you have enough cores you can even directly sort and and convert to bam:

bwa mem ref.fa read1.fq read2.fq | samtools sort -o sortedbam.bam -


Furthermore there is no need to do this, if you just want to have a look what's in the bam file:

samtools view  sortedbam.bam  -o  sorted.sam


samtools view -h sortedbam.bam|less


Then you can scroll through the output.

0
Entering edit mode

Ok so thanks for your comment. I have used the whole name for each file in bash. I just summarized it here. Sorry for the confusion.

I just did this samtools view sortedbam.bam -o sorted.sam since i wanted to work with sam file later on.

And sorry i haven't used this command samtools flagstat sorted.bam before. So I'm not so familiar with what type of information it might provide and how useful those info can be. I now applied it to output of file that i got from samtools sort and i obtained the below info:

903224 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
36678 + 0 mapped (4.06%:-nan%)
903224 + 0 paired in sequencing
34198 + 0 properly paired (3.79%:-nan%)
34380 + 0 with itself and mate mapped
2298 + 0 singletons (0.25%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

0
Entering edit mode
36678 + 0 mapped (4.06%:-nan%)


So question is now: Why?

How was the data generated? Did you already have done some quality checking of your data, e.g. using fastqc? Did you take the correct reference genome for mapping? ...

fin swimmer

0
Entering edit mode

Thanks for raising some questions. I will go back and answer these questions. But my initial question was why i have so many matches at zero locations? i can not find any relation between 4% mapped reads and so many matches at zero location. Maybe i have lack of knowledge that's why i am asking!

1
Entering edit mode

i can not find any relation between 4% mapped reads and so many matches at zero location.

You have so many matches at zero location because these reads couldn't properly mapped. But to get sure please post some of these sam lines.

0
Entering edit mode

Hi finswimmer

Below is part of the very last sam file that i am getting. I can't post all of it as it is a huge file.

SOLEXA3_0008:5:100:15178:17570  73      NZ_PNFW01000001.1       1       60      20S80M  =       1       0       CAGAAATAGATCAAGAGCCTAAAAAGAAAAGAGCTGGAAGAAAATATATTCCGCCAATGTCACATCCGTGGAAACGTGAGTCATTTTTAAGGCAACAAGC    GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFDGGGGFGGGGGGFGGGGGGGFGGEGFFGGFEDGGGEGEEGEEFCFDFCECEEGGDEBEAGEEEEAEE    NM:i:0  MD:Z:80 AS:i:80 XS:i:0
SOLEXA3_0008:5:100:15178:17570  133     NZ_PNFW01000001.1       1       0       *       =       1       0       CAAATCTTTTAACTCATCAGTTAACGCATCTAGTTCTGCAATCTGGTGATTAAAATATTTATTAAGAATAACAACTGTAGCAATTGCAGCAATTATAGCA    GGGGGGGGGGFGGGFGGGGGGGEGGGGGGGGGGGFGGGGGDDGGGEEFGGGGGFGADGDGGGEGFGGGGGGGGGGGEEGFGFDBFDGFEGFG=FGFGADG    AS:i:0  XS:i:0
SOLEXA3_0008:5:104:19719:9322   99      NZ_PNFW01000001.1       1       46      80S20M  =       86      178     CTACAGTAGATGATCATGTTTATTTACTCAAAGAGATTCCTAAAAATGCAAAAATATCTCCAGAAATAGATCAAGAGCCTAAAAAGAAAAGAGCTGGAAG    .>7@@-CCC-DA=DB.>=?>?C6;<:ACAA95<<-=ACCCBD?55>A?:BD=BDD@<D:D-6>;6:B=>:23:9=?7:)=>=4+=A<=:A@?BA5C>A-C    NM:i:0  MD:Z:20 AS:i:20 XS:i:0
SOLEXA3_0008:5:107:19550:13828  69      NZ_PNFW01000001.1       1       0       *       =       1       0       TGCGGTATATGGCAACTTTCTAGACTTTATTCTATTATAATTTGAATAGTGCAGGTCAGAAATTTTGCGAACCTAATTCATAACAAAGAGGTGTCCTTTA    ACCC?BEDEEBEBBEFFFFFE?AC:F:FFDFE?BFEEEEEFEEF=EEEBB=DDDBB=EDEFFFFFAFFDE:FEFF?FAF:EB?C?BB==4@2A@CBBECB    AS:i:0  XS:i:0
SOLEXA3_0008:5:107:19550:13828  137     NZ_PNFW01000001.1       1       60      25S75M  =       1       0       ATCTCCAGAAATAGATCAAGAGCCTAAAAAGAAAAGAGCTGGAAGAAAATATATTCCGCCAATGTCACATCCGTGGAAACGTGAGTCATTTTTAAGGCAC    ?BEDEDFAFFFFBAADF=EEF=FDFFEDFFDCCDDCAF=AFFDFFFCEFDEFFABBC?CEDDFFE?EEDEFBAAFFBAEBFBDBFDDFFFBF3DE5-:@%    NM:i:1  MD:Z:74A0       AS:i:74 XS:i:0
SOLEXA3_0008:5:109:6612:16424   99      NZ_PNFW01000001.1       1       60      65S35M  =       65      164     ATGTTTATTTACTCAAAGAGATTCCTAAAAATGCAAAAATATCTCCAGAAATAGGTCAAGAGCCTAAAAAGAAAAGAGCTGGCAGAAAATATATTCCGCC    =66,/2730+.763<:952/929;A99899A?A5A041>5*A;;=2A:</380A+?%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%    NM:i:1  MD:Z:17A17      AS:i:30 XS:i:0
SOLEXA3_0008:5:110:5832:7148    163     NZ_PNFW01000001.1       1       60      74S26M  =       129     176     TAGATGATCATGTTTATTTACTCAAAGAGATTCCTAAAAATGCAAAAATATCTCCAGAAATAGATCAAGAGCCTAAAAAGAAAAGAGCTGGAAGAAAATA    GDGGFGGGEGGGGGGFGGGGGGGGGFFGGGGGGGGGFGGGFGGEGGGGGGGGGGGFEGG?GFGGFGDBGGGDGDGGEFFGAGDGGFEEFFFGGAEF?CGG    NM:i:0  MD:Z:26 AS:i:26 XS:i:0
SOLEXA3_0008:5:111:5580:16198   99      NZ_PNFW01000001.1       1       60      26S74M  =       72      171     TATCTCCAGAAATAGATCAAGAGCCTAAAAAGAAAAGAGCTGGAAGAAAATATATTCCGCCAATGTCACATCCGTGGAAACGTGAGTCATTTTTAAGGCA    GGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGFGGGDGFGGGGGGGGDEGGDGGGFFDFDGGGEFGGBGF=EFFF?EECAB=BEFEFDEE?EF    NM:i:0  MD:Z:74 AS:i:74 XS:i:0

0
Entering edit mode

Hello Shelle,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

In this example you have just two reads that are unmapped (the second an the 4th). You can identify them by the flags in column two. Also there mapping quality (column 4) are set to 0. bwa set there RNAME (column 3) to the same as there mate and POS (column 4) to 1.

What is weird is that even those read with very good mapping quality are mapped to the same position and all startes at POS 1. But they have a high number of soft clipped bases and doesn't look similar. I guess there is something wrong with your reference. Are there multiple sequences in the reference? Do they accidentally have all the same name?

fin swimmer

0
Entering edit mode

Hi finswimmer

The ref was multi sequence but i did separate the largest strain and ran bwa and got the sam file at the final stage which i also shared with you. i also checked the name of sequences and their name is different. I am putting two of them randomly in here:

>NZ_PNFW01000001.1 Lactobacillus gasseri strain UMB0045b .16933_8_18.1, whole genome shotgun sequence

>NZ_PNFW01000010.1 Lactobacillus gasseri strain UMB0045b .16933_8_18.10, whole genome shotgun sequence


Not sure what causes lots of problems in terms of alignment and sam file.

0
Entering edit mode

Hello Shelle,

the name of the sequence is only everything until this first whitespace. The part after is handled as an description. So the two header you are showing, have the same sequence name and a different description. I guess this is a problem. Rename the names to something unique and try again.

fin swimmer

0
Entering edit mode

Hi finswimmer,

sorry i really didn't get what you said. "NZ_PNFW01000001.1" and "NZ_PNFW01000010.1" are just different. Plus As i mentioned before, this problem still exists assuming i separated the largest strain for the reference so it is not multi sequence at all and there is only one sequence in the ref file.

0
Entering edit mode

Oops, sorry. My mistake. I've missread it. To less coffe this morning.

0
Entering edit mode

Since your file is sorted, the alignments at the beginning will all start at the first base (and they'll tend to be soft-clipped due to your genome being circular). Have a look further down in the BAM file or in IGV and you'll likely see reads elsewhere as well.

0
Entering edit mode

Thanks for comment Devon Ryan. I am able to see reads are mapped elsewhere. But my initial question is why so matches happening at zero location in the final sam file? it is a very strange behavior!

0
Entering edit mode

Hello Shello,

now we are back to the questions I asked at the very beginning. And: What kind of sample is it?

0
Entering edit mode

Sorry to get into the post, but this is actually a problem that I am also facing. If the quality checking was done and reference is correct (double-checked), is there a way to increase the percentage of reads that get mapped?

0
Entering edit mode

Hello ieie,

I think it is better if you open your own thread. Otherwise it's getting confusing here.

fin swimmer