Entering edit mode
4.8 years ago
Shelle ▴ 30
I am new to BWA and samtools and would appreciate any help regarding my question.
With BWA i did alignment for a fastq file along with the reference. The output of BWA was given to SAMtools to process with commands like view, sort, index and again view to see the sam file at the end. reviewing sam file at the end i noticed most of values in the "POS" field has zero value meaning so many matches was happening at zero location. I am wondering what would be the cause of this behavior? it seems very strange to me...
thanks in advance for any help
please post all commands you have used. Also the output of
samtools flagstat your.bamwill be useful.
I looked at the "POS" field of "sorted.sam" file and noticed most of their value is zero. And i am still wondering why?
let me know if you need more info
Thanks for adding this information.
A filename with just an extension and no name is a little bit strange but might work. (Under linux this files would be hidden)
Yes, the output of
samtools flagstat sorted.bamplease.
You can do it like this:
This streams the output from
samtools. Never version of
samtoolsrecognize the fileextension when using
-oand convert to
If you have enough cores you can even directly sort and and convert to bam:
Furthermore there is no need to do this, if you just want to have a look what's in the
Do this instead:
Then you can scroll through the output.
Ok so thanks for your comment. I have used the whole name for each file in bash. I just summarized it here. Sorry for the confusion.
I just did this
samtools view sortedbam.bam -o sorted.samsince i wanted to work with sam file later on.
And sorry i haven't used this command
samtools flagstat sorted.bambefore. So I'm not so familiar with what type of information it might provide and how useful those info can be. I now applied it to output of file that i got from
samtools sortand i obtained the below info:
Here you have your problem. Just 4% out of your reads could be mapped.
So question is now: Why?
How was the data generated? Did you already have done some quality checking of your data, e.g. using fastqc? Did you take the correct reference genome for mapping? ...
Thanks for raising some questions. I will go back and answer these questions. But my initial question was why i have so many matches at zero locations? i can not find any relation between 4% mapped reads and so many matches at zero location. Maybe i have lack of knowledge that's why i am asking!
You have so many matches at zero location because these reads couldn't properly mapped. But to get sure please post some of these sam lines.
Below is part of the very last sam file that i am getting. I can't post all of it as it is a huge file.
Please use the formatting bar (especially the
codeoption) to present your post better. I've done it for you this time.
In this example you have just two reads that are unmapped (the second an the 4th). You can identify them by the flags in column two. Also there mapping quality (column 4) are set to 0.
RNAME(column 3) to the same as there mate and
POS(column 4) to 1.
What is weird is that even those read with very good mapping quality are mapped to the same position and all startes at POS 1. But they have a high number of soft clipped bases and doesn't look similar. I guess there is something wrong with your reference. Are there multiple sequences in the reference? Do they accidentally have all the same name?
The ref was multi sequence but i did separate the largest strain and ran bwa and got the sam file at the final stage which i also shared with you. i also checked the name of sequences and their name is different. I am putting two of them randomly in here:
Not sure what causes lots of problems in terms of alignment and sam file.
the name of the sequence is only everything until this first whitespace. The part after is handled as an description.
So the two header you are showing, have the same sequence name and a different description. I guess this is a problem. Rename the names to something unique and try again.
sorry i really didn't get what you said. "NZ_PNFW01000001.1" and "NZ_PNFW01000010.1" are just different. Plus As i mentioned before, this problem still exists assuming i separated the largest strain for the reference so it is not multi sequence at all and there is only one sequence in the ref file.
Oops, sorry. My mistake. I've missread it. To less coffe this morning.
Since your file is sorted, the alignments at the beginning will all start at the first base (and they'll tend to be soft-clipped due to your genome being circular). Have a look further down in the BAM file or in IGV and you'll likely see reads elsewhere as well.
Thanks for comment Devon Ryan. I am able to see reads are mapped elsewhere. But my initial question is why so matches happening at zero location in the final sam file? it is a very strange behavior!
now we are back to the questions I asked at the very beginning. And: What kind of sample is it?
Sorry to get into the post, but this is actually a problem that I am also facing. If the quality checking was done and reference is correct (double-checked), is there a way to increase the percentage of reads that get mapped?
I think it is better if you open your own thread. Otherwise it's getting confusing here.