Question: Explanation Of Low Mapping Quality!
gravatar for khikho
7.5 years ago by
khikho100 wrote:

This is my pipeline for the human reads to map with human reference: (I use default script which is in BWA package)

#! /bin/bash -l
#index the reference
bwa index -a bwtsw human_g1k_v37.fasta &&
#aligne the file with reference
bwa aln human_g1k_v37.fasta Read.single.fastq > Read_out.sai &&
#Convert the alignment output to a SAM file
bwa samse human_g1k_v37.fasta Read_out.sai Read.single.fastq > Read_out.sam  &&
#extract the unmapped reads
awk '{if (substr($1,1,1)=="@") {print $0} else {if ($3!="*") {print $0}}}' Read_out.sam > out.sam
#convert to BAM file
samtools view -b -S Read_out.sam > Read_out.bam &&
#sort the BAM file
samtools sort Read_out.bam Read_out.bam.sorted &&
#index the BAM file
samtools index Read_out.bam.sorted.bam &&
#Delete redundant reads/duplicates
samtools rmdup Read_out.bam.sorted.bam Read_out_final.bam.sorted.bam &&
#computing Mapping Quality
samtools view Read_out_final.bam.sorted.bam -f human_g1k_v37.fasta | awk '{x+=$5}END{print x/NR}' > MAPQ.txt&&

But only 54,538 reads mapped with human reference and the mapped quality is 0.00140495 . I don't know what's wrong in pipeline to got this mapped quality.

Thanks for your explanation.

next-gen bwa genomics sequencing • 3.8k views
ADD COMMENTlink written 7.5 years ago by khikho100
gravatar for Zev.Kronenberg
7.5 years ago by
United States
Zev.Kronenberg11k wrote:

I would start by finding the answers to these questions:

  • What is the average quality over length for your reads? try fastQC

  • What is the distribution of read lengths? try fastQC

  • Are you grabbing the correct SAM lines? You can use samtools view F and f flags to replace the awk.

  • Use samtools tview to look at your alignment. Do they make sense?

ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by Zev.Kronenberg11k

This is fastqc output

ADD REPLYlink written 7.5 years ago by khikho100
gravatar for Swbarnes2
7.5 years ago by
Swbarnes21.5k wrote:

I'd pipe the samse step into samtools view. There's no need to make a .sam when samtools has all the tools to deal with much smaller .bam.

What if you forget the awk steps, and just run samtools flagstat on the bam, prior to sorting and duplicate removal? That will tell you how many reads mapped, and how many did not. I'd also be skeptical of the helpfulness of rmdup on single end reads. It's going to trim a lot of things away that aren't PCR duplicates.

But yeah, if all else fails, eyeball part of your your fastq. Are the quality scores good? Eyeball part of the sam file. Maybe your alignments are very non-unique, or maybe your awk statement isnt' quite right.

ADD COMMENTlink written 7.5 years ago by Swbarnes21.5k
gravatar for khikho
7.5 years ago by
khikho100 wrote:

@zev.kronenberg This is fastqc output.

[?][?][?] [?][?][?] [?][?][?]

And in the 1st awk command I grab 3rd column(Reference name) of the sam file which contain "*" instead of chromosome name and then count them by "wc -l" command and I think it should be correct, Shouldn't it?!

@swbarnes2 I completely agree with you about the helpfulness of rmdup but the number of mapped read which I said is before the rmdup step

ADD COMMENTlink written 7.5 years ago by khikho100

i dont know if this is characteristic of SOLiD reads but the quality scores seem really low to me (but i am using illumina). i remember reading somewhere that solid's colorspace quality was a bit diff. anyways i would recommend rerunning bwa without the && and instead using 2> and looking through the output

ADD REPLYlink written 7.5 years ago by Ying W3.9k

You should also go back and check that the solid qualities were converted correctly. If they were and the fastQC report is correct then you have a rough dataset, which is going to give you a poor alignment. looking at the mean quality distribution you only have ~20,000 'good' reads. I hope this isn't the case. you are correct about the awk. I am lazy and use the flags.

You should also look at the alignment.

Take my advice with a grain of salt: I haven't worked with solid data.


ADD REPLYlink written 7.5 years ago by Zev.Kronenberg11k

As I read on Seqanswer direct conversion from Colorspace to sequence-space is wasteful (because One error messes up all the remaining basepairs) so Do you know the software which works with Colorspace without conversion!?

ADD REPLYlink written 7.5 years ago by khikho100
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 756 users visited in the last hour