Question

Explanation Of Low Mapping Quality!

1

Entering edit mode

12.2 years ago

khikho ▴ 100

This is my pipeline for the human reads to map with human reference: (I use default solid2fastq.pl script which is in BWA package)

#! /bin/bash -l
#index the reference
bwa index -a bwtsw human_g1k_v37.fasta &&
#aligne the file with reference
bwa aln human_g1k_v37.fasta Read.single.fastq > Read_out.sai &&
#Convert the alignment output to a SAM file
bwa samse human_g1k_v37.fasta Read_out.sai Read.single.fastq > Read_out.sam  &&
#extract the unmapped reads
awk '{if (substr($1,1,1)=="@") {print $0} else {if ($3!="*") {print $0}}}' Read_out.sam > out.sam
#convert to BAM file
samtools view -b -S Read_out.sam > Read_out.bam &&
#sort the BAM file
samtools sort Read_out.bam Read_out.bam.sorted &&
#index the BAM file
samtools index Read_out.bam.sorted.bam &&
#Delete redundant reads/duplicates
samtools rmdup Read_out.bam.sorted.bam Read_out_final.bam.sorted.bam &&
#computing Mapping Quality
samtools view Read_out_final.bam.sorted.bam -f human_g1k_v37.fasta | awk '{x+=$5}END{print x/NR}' > MAPQ.txt&&

But only 54,538 reads mapped with human reference and the mapped quality is 0.00140495 . I don't know what's wrong in pipeline to got this mapped quality.

Thanks for your explanation.

bwa next-gen sequencing genomics • 5.9k views

ADD COMMENT • link 12.2 years ago by khikho ▴ 100

score 1 · Answer 1 · 2012-02-07

1

Entering edit mode

12.2 years ago

Zev.Kronenberg 12k

I would start by finding the answers to these questions:

What is the average quality over length for your reads? try fastQC
What is the distribution of read lengths? try fastQC
Are you grabbing the correct SAM lines? You can use samtools view F and f flags to replace the awk.
Use samtools tview to look at your alignment. Do they make sense?

ADD COMMENT • link 12.2 years ago by Zev.Kronenberg 12k

0

Entering edit mode

This is fastqc output

ADD REPLY • link 12.2 years ago by khikho ▴ 100

score 1 · Answer 2 · 2012-02-07

I'd pipe the samse step into samtools view. There's no need to make a .sam when samtools has all the tools to deal with much smaller .bam.

What if you forget the awk steps, and just run samtools flagstat on the bam, prior to sorting and duplicate removal? That will tell you how many reads mapped, and how many did not. I'd also be skeptical of the helpfulness of rmdup on single end reads. It's going to trim a lot of things away that aren't PCR duplicates.

But yeah, if all else fails, eyeball part of your your fastq. Are the quality scores good? Eyeball part of the sam file. Maybe your alignments are very non-unique, or maybe your awk statement isnt' quite right.

score 0 · Answer 3 · 2012-02-09

0

Entering edit mode

12.2 years ago

khikho ▴ 100

@zev.kronenberg This is fastqc output.

[?][?][?] [?][?][?] [?][?][?]

And in the 1st awk command I grab 3rd column(Reference name) of the sam file which contain "*" instead of chromosome name and then count them by "wc -l" command and I think it should be correct, Shouldn't it?!

@swbarnes2 I completely agree with you about the helpfulness of rmdup but the number of mapped read which I said is before the rmdup step

ADD COMMENT • link 12.2 years ago by khikho ▴ 100

0

Entering edit mode

i dont know if this is characteristic of SOLiD reads but the quality scores seem really low to me (but i am using illumina). i remember reading somewhere that solid's colorspace quality was a bit diff. anyways i would recommend rerunning bwa without the && and instead using 2> and looking through the output

ADD REPLY • link 12.2 years ago by Ying W ★ 4.2k

0

Entering edit mode

You should also go back and check that the solid qualities were converted correctly. If they were and the fastQC report is correct then you have a rough dataset, which is going to give you a poor alignment. looking at the mean quality distribution you only have ~20,000 'good' reads. I hope this isn't the case. you are correct about the awk. I am lazy and use the flags.

You should also look at the alignment.

Take my advice with a grain of salt: I haven't worked with solid data.

GOOD LUCK!

ADD REPLY • link 12.2 years ago by Zev.Kronenberg 12k

0

Entering edit mode

As I read on Seqanswer direct conversion from Colorspace to sequence-space is wasteful (because One error messes up all the remaining basepairs) so Do you know the software which works with Colorspace without conversion!?

ADD REPLY • link 12.2 years ago by khikho ▴ 100