For Capture And Pair-End Sequecing, How Many Fastq Reads Are Dumped During Mapping By Bwa?
3
1
Entering edit mode
10.2 years ago
PeterPan ▴ 30

hi, recently I am working on some targeted sequencing data. The captured region is about 2Mb of human genome. And it's sequenced by Illumina pair-end Hiseq. I use bwa to do mapping, with default parameter. But I found that half of my fastq reads are dumped after I got sam file outputed by bwa. And those sam file are raw sam file, which do not filtered by library size.

Does this mean many of my fastq reads are mapped to more than one genomic position and thus are dumped?

I am wondering whether the proportion of dumped reads are so high?

Thanks very much!

bwa sam filter illumina • 5.4k views
0
Entering edit mode

This is not a direct answer to your question, but you can do some basic counts using samtools

3. number of uniquely mapped reads: samtools view -S -F0x4 -c -q 1 reads.sam

0
Entering edit mode

Thanks gorge. I will try samtools to count reads. Actually, I counted fastq file and sam file by C. The fastq file line * 1/4 would be the fastq reads count, and the sam file line *1/2 would be the mapped reads count right?

0
Entering edit mode

Hi, George. I use gatk to dedup, indel realign and table recalibaration. After that, I summarize unique reads of bam file using samtools, it's about 118M. Does that sounds good? And, I found that although I used gatk to deduplicate, the output still has 10% of duplicates by samtools. So what's the problem here? Thanks

0
Entering edit mode

Also, typically bwa would output reads that mapped to multiple positions. When you do the mapping, do you use the whole human genome or just the part that you think you have captured? If so, is it possible some of the reads fall outside the 2Mb region?

0
Entering edit mode

I used the whole genome to do mapping. Yes, definitely some reads fall outside 2Mb reagion. Often we capture most of the region we want, and also some other region randomly maybe.

0
Entering edit mode

A capture efficiency of 40-60% is not unexpected, at least in my experience.

0
Entering edit mode

Thanks Davis! After filtering reads by library size and deduplicate, I got about an efficiency of 22%. Is this acceptable?

0
Entering edit mode

Can you post the samtools flagstats output?

0
Entering edit mode

Thanks Rm! I am trying samtools. I used C to count reads myself~

0
Entering edit mode

hi, I use gatk to dedup, indel realign and table recalibaration. After that, I summarize unique reads of bam file using samtools, it's about 118M.

2
Entering edit mode
10.2 years ago

bwa should not dump reads. It should put all the unmapped reads in the .bam, and it should not dump reads that align to more than one place either. bwa won't care whether your sample is capture or not, either.

What was your command line? Are you aligning to the whole genome, or just the captured region? You really should always align to the whole genome. You can filter out the reads that hit your targeted region afterwards with BEDTools.

And yes, 40-60% aligning to target is normal

0
Entering edit mode

Thanks swbarnes2! I use the whole genome to do mapping. and my command lines are: $bwadir/bwa aln$refdir/humang1kv37.fasta $fqdir/fwd/SLPE-93f.fq >$saif $bwadir/bwa aln$refdir/humang1kv37.fasta $rev >$sair $bwadir/bwa sampe$refdir/humang1kv37.fasta $saif$sair $fqdir/fwd/SLPE-93f.fq$rev > $sam ADD REPLY 2 Entering edit mode samtools was written to deal with sam files, it would be easier to use that than to reinvent the wheel in C. Samtools also naturally understands the flags. One thing, .sam files are huge, you really shouldn't make them for big projects. instead of >$sam, you should do :

| samtools view -bSh - > out.bam

Then use samtools flagstat and samtools idxstats to assess the stats. Use

samtools view chr4:1000-1000 out.bam > region.sam

If you really want to look at the raw sam file for a small region.

Also, bwa will take gzipped fastq files, so you might as well gzip your fastqs, if you aren't already compressing them.

0
Entering edit mode

That's really useful, swbarnes! I'll use fastq.gz from now on. It can save a lot of hard disk space. But I use sam file because I used C to filter sam file by library size. I can't handle bam file directly. So is there a direct way to filter bam files, example by samtools? Thanks.

0
Entering edit mode

Instead of opening the .sam file, pipe the output of samtools view into your script.

I use Perl, and I do "open IN, "samtools view -out.bam | " when I want to run a sam file through a script.

Or, could you use a one-liner in awk or grep instead, and keep it all in the command line? If you are just filtering by the value in the insert size column, I bet you could.

0
Entering edit mode

Thanks swbarnes. Acturally I had tried to pipe out by "samtools view *.bam |", but I got stuck there. My C code crashed when receiving samtools pipe out... Besides, I can't handle awk or perl very well.

2
Entering edit mode
10.2 years ago

It hasn't been mentioned, so I'll mention it for completeness:

samtools flagstat BAMFILENAME

0
Entering edit mode

Thanks Davis. Is "flagstat" a command in samtools? I haven't find it in samtools.1

0
Entering edit mode

I've tried. That's great! Thanks Davis! Samtools should not hide this useful command!

1
Entering edit mode
10.2 years ago
KCC ★ 4.0k

Thanks gorge. I will try samtools to count reads. Actually, I counted fastq file and sam file by C. The fastq file line * 1/4 would be the fastq reads count, and the sam file line *1/2 would be the mapped reads count right?

I have never done this kind of analysis. However, I have been using bwa for a year now. Although I know about samtools, I often use python or C++ to analyze my SAM files directly. It's just faster to count everything I want in one pass rather than running samtools many times.

If you use bwa then everything you need should be in the SAM file. There should be a line for all your reads both mapped and unmapped. You don't need the FASTQ file to figure out how many reads you have total or which ones are mapped.

It makes a lot of sense to just use the samtool commands that I suggested, just to take out the unknown factor of whether your C code is correct.

Note that to count which reads are mapped, you have to look at the second column of the SAM file, the number there should NOT have the third bit set. In C, this would be something like:

  if (!(flag&4)) mapped_count++;


If you are not doing something like this then your counts might not be right.

1
Entering edit mode

Thanks Gorge. I think I made a mistake in my C code. Using samtools view as you had suggested, I calculated map ratio. It's about 95%

0
Entering edit mode

@PeterPan: Do you know python? This is all a lot easier in python.

import sys

infile = open(sys.argv[1],'r')

total = 0
mapped =0

for line in infile:
if line[0] != "@":
data = line.strip().split('\t')
flag = int(data[1])

total+=1

if flag & 4:
continue

mapped+=1

print 100*float(mapped)/total,'% mapped'


Sorry, didn't have the stamina to write this in C.

Save this to a file. Lets say your file was named CountMapped.py, then you type "python CountMapped.py File.sam" where File.sam is the actual name of your SAM file. (This all assumes you have python on your computer.)

0
Entering edit mode

Thanks, George. What does "if flag & 4:" mean? "flag = int(data[1])" means the flag is the first element of a line?

0
Entering edit mode

@PeterPan: Python arrays start at zero just like in C. So, data[0] is the first and data[1] is the second element of the line. Four is 100 in binary, so flag & 4 is true if the third bit of flag is set and false otherwise. The third bit says whether the sequence is mapped or not. If set, it means unmapped. (Incidentally flag & 16 tells you which strand the read is mapped to.)

0
Entering edit mode

OK, got it. Thanks. I did not understand flag before, so I filted the 9th. column of sam file according to library size. Maybe I should use flag later.