Does The New Tophat2 - Bowtie2 Pipeline Really Map 100% Of The Rna-Seq Reads ? Is This Real ?
1
3
Entering edit mode
9.4 years ago
biorepine ★ 1.5k

Hi guys, I have mapped illumina paired end RNA-Seq data with latest tophat2 and bowtie2. I have 2 simple questions.

1. All the time the bam files reports 100% mapping (samtools flagstats). Does tophat+bowtie pipeline really mapped 100% of the reads ?
2. In one sample properly paired reads percentage is around 82% and in another 25%. Does it mean in the second case mapping is bad ?

1st case

177722025 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
177722025 + 0 mapped (**100.00%**:-nan%)
177722025 + 0 paired in sequencing
146263928 + 0 properly paired (**82.30%**:-nan%)
166719032 + 0 with itself and mate mapped
11002993 + 0 singletons (6.19%:-nan%)
8545608 + 0 with mate mapped to a different chr
383634 + 0 with mate mapped to a different chr (mapQ>=5)


2nd case

36171595 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
36171595 + 0 mapped (**100.00%**:-nan%)
36171595 + 0 paired in sequencing
9163074 + 0 properly paired (**25.33%**:-nan%)
29079258 + 0 with itself and mate mapped
7092337 + 0 singletons (19.61%:-nan%)
4041802 + 0 with mate mapped to a different chr
139464 + 0 with mate mapped to a different chr (mapQ>=5)
[gbogu@pizza bam]\$ samtools flagstat es_rep3.bam

rna-seq samtools mapping tophat2 bowtie2 • 5.8k views
2
Entering edit mode

TopHat writes the unmapped reads in a separate BAM file.

0
Entering edit mode

Maybe the pipeline does not write alignments for unmapped reads?

0
Entering edit mode

@Jared : When I counted the no.of lines in right and left end fastq files, I had different numbers. In 2nd case for example, 153181608 lines were there. Does it mean my mapping percentage is only 23% (36171595/153181608) ? Is this expected ? Do you have any idea about general mapping percentage of paired end RNA-seq reads ?

0
Entering edit mode

I understood that. Thanx. I just checked ENCODE BAM files as well. If I count the number of mapped reads in BAM file by the total number of reads in FASTQ and multiply with 100, I am only getting ~30% mapping for almost all the files. Is this how one calculates mapping percentage ?

3
Entering edit mode

You should be extremely cautious with samtools flagstats. It does count mapping loci and not mappable mates. If one single mate maps to multiple loci on the genome, samtools falgstat counts it multiple times. A simple way to get the numer of mappable mates would be 'samtools view -f 0x40 file.bam | cut -f1 | sort | uniq | wc -l' for the first mate and 'samtools view -f 0x80 file.bam | cut -f1 | sort | uniq | wc -l' for the second mate. Sum up the two numbers and you end up with the number of mappable mates. When dividing this number with the number of input mates, you get your percentage of interest. To get the number of input mates, you can use 'cat input_1.fastq | paste - - - - | wc -l' and multiply it with 2, because of the second mate. Getting the number of mapped paired-mates is not so trivial... therefore I would recommend writing a small script.

0
Entering edit mode

@David: You should have posted your comment as answer. thanx. What is the purpose of paste --- ? I can just use wc -l file.fastq ? Isn't the same ?

0
Entering edit mode

Paste takes as many lines as you write the '-' character and returns them in one line (tab-sep). Basically you get one fastq-entry in one line, instead of four. Just try 'cat input.fastq | head' and 'cat input.fastq | paste - - - - | head' and see what I mean. When you do 'cat input.fastq | wc -l', you have to divide the number by four.

0
Entering edit mode

I use 'paste' a lot for modifying fastq-files in the command line. You don't need any counter to distinguish in what line you are (header, sequence, info, or quality), since they are just column 1-4 now.

0
Entering edit mode

what is the "new" Tophat2? O.,o

0
Entering edit mode

Doesn't align_summary.txt in the Tophat output file give this information?

2
Entering edit mode
9.4 years ago

You should be extremely cautious with samtools flagstats. It does count mapping loci and not mappable mates. If one single mate maps to multiple loci on the genome, samtools falgstat counts it multiple times.

A simple way to get the numer of mappable mates would be

samtools view -f 0x40 file.bam | cut -f1 | sort | uniq | wc -l


for the first mate and

samtools view -f 0x80 file.bam | cut -f1 | sort | uniq | wc -l


for the second mate. Sum up the two numbers and you end up with the number of mappable mates. When dividing this number with the number of input mates, you get your percentage of interest.

To get the number of input mates, you can use

cat input_1.fastq | paste - - - - | wc -l


and multiply it with 2, because of the second mate.

Getting the number of mapped paired-mates is not so trivial... therefore I would recommend writing a small script.

0
Entering edit mode

2
Entering edit mode

samtools view file.bam | cut -f1 | sort | uniq | wc -l

0
Entering edit mode

For both, single- and paired-end data, you should also add -F 0x4 to assure that you do not count unmapped reads. Some tools also write unmapped reads in the bam file. The calls above work, since bowtie/tophat do not write them, so they work correct.... but perhaps some people want to use the calls with other mapping algorithms.

0
Entering edit mode

point taken. thnx again

0
Entering edit mode

0x40 is not working for hi-seq files but 0x80 is. It is throwing erro even after providing 80GB RAM. error - "sort: write failed: /tmp/5379634.1.main/sort09BLpa: No space left on device"

0
Entering edit mode

Yeah, for huge bam-files, the sort-uniq command might be a problem. You can do a 'samtools sort -n' on your bam file first and then by using 'samtools view -f 0x80' on your sorted file, count unique ids in the first column. This way you only need memory for the counter. Got the idea?