Question: Does The New Tophat2 - Bowtie2 Pipeline Really Map 100% Of The Rna-Seq Reads ? Is This Real ?
3
gravatar for biorepine
6.4 years ago by
biorepine1.4k
Spain
biorepine1.4k wrote:

Hi guys, I have mapped illumina paired end RNA-Seq data with latest tophat2 and bowtie2. I have 2 simple questions.

  1. All the time the bam files reports 100% mapping (samtools flagstats). Does tophat+bowtie pipeline really mapped 100% of the reads ?
  2. In one sample properly paired reads percentage is around 82% and in another 25%. Does it mean in the second case mapping is bad ?

1st case

177722025 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
177722025 + 0 mapped (**100.00%**:-nan%)
177722025 + 0 paired in sequencing
90536514 + 0 read1
87185511 + 0 read2
146263928 + 0 properly paired (**82.30%**:-nan%)
166719032 + 0 with itself and mate mapped
11002993 + 0 singletons (6.19%:-nan%)
8545608 + 0 with mate mapped to a different chr
383634 + 0 with mate mapped to a different chr (mapQ>=5)

2nd case

36171595 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
36171595 + 0 mapped (**100.00%**:-nan%)
36171595 + 0 paired in sequencing
18565905 + 0 read1
17605690 + 0 read2
9163074 + 0 properly paired (**25.33%**:-nan%)
29079258 + 0 with itself and mate mapped
7092337 + 0 singletons (19.61%:-nan%)
4041802 + 0 with mate mapped to a different chr
139464 + 0 with mate mapped to a different chr (mapQ>=5)
[gbogu@pizza bam]$ samtools flagstat es_rep3.bam
ADD COMMENTlink modified 4.2 years ago by Biostar ♦♦ 20 • written 6.4 years ago by biorepine1.4k
2

TopHat writes the unmapped reads in a separate BAM file.

ADD REPLYlink written 6.4 years ago by JC7.8k

Maybe the pipeline does not write alignments for unmapped reads?

ADD REPLYlink written 6.4 years ago by jts240

@Jared : When I counted the no.of lines in right and left end fastq files, I had different numbers. In 2nd case for example, 153181608 lines were there. Does it mean my mapping percentage is only 23% (36171595/153181608) ? Is this expected ? Do you have any idea about general mapping percentage of paired end RNA-seq reads ?

ADD REPLYlink written 6.4 years ago by biorepine1.4k

I understood that. Thanx. I just checked ENCODE BAM files as well. If I count the number of mapped reads in BAM file by the total number of reads in FASTQ and multiply with 100, I am only getting ~30% mapping for almost all the files. Is this how one calculates mapping percentage ?

ADD REPLYlink written 6.4 years ago by biorepine1.4k
3

You should be extremely cautious with samtools flagstats. It does count mapping loci and not mappable mates. If one single mate maps to multiple loci on the genome, samtools falgstat counts it multiple times. A simple way to get the numer of mappable mates would be 'samtools view -f 0x40 file.bam | cut -f1 | sort | uniq | wc -l' for the first mate and 'samtools view -f 0x80 file.bam | cut -f1 | sort | uniq | wc -l' for the second mate. Sum up the two numbers and you end up with the number of mappable mates. When dividing this number with the number of input mates, you get your percentage of interest. To get the number of input mates, you can use 'cat input_1.fastq | paste - - - - | wc -l' and multiply it with 2, because of the second mate. Getting the number of mapped paired-mates is not so trivial... therefore I would recommend writing a small script.

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by David Langenberger8.8k

@David: You should have posted your comment as answer. thanx. What is the purpose of paste --- ? I can just use wc -l file.fastq ? Isn't the same ?

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by biorepine1.4k

Paste takes as many lines as you write the '-' character and returns them in one line (tab-sep). Basically you get one fastq-entry in one line, instead of four. Just try 'cat input.fastq | head' and 'cat input.fastq | paste - - - - | head' and see what I mean. When you do 'cat input.fastq | wc -l', you have to divide the number by four.

ADD REPLYlink written 6.4 years ago by David Langenberger8.8k

I use 'paste' a lot for modifying fastq-files in the command line. You don't need any counter to distinguish in what line you are (header, sequence, info, or quality), since they are just column 1-4 now.

ADD REPLYlink written 6.4 years ago by David Langenberger8.8k

what is the "new" Tophat2? O.,o

ADD REPLYlink written 5.6 years ago by cpcantalapiedra140

Doesn't align_summary.txt in the Tophat output file give this information? 

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by molla.linda60
2
gravatar for David Langenberger
6.4 years ago by
Deutschland
David Langenberger8.8k wrote:

Here again, as answer:

You should be extremely cautious with samtools flagstats. It does count mapping loci and not mappable mates. If one single mate maps to multiple loci on the genome, samtools falgstat counts it multiple times.

A simple way to get the numer of mappable mates would be

samtools view -f 0x40 file.bam | cut -f1 | sort | uniq | wc -l

for the first mate and

samtools view -f 0x80 file.bam | cut -f1 | sort | uniq | wc -l

for the second mate. Sum up the two numbers and you end up with the number of mappable mates. When dividing this number with the number of input mates, you get your percentage of interest.

To get the number of input mates, you can use

cat input_1.fastq | paste - - - - | wc -l

and multiply it with 2, because of the second mate.

Getting the number of mapped paired-mates is not so trivial... therefore I would recommend writing a small script.

ADD COMMENTlink written 6.4 years ago by David Langenberger8.8k

Could you please also add how to count mapped reads of single end reads as well ? thanx.

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by biorepine1.4k
2

samtools view file.bam | cut -f1 | sort | uniq | wc -l

ADD REPLYlink written 6.4 years ago by David Langenberger8.8k

For both, single- and paired-end data, you should also add -F 0x4 to assure that you do not count unmapped reads. Some tools also write unmapped reads in the bam file. The calls above work, since bowtie/tophat do not write them, so they work correct.... but perhaps some people want to use the calls with other mapping algorithms.

ADD REPLYlink written 6.4 years ago by David Langenberger8.8k

point taken. thnx again

ADD REPLYlink written 6.4 years ago by biorepine1.4k

0x40 is not working for hi-seq files but 0x80 is. It is throwing erro even after providing 80GB RAM. error - "sort: write failed: /tmp/5379634.1.main/sort09BLpa: No space left on device"

ADD REPLYlink written 6.4 years ago by biorepine1.4k

Yeah, for huge bam-files, the sort-uniq command might be a problem. You can do a 'samtools sort -n' on your bam file first and then by using 'samtools view -f 0x80' on your sorted file, count unique ids in the first column. This way you only need memory for the counter. Got the idea?

ADD REPLYlink written 6.4 years ago by David Langenberger8.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 976 users visited in the last hour