Question

Tophat : How Many Reads Are Aligned ?

0

Entering edit mode

11.9 years ago

Nicolas Rosewick 10k

Hi,

Simple question but I cannot figure how to do it : How many reads are aligned using tophat 2 ? with paired-end data.

Thanks,

N.

tophat read counts • 7.0k views

ADD COMMENT • link updated 10.3 years ago by Puriney ▴ 330 • written 11.9 years ago by Nicolas Rosewick 10k

0

Entering edit mode

in the output folder you should find the file align_summary.txt with this information

ADD REPLY • link 10.3 years ago by Martombo ★ 3.1k

score 5 · Answer 1 · 2012-05-29

5

Entering edit mode

11.9 years ago

David Langenberger 11k

If you want to distinguish between the left and the right mate of your reads, you can just count the mappings like this:

samtools view –f 0x40 –F 0x4 run.bam | cut -f1 | sort | uniq | wc -l (left mate)

and

samtools view –f 0x80 –F 0x4 run.bam | cut -f1 | sort | uniq  | wc -l (right mate)

-f 0x40 or -f 0x80 returns only the mappings of the left, or right mates. If the IDs for your left and right mates are the same, this is the only way to find out, which side you are counting.

-F 0x4 returns only mapped reads (most of the mapping tools return all reads, even they didn't find a mapping position).

ADD COMMENT • link 11.9 years ago by David Langenberger 11k

0

Entering edit mode

(all reads in the fastq file of the left mates) / (mapping left mates) = % of mapped left mates

(all reads in the fastq file of the right mates) / (mapping right mates) = % of mapped right mates

Remember: If only one mate mappes to the genome, it is called. Thus, it is important to distinguish the mates.

ADD REPLY • link 11.9 years ago by David Langenberger 11k

0

Entering edit mode

Hi David, samtools flagstat already provides this info. nevertheless, very useful.

ADD REPLY • link 11.9 years ago by Arun 2.4k

1

Entering edit mode

Well, if you allow multiple mappings of one read, samtools flagstat will count all x mappings and thus destroy your statistics! To make it correctly, you should sort and uniq your mapping IDs.

But if you do not allow multiple mappings, you are fine with flagstat. But then you loose all duplicated genes and run into problems with identical domains of some genes. That would then destroy your DiffExp analysis for some genes.

ADD REPLY • link 11.9 years ago by David Langenberger 11k

0

Entering edit mode

I have to agree, I assumed the bam file contains only unique reads. I have not known so far about a DE analysis performed on reads with multiple mapping. Curious, how do you decide which duplicated gene gets what read?

ADD REPLY • link 11.9 years ago by Arun 2.4k

0

Entering edit mode

;) You actually cannot decide where the read comes from. There are some ideas about using the context to make that decision... If there are in one gene no reads in the flanking of this read and the other there are reads, you can decide to choose the latter location as source of the read. But in the end... you don't know for sure.

ADD REPLY • link 11.9 years ago by David Langenberger 11k

0

Entering edit mode

:) So its a bias you essentially can't avoid. And since you are comparing the same entity for differential expression, be it genes or exons, you can safely assume the same amount of bias in other replicates and/or experimental conditions as well. In essence, given the demerits, and the fact that we compare same entities, I don't see the purpose of looking at multiple mapped reads.

ADD REPLY • link 11.9 years ago by Arun 2.4k

0

Entering edit mode

As long, as you are not interested in duplicated genes, or small ncRNAs, or repeat associated regions, you are fine with unique hits. It completely depends on your analysis and point of view. I think it is very important to keep them, since it gives you information... even you don't know how to handle them correctly... for now. But think about e.g. zinc finger proteins

ADD REPLY • link 11.9 years ago by David Langenberger 11k

0

Entering edit mode

I get your point and that at the end, it is very opinionated. However, read sizes obtained from sequencers will continue to increase and there will be less ambiguity over repetitive regions, if any, I believe (my opinion :D ). Already, for example, bowtie uses BWT for seed and extension using dynamic programming. Nevertheless, always good to know other perspectives and keep them in mind! Appreciate the conversation! :)

ADD REPLY • link 11.9 years ago by Arun 2.4k

score 1 · Answer 2 · 2012-05-29

1

Entering edit mode

11.9 years ago

Arun 2.4k

A very simple statistics on your reads that might be what you are looking for is samtools flagstat

ADD COMMENT • link 11.9 years ago by Arun 2.4k

score 0 · Answer 3 · 2014-01-16

#!/usr/bin/perl -w 
use strict; 
my $bam = shift; 
my $tmp = "/tmp/tmpNameList"; 
use List::Util qw(sum); 
`samtools view $bam | cut -f1 > $tmp`;
`wc -l $tmp`; 
open NAMES, $tmp; 
my %hash; 
while (<NAMES>) {
    chomp; 
    my $id = $_; 
    $hash{$id} = 1; 
}

my @values = values %hash; 
my $readsNum = sum @values; 
print "File: $bam\n"; 
print "Sum:  $readsNum\n";

Maybe you could just use the hash of perl, If the sort | uniq command is source consuming. Treat the id as the key, sign its corresponding value as 1. Iteratively reading the id list, the same id still get the same value in light of multiple alignment would yield many same ids. After the hash construction, simply sum up all the value would give you how many reads actually in bam.

I think samtools -c command would yield the alignment number, which is often confused with reads number. Once you allow multiple alignment during your reads alignment, the number by samtools -c would be greatly larger than the real reads number in a bam file.

Even the pari-end alignement file could be counted with this easy perl script, as long as the id of each pair are same, just like the output of STAR. In other words, in the fastq, you would see abcdefg/1 for read1, abcdef/2 for read2 for example, in STAR the /1 and /2 are erased, leaving abcdefg for the read id.