Question: Query Regarding Rna-Seq Hit Count
3
gravatar for yvs.0815
6.6 years ago by
yvs.081530
yvs.081530 wrote:

Hi everyone!

I am running an RNA-Seq experiment and was wondering if anyone could help me with a particular query i had.

So my RNA-Seq sample had 29,775,906 reads.

I ran this sample through tophat (no coverage search, unstranded, w/bowtie2). I checked the logs folder in the output and got the following stats :-

[2013-04-04 10:57:02] Preparing reads

     left reads: min. length=50, max. length=50, 29767088 kept reads (8818 discarded)

Now, 29767088 + 8818 = 29,775,906 . So until now it makes sense.

From the bowtie files i got the following information :-

In,bowtie.left_kept_reads.log

29767088 (100.00%) were unpaired; of these:

4316521 (14.50%) aligned 0 times

17118264 (57.51%) aligned exactly 1 time

8332303 (27.99%) aligned >1 times

In,bowtie.left_kept_reads_seg1.log

4316521 (100.00%) were unpaired; of these:

3198237 (74.09%) aligned 0 times

1037571 (24.04%) aligned exactly 1 time

80713 (1.87%) aligned >1 times

In bowtie.left_kept_reads_seg2.log

4316521 (100.00%) were unpaired; of these:

3177510 (73.61%) aligned 0 times

1069186 (24.77%) aligned exactly 1 time

69825 (1.62%) aligned >1 times

So at this point , correct me if i am wrong, we can conclude :-

total reads aligned 1 time = 17118264 + 1069186 + 1037571 = 19225021 total reads aligned >1 time = 8332303 + 80713 + 69825 = 8482841

But then i converted the accepted_hits.bam file in a .sam file. On this .sam file I did the following two operations :-

awk '{print $1}' 11129_nc_u.sam | sort | uniq -u | wc -l

gave me 21638486 (i.e total number of uniques)

and

awk '{print $1}' 11129_nc_u.sam | sort | uniq -d | wc -l

gave me 5519946 (i.e. total duplicates.)

Now why is the number of uniques from the .sam folder (21638486 ) greater than 19225021 (i.e. obtained from the logs folder)? In the same breath, why is the number of duplicates from .sam folder (5519946) less than the number in the logs folder (8482841) ?

Can someone please clarify this point for me?

Thanks in Advance :)

reads tophat rna-seq bowtie • 1.8k views
ADD COMMENTlink modified 6.6 years ago by vodka80 • written 6.6 years ago by yvs.081530
0
gravatar for vodka
6.6 years ago by
vodka80
Torino
vodka80 wrote:

Check here: http://seqanswers.com/forums/showpost.php?p=72778&postcount=4 As long as the two seg? logs derive from portions of you reads I believe that summing those numbers is wrong.

ADD COMMENTlink written 6.6 years ago by vodka80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1197 users visited in the last hour