Question: STAR quantMode geneCounts: weird outcomes
13 months ago, davide.chiarugi wrote:

I ran STAR 2.5.0a on my bulk RNA-seq data, which were obtained with a single-end, stranded library preparation.

I set --quantMode GeneCounts to obtain the counts from the "embedded" htseq-count.

I obtained results like the following:


N_unmapped           146273    146273     146273
N_multimapping      3408293   3408293    3408293
N_noFeature          355858  17068060     392326
N_ambiguous         1189135     11003     513338
ENSG00000223972           0         0          0
ENSG00000227232           2         0          2
ENSG0000027826            0         0          0
ENSG00000243485           1         1          0

To my knowledge:

  • the second column gives the number of hits that would be obtained if the library prep. were not strand-specific (--stranded=no);

  • the third column gives the number of hits that would be obtained if the library prep. were strand-specific with the "stranded = yes" setting;

  • the fourth column gives the number of hits that would be obtained if the library prep. were strand-specific with the "stranded = reverse" setting.

Overall, the results I obtained point to a library preparation consistent with the "stranded = reverse" setting, which is perfectly fine.
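The reasoning behind that conclusion can be sketched in a few lines of Python. This is only a rough heuristic, not something STAR does itself: it assumes that the stranded column with the fewest N_noFeature reads is the one that matches the library. The names `parse_rows` and `guess_strandedness` are made up for this illustration; the numbers are the summary rows quoted above.

```python
# Summary rows copied from the STAR output quoted above.
# Columns: unstranded, stranded = yes, stranded = reverse.
SUMMARY = """\
N_unmapped           146273    146273     146273
N_multimapping      3408293   3408293    3408293
N_noFeature          355858  17068060     392326
N_ambiguous         1189135     11003     513338
"""


def parse_rows(text):
    """Map each row label to its three integer counts."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 4:
            rows[fields[0]] = tuple(int(v) for v in fields[1:])
    return rows


def guess_strandedness(rows):
    """Heuristic (an assumption, not a STAR feature): with the wrong
    strand setting most reads hit no feature, so pick the stranded
    column with the smaller N_noFeature count."""
    _, yes, reverse = rows["N_noFeature"]
    return "yes" if yes < reverse else "reverse"


rows = parse_rows(SUMMARY)
guess = guess_strandedness(rows)
print(guess)  # reverse
```

A truly unstranded library would show similar N_noFeature counts in both stranded columns, which this sketch does not try to detect.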

Inspecting the columns, I would expect the value in the second column to be the sum of the third and fourth columns, like this:


ENSG00000279457 17  0   17
ENSG00000248527 1260    1   1259

Here the second entry would indicate 1259 hits for the sense RNA and 1 hit for a possible antisense RNA (asRNA).

However, I also have entries like the following:


ENSG00000228794 126 0   129
ENSG00000187634 128 621 185
ENSG00000131584 205 15  205

How can I interpret such results?


Why is the outcome weird?

written 13 months ago by h.mon

I have just completed the post; it was submitted incomplete by accident.

written 13 months ago by davide.chiarugi

This would explain the cases in which you map on meta-features and/or have reads coming from sense and antisense RNAs that overlap each other. The GFF is unlikely to have overlapping features, because it is associated with the human genome. Moreover, I am mapping on features rather than meta-features.

In any case, overlapping features would not explain entries like:

ENSG00000228794 126 0 129

written 13 months ago by davide.chiarugi

The coordinates for ENSG00000228794: Chromosome 1: 825,138-859,446

The coordinates for ENSG00000225880: Chromosome 1: 826,206-827,522

They overlap and run in opposite directions, so you've likely got 3 reads that fall in the overlapping region. With an unstranded protocol, there's no way to know which gene they come from. When the software knows that reads must run in reverse, it knows they belong to ENSG00000228794.
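A toy simulation makes this concrete. It is a simplified sketch, not STAR's or htseq-count's actual code: each gene is treated as one continuous feature (exons are ignored), the gene strands are assumed for illustration (the point is only that they are opposite), and the read positions are invented. The function `assign` is a hypothetical helper that mimics union-style counting under the three strand settings.

```python
from collections import Counter

# Gene spans from the coordinates above. Strands are ASSUMED for
# illustration; all that matters here is that they are opposite.
GENES = {
    "ENSG00000228794": (825138, 859446, "+"),
    "ENSG00000225880": (826206, 827522, "-"),
}


def assign(read, mode):
    """Assign one read (start, end, strand) under a strand setting.

    mode is "no", "yes" or "reverse". A read overlapping more than
    one eligible gene is ambiguous and counts for neither.
    """
    start, end, strand = read
    hits = []
    for gene, (g_start, g_end, g_strand) in GENES.items():
        if end < g_start or start > g_end:
            continue  # no positional overlap
        if mode == "yes" and strand != g_strand:
            continue  # read must lie on the gene's strand
        if mode == "reverse" and strand == g_strand:
            continue  # read must lie on the opposite strand
        hits.append(gene)
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        return "ambiguous"
    return hits[0]


# Invented reads from a reverse-stranded library: 126 outside the
# overlapping region and 3 inside it, all on the "-" strand.
reads = [(830000, 830050, "-")] * 126 + [(826500, 826550, "-")] * 3

modes = ("no", "yes", "reverse")
counts = {m: Counter(assign(r, m) for r in reads) for m in modes}
row = tuple(counts[m]["ENSG00000228794"] for m in modes)
print(row)  # (126, 0, 129)
```

Under these assumptions the three reads in the overlap are ambiguous in the unstranded column but are assigned to ENSG00000228794 in the reverse column, which reproduces the 126 / 0 / 129 pattern.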

written 13 months ago by swbarnes2 (modified 13 months ago)

13 months ago, swbarnes2 (United States) wrote:

The columns don't add up because there are overlapping features in your gtf, so the aligner can't always unambiguously assign a read to those features.
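To see which genes are affected, one could flag every row where the unstranded count differs from the sum of the two stranded counts. The snippet below is a minimal sketch using only the example rows quoted in the question; `column_mismatches` is a name chosen for this illustration, and on a real file the four N_ summary rows would have to be skipped first.

```python
# Gene rows copied from the question.
# Columns: gene, unstranded, stranded = yes, stranded = reverse.
ROWS = """\
ENSG00000279457 17  0   17
ENSG00000248527 1260    1   1259
ENSG00000228794 126 0   129
ENSG00000187634 128 621 185
ENSG00000131584 205 15  205
"""


def column_mismatches(text):
    """Return (gene, difference) for each row where the unstranded
    count is not equal to yes + reverse."""
    mismatches = []
    for line in text.splitlines():
        gene, unstranded, yes, reverse = line.split()
        diff = int(unstranded) - (int(yes) + int(reverse))
        if diff != 0:
            mismatches.append((gene, diff))
    return mismatches


flagged = column_mismatches(ROWS)
for gene, diff in flagged:
    print(gene, diff)
```

A negative difference means that reads counted under a stranded setting were not counted in the unstranded column, which is what one would expect when they fall in a region shared with a gene on the opposite strand.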
