Question

STAR mapping - regarding output files content

0

8 months ago

Manko47 • 0

Hello, I have two questions about some of the outputs STAR produced for my RNA-seq experiment, particularly the .sequenceReadsPerGene.out file and the .sequenceLog file.

I'm summarising statistics on mapped, multimapped, and unmapped reads. I took the reported numbers of:

N_unmapped,

N_multimapping,

N_noFeature,

N_ambiguous

from the sequenceReadsPerGene.out file (using only the rightmost column, since this is a reverse-stranded experiment), as well as the uniquely mapped reads from the .sequenceLog file. However, if I add them together, I end up well above 100% of mapped reads. Am I correct in assuming that some reads belong to more than one of those categories (for example, uniquely mapped and noFeature)? And in general, what does STAR mean by the noFeature category?

What is the exact difference between a mapped read and an alignment in STAR? I created the gene count matrix for differential gene expression analysis using both featureCounts and the sequenceReadsPerGene.out file straight from STAR (this is a single-end experiment, and the results are identical). However, the total number of alignments reported in the log files is well above the total number of reads. Am I correct in assuming this is because STAR counts every read that maps to multiple places in the reference genome as multiple alignments? So if 30% of my reads map to multiple loci and I used the default parameters (so only reads mapping to more than 10 loci are discarded as mapped to too many loci), they can generate a huge number of alignments, because a read that maps to 5 places is counted as 5 alignments? I also assume that only the uniquely mapped reads are assigned to genes (and not all reads).

STAR RNA-seq mapping • 1.2k views

ADD COMMENT • link updated 8 months ago by rfran010 ▴ 900 • written 8 months ago by Manko47 • 0

1

Are you saying the sum of the unmapped, multimapped, noFeature, and ambiguous counts is greater than the reported number of uniquely mapped reads? That would make sense, since unmapped reads are counted in addition to mapped reads.

If you have a multimapping read that maps to five locations, then yes, there will be five alignments. However, this won't be reflected in the *ReadsPerGene.out file, since it assigns the multimapping read to N_multimapping (so it counts only once).

But the total number of alignments in your BAM file would be greater than the total number of mapped reads.
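To make the read-versus-alignment distinction concrete, here is a minimal Python sketch. The read names and loci counts are made-up toy values, not STAR output; the point is only that each read is counted once at the read level, while a multimapper contributes one alignment record per locus.

```python
# Toy data (hypothetical): read name -> number of loci the read aligned to.
loci_per_read = {
    "read1": 1,  # uniquely mapped
    "read2": 1,  # uniquely mapped
    "read3": 5,  # multimapper, five alignment records
    "read4": 2,  # multimapper, two alignment records
    "read5": 0,  # unmapped
}


def summarise(loci):
    """Return read-level and alignment-level totals for the toy data."""
    counts = list(loci.values())
    return {
        "input_reads": len(counts),
        "unique": sum(1 for n in counts if n == 1),
        "multi": sum(1 for n in counts if n > 1),
        "unmapped": sum(1 for n in counts if n == 0),
        # Every locus of every read is a separate alignment record.
        "alignments": sum(counts),
    }


summary = summarise(loci_per_read)
print(summary)
```

In this toy case four reads are mapped, but they produce nine alignments, so the alignment total exceeds the read total as soon as any multimappers are present.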

To understand the noFeature category, you need to understand what the ReadsPerGene.out file reports. It simply counts the reads that overlap features in a user-supplied annotation file, usually a GTF file with gene annotations. If a read maps to an intergenic region, it does not overlap any feature, so it falls into the noFeature category.
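The overlap logic can be sketched in a few lines of Python. This is a simplified illustration, not STAR's implementation: the function `classify_read`, the gene names, and the coordinates are all hypothetical, and the sketch ignores strand, splicing, and exon structure. It assumes the htseq-count-like rule that a read is assigned to a gene only when it overlaps exactly one gene.

```python
def classify_read(read_start, read_end, genes):
    """Classify one uniquely mapped read against gene intervals.

    genes maps gene_id -> (start, end), inclusive coordinates on a single
    chromosome. Strand and exon structure are deliberately ignored here.
    """
    hits = [
        gene_id
        for gene_id, (start, end) in genes.items()
        if read_start <= end and read_end >= start  # intervals overlap
    ]
    if not hits:
        return "N_noFeature"  # e.g. an intergenic read
    if len(hits) > 1:
        return "N_ambiguous"  # overlaps more than one gene
    return hits[0]


# Hypothetical annotation with two partially overlapping genes.
genes = {"geneA": (100, 500), "geneB": (450, 900)}

print(classify_read(120, 170, genes))    # overlaps geneA only
print(classify_read(460, 510, genes))    # overlaps both genes
print(classify_read(1000, 1050, genes))  # overlaps nothing
```

Note that every read passed to this function is already uniquely mapped, which is why noFeature and ambiguous reads are a subset of the uniquely mapped reads rather than a separate group.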

ADD REPLY • link 8 months ago by rfran010 ▴ 900

0

Thank you. The answer to my second question is exactly what I hoped for, so that one is closed.

As for the first question, not exactly. I'm saying that if I sum the unmapped, multimapped, noFeature, and ambiguous counts together with the uniquely mapped reads, the result is higher than the total number of input reads. I'm adding photos with the exact counts. Is that fine? I'm assuming it's because some of the noFeature and ambiguous reads also belong to the multimapped or uniquely mapped category.

P.S. As I mentioned, I only counted the rightmost column from the second photo, since the data are reverse-stranded.

ADD REPLY • link 8 months ago by Manko47 • 0

2

You need to understand that the gene count evaluation isn't really being done by STAR's aligner. It's an add-on algorithm (first seen in htseq-count) that runs after the alignment. So there is no reason to think that taking some numbers from one and some numbers from the other will add up to anything meaningful.

Specifically, when the uniquely mapped reads are counted, STAR has no idea whether they are assignable to genes or not. Surely some of them are noFeature or ambiguous.

The better thing to check is whether the columns of the htseq-count-style output each add up to what they should.

ADD REPLY • link 8 months ago by swbarnes2 14k

1

Yes, that would be correct: some of the uniquely mapped reads are counted as noFeature or ambiguous.

So by adding those categories to the uniquely mapped reads, you are double counting them. Basically, if you sum a column over all rows except N_unmapped and N_multimapping (that is, all genes plus N_noFeature and N_ambiguous), you get the uniquely mapped reads number. That number plus N_unmapped and N_multimapping equals the total number of input reads.
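The arithmetic above can be checked with a short Python sketch. It assumes the usual four-column, tab-separated ReadsPerGene.out layout (ID, unstranded, forward-stranded, reverse-stranded counts) with the four `N_` summary rows at the top. The function name `summarise_reads_per_gene` and the table contents are made up for illustration; to run it on real output, pass an open file handle instead of the `io.StringIO` toy table.

```python
import io

SPECIAL = ("N_unmapped", "N_multimapping", "N_noFeature", "N_ambiguous")


def summarise_reads_per_gene(handle, column=3):
    """Sum one count column of a ReadsPerGene.out-style table.

    column: 1 = unstranded, 2 = forward-stranded, 3 = reverse-stranded
    (assumed layout; check it against your own file).
    """
    totals = dict.fromkeys(SPECIAL, 0)
    assigned = 0
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        count = int(fields[column])
        if fields[0] in totals:
            totals[fields[0]] = count
        else:
            assigned += count  # an ordinary gene row
    totals["assigned_to_genes"] = assigned
    # Uniquely mapped = gene-assigned + noFeature + ambiguous.
    totals["uniquely_mapped"] = (
        assigned + totals["N_noFeature"] + totals["N_ambiguous"]
    )
    # Input reads = uniquely mapped + unmapped + multimapping.
    totals["input_reads"] = (
        totals["uniquely_mapped"]
        + totals["N_unmapped"]
        + totals["N_multimapping"]
    )
    return totals


# Toy table with made-up numbers (not real STAR output).
toy = io.StringIO(
    "N_unmapped\t100\t100\t100\n"
    "N_multimapping\t300\t300\t300\n"
    "N_noFeature\t50\t400\t60\n"
    "N_ambiguous\t20\t5\t15\n"
    "geneA\t200\t10\t210\n"
    "geneB\t330\t185\t315\n"
)
totals = summarise_reads_per_gene(toy, column=3)
print(totals)
```

If the relationship described above holds for your data, `uniquely_mapped` should match the uniquely mapped reads figure in the STAR log and `input_reads` should match the number of input reads, whichever of the three count columns you choose.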

ADD REPLY • link 8 months ago by rfran010 ▴ 900

0

Manko47: Please avoid posting screenshots. Using the 101010 button lets you post data as code, which keeps its formatting.

ADD REPLY • link 8 months ago by GenoMax 141k