Question

STAR align quantMode


23 months ago

a_bis ▴ 40

Hi, I'm using STAR to align RNA-seq data to mm39. I am using --quantMode GeneCounts, and the gene identifiers in the results are in the 'ENSMUSG' format, which is very impractical for my purpose. Is there a way to get gene names instead of the Ensembl gene IDs?

Additionally, I'm confused as to why I get three columns of counts for each gene (as shown below). I'm only aligning one forward and one reverse fastq file, so shouldn't I be getting one set of counts per gene?

N_unmapped 16802670 16802670 16802670

N_multimapping 4055291 4055291 4055291

N_noFeature 22948357 58749681 25405703

N_ambiguous 3018494 37339 1274777

ENSMUSG00000102628 0 0 0

ENSMUSG00000100595 0 0 0

ENSMUSG00000097426 0 0 0

ENSMUSG00000104478 0 0 0

ENSMUSG00000104385 0 0 0

ENSMUSG00000086053 21 25 0

If it helps, the code I used to index the genome and align my fastq files is the following:

STAR --runMode genomeGenerate --genomeDir mm39index --genomeFastaFiles /path/to/file/Mus_musculus.GRCm39.dna.primary_assembly.fa --sjdbGTFfile /path/to/file/Mus_musculus.GRCm39.104.gtf --runThreadN 16 


STAR --runThreadN 16 --genomeDir path/to/mm39index --readFilesIn blah_1.fastq.gz blah_2.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFileNamePrefix alignments/blah-alignment

Thanks in advance for the help!

gene genecounts counts quantmode star • 1.1k views


23 months ago

binodregmi30 ▴ 10

HTSeq could be another option for counting reads per feature, rather than using quant mode in STAR. I tried quant mode before but ended up using HTSeq. The HTSeq outputs can easily be combined into a count matrix and fed to an analysis pipeline.
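As a rough illustration of the "combine into a count matrix" step, here is a minimal Python sketch. It assumes each per-sample file is the usual two-column, tab-separated htseq-count output, with summary rows whose names start with "__" (for example __no_feature). The function names, sample names, gene names, and counts are all made up for the example.

```python
import csv
import io


def read_htseq_counts(handle):
    """Read one htseq-count output stream into {gene_id: count}."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and the "__" summary rows at the end of the file.
        if len(row) < 2 or row[0].startswith("__"):
            continue
        counts[row[0]] = int(row[-1])
    return counts


def build_count_matrix(samples):
    """Combine {sample_name: {gene_id: count}} into a header and rows.

    Genes missing from a sample are filled with 0.
    """
    sample_names = sorted(samples)
    genes = sorted(set().union(*(samples[name] for name in sample_names)))
    header = ["gene_id"] + sample_names
    rows = [[gene] + [samples[name].get(gene, 0) for name in sample_names]
            for gene in genes]
    return header, rows


# Made-up example data standing in for two per-sample output files.
SAMPLE_A = "geneA\t5\ngeneB\t2\n__no_feature\t10\n"
SAMPLE_B = "geneB\t7\n__no_feature\t3\n"

samples = {
    "sample_a": read_htseq_counts(io.StringIO(SAMPLE_A)),
    "sample_b": read_htseq_counts(io.StringIO(SAMPLE_B)),
}
header, rows = build_count_matrix(samples)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```

With real files you would pass open file handles instead of the io.StringIO objects.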


Thank you for the tip, I will keep this in mind too!

ADD REPLY • link 23 months ago by a_bis ▴ 40

score 3 · Accepted Answer · 2022-05-13


23 months ago

rpolicastro 13k

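The three count columns correspond to different library strandedness types. Per the STAR manual, the order is: unstranded, stranded (read 1 on the same strand as the RNA, like htseq-count -s yes), and reverse stranded (like htseq-count -s reverse). Use the column that matches your library prep.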
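To make the column choice concrete, here is a small Python sketch that reads the ReadsPerGene.out.tab layout and guesses which column fits. It assumes the column order documented in the STAR manual (gene ID, unstranded, forward stranded, reverse stranded). The helper names and the 0.8 tolerance are my own assumptions, and comparing N_noFeature between the two stranded columns is only a rough heuristic, not something STAR does for you. The sample rows are the ones posted in the question.

```python
import csv
import io

# Count columns 2-4 of ReadsPerGene.out.tab, in the order given by the STAR manual.
COLUMNS = ("unstranded", "forward", "reverse")


def read_star_counts(handle):
    """Split a ReadsPerGene.out.tab stream into (summary rows, gene rows)."""
    summary, genes = {}, {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4:
            continue
        counts = dict(zip(COLUMNS, (int(value) for value in row[1:4])))
        # The first four rows (N_unmapped, N_noFeature, ...) are summaries.
        target = summary if row[0].startswith("N_") else genes
        target[row[0]] = counts
    return summary, genes


def guess_strandedness(summary, tolerance=0.8):
    """Heuristic: the stranded column with far fewer N_noFeature reads wins.

    If the two stranded columns are similar (ratio >= tolerance, an
    arbitrary cutoff), assume the library is unstranded.
    """
    fwd = summary["N_noFeature"]["forward"]
    rev = summary["N_noFeature"]["reverse"]
    if max(fwd, rev) == 0 or min(fwd, rev) / max(fwd, rev) >= tolerance:
        return "unstranded"
    return "forward" if fwd < rev else "reverse"


# Rows copied from the question, joined with tabs as in the real file.
SAMPLE = "\n".join("\t".join(fields) for fields in [
    ("N_unmapped", "16802670", "16802670", "16802670"),
    ("N_multimapping", "4055291", "4055291", "4055291"),
    ("N_noFeature", "22948357", "58749681", "25405703"),
    ("N_ambiguous", "3018494", "37339", "1274777"),
    ("ENSMUSG00000102628", "0", "0", "0"),
    ("ENSMUSG00000086053", "21", "25", "0"),
])

summary, genes = read_star_counts(io.StringIO(SAMPLE))
choice = guess_strandedness(summary)
print(choice)
print({gene: counts[choice] for gene, counts in genes.items()})
```

For the numbers in the question, the forward-stranded column leaves far more reads in N_noFeature than the reverse-stranded one, so this heuristic picks the last column. It is still worth confirming against the library prep kit's documentation.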

It's recommended to keep them as gene IDs and then merge in a list of gene names afterwards, because gene IDs are unique but gene names aren't. You can get an ID-to-name mapping from the GTF file you used, or from a database like BioMart.
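For the GTF route, a minimal Python sketch could look like the following. It assumes an Ensembl-style GTF where the ninth column of each "gene" row carries gene_id "..." and (usually) gene_name "..." attributes. Genes without a gene_name fall back to their ID. The function name and the two example rows are invented for illustration and are not real annotation entries.

```python
import re

GENE_ID = re.compile(r'\bgene_id "([^"]+)"')
GENE_NAME = re.compile(r'\bgene_name "([^"]+)"')


def gene_names_from_gtf(lines):
    """Build {gene_id: gene_name} from the 'gene' rows of a GTF file."""
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only gene-level rows are needed
        gene_id = GENE_ID.search(fields[8])
        if gene_id is None:
            continue
        gene_name = GENE_NAME.search(fields[8])
        # Some genes have no name; keep the ID so nothing is dropped.
        mapping[gene_id.group(1)] = (
            gene_name.group(1) if gene_name else gene_id.group(1)
        )
    return mapping


# Invented rows in Ensembl GTF layout (placeholder IDs and names).
EXAMPLE_GTF = "\n".join([
    "#!genome-build example",
    '1\thavana\tgene\t100\t900\t.\t+\t.\t'
    'gene_id "EXAMPLE_ID_1"; gene_version "1"; gene_name "GeneA"; gene_biotype "lncRNA";',
    '1\thavana\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "EXAMPLE_ID_1"; transcript_id "EXAMPLE_TX_1"; gene_name "GeneA";',
    '1\tensembl\tgene\t2000\t2500\t.\t-\t.\t'
    'gene_id "EXAMPLE_ID_2"; gene_version "1"; gene_biotype "TEC";',
])

names = gene_names_from_gtf(EXAMPLE_GTF.splitlines())
for gene_id in ("EXAMPLE_ID_1", "EXAMPLE_ID_2", "NOT_IN_GTF"):
    # Unknown IDs are left as-is rather than dropped.
    print(gene_id, names.get(gene_id, gene_id))
```

With a real annotation you would pass an open file handle, e.g. the same GTF given to --sjdbGTFfile, and keep the gene ID column alongside the name so duplicate names stay distinguishable.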

If all you care about is quantification, Salmon is usually recommended these days. It tends to give more accurate quantification because it estimates abundances at the transcript level, which allows corrections for things like transcript length bias.