Question

Count antisense transcripts using STAR --quantMode

5.1 years ago

Bastien Hervé 5.3k

I have RNA-seq data from Mus musculus, prepared with TruSeq Stranded Total RNA and sequenced paired-end. I want to count reads over my genes according to gene orientation: if a gene is on the plus strand, reads on the forward strand falling within that gene should be counted, and if a gene is on the minus strand, reads on the reverse strand falling within that gene should be counted.

Let's take two examples from Ensembl:

Hmgb2 : ENSMUSG00000054717.7 chr8:57,511,907-57,515,999 forward strand http://www.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000054717;r=8:57511907-57515999
Ighm : ENSMUSG00000076617.9 chr12:113,418,558-113,422,730 reverse strand http://www.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000076617;r=12:113418558-113422730

I built my index from a GENCODE reference genome and aligned with STAR using --quantMode and a GENCODE annotation file. This gave me _Aligned.sortedByCoord.out.bam and _ReadsPerGene.out.tab.

Checking the strand of my two genes in my annotation file:

zgrep -i --color "Hmgb2" gencode.vM18.chr_patch_hapl_scaff_and_23_custom_igh_gff3sort.annotation.gtf.gz

#chr8   HAVANA  gene    57511907    57515999    .   +   .   gene_id "ENSMUSG00000054717.7"; gene_type "protein_coding"; gene_name "Hmgb2"; level 2; havana_gene "OTTMUSG00000060717.1";

zgrep -i --color "Ighm" gencode.vM18.chr_patch_hapl_scaff_and_23_custom_igh_gff3sort.annotation.gtf.gz

#chr12  HAVANA  gene    113418558   113422730   .   -   .   gene_id "ENSMUSG00000076617.9"; gene_type "IG_C_gene"; gene_name "Ighm"; level 2; havana_gene "OTTMUSG00000051485.2";

OK, good: Hmgb2 is on the plus strand and Ighm is on the minus strand.

Checking the read strand in IGV for these two genes:

[IGV screenshot: Hmgb2]

[IGV screenshot: Ighm]

Hmgb2 gets a lot of F2R1 read pairs (blue reads) and, conversely, Ighm gets a lot of F1R2 read pairs (red reads).

Note: since paired-end Illumina sequencing gives R1 forward and R2 reverse, I was expecting R1 to lead on the forward strand. That is not the case here (R2 leads on the forward strand), but that is beside the point.

From the STAR documentation, part 7, "Counting number of reads per gene":

STAR outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options:

column 1: gene ID

column 2: counts for unstranded RNA-seq

column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)

column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)

Select the output according to the strandedness of your data. Note, that if you have stranded data and choose one of the columns 3 or 4, the other column (4 or 3) will give you the count of antisense reads.

So, as I understand it, either the 3rd or the 4th column gives me my sense counts, and the other one gives me the antisense counts.

Fine. But I want to know whether STAR takes the gene strand into account. I found this thread (with $2 = column 2, $3 = column 3, $4 = column 4):

$2 is for unstranded hits, but those overlapping on opposite strand of features are considered ambiguous. $3 reports hits based on the strand you have given in your gff annotation, and $4 in the reverse direction of your features in gff (for PE-data the 5'3'-direction is also considered). Refer to -s option of htseq-count

So, I was expecting a high count for Hmgb2 in either column 3 or column 4, and a high count for Ighm in the other column.

Hmgb2 / ENSMUSG00000054717.7

grep "ENSMUSG00000054717.7" file_ReadsPerGene.out.tab

#ENSMUSG00000054717.7   3400    31  3369

Ighm / ENSMUSG00000076617.9

grep "ENSMUSG00000076617.9" file_ReadsPerGene.out.tab

#ENSMUSG00000076617.9   16063   11  16052

All my high counts are in the 4th column... Did I forget to tweak some options?
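One way to check this across all genes rather than two is to sum each column of the ReadsPerGene.out.tab file. Below is a minimal Python sketch, assuming the usual layout in which the first four rows are the N_unmapped, N_multimapping, N_noFeature and N_ambiguous summaries. The function names, the `ratio` threshold of 10 and the zeroed summary rows are illustrative assumptions, not part of STAR; the two gene rows are the ones shown above.

```python
import io

# Two gene rows copied from the grep output above. The N_* summary rows are
# placeholders set to zero (assumption: real files carry actual totals there).
EXAMPLE = (
    "N_unmapped\t0\t0\t0\n"
    "N_multimapping\t0\t0\t0\n"
    "N_noFeature\t0\t0\t0\n"
    "N_ambiguous\t0\t0\t0\n"
    "ENSMUSG00000054717.7\t3400\t31\t3369\n"
    "ENSMUSG00000076617.9\t16063\t11\t16052\n"
)


def column_totals(handle):
    """Sum columns 2-4 over gene rows, skipping the N_* summary rows."""
    totals = {"unstranded": 0, "col3": 0, "col4": 0}
    for line in handle:
        fields = line.split()
        if len(fields) != 4 or fields[0].startswith("N_"):
            continue
        totals["unstranded"] += int(fields[1])
        totals["col3"] += int(fields[2])
        totals["col4"] += int(fields[3])
    return totals


def guess_strandedness(totals, ratio=10):
    """Crude heuristic (assumption): stranded if one column dominates."""
    if totals["col4"] > ratio * totals["col3"]:
        return "reverse"      # column 4, htseq-count -s reverse
    if totals["col3"] > ratio * totals["col4"]:
        return "forward"      # column 3, htseq-count -s yes
    return "unstranded"       # column 2


totals = column_totals(io.StringIO(EXAMPLE))
print(totals, guess_strandedness(totals))
```

For a real run, the same function can be given an open file handle, e.g. `with open("file_ReadsPerGene.out.tab") as fh: column_totals(fh)`.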

STAR RNA-Seq --quantMode • 3.5k views


score 2 · Accepted Answer · 2019-03-06

5.1 years ago

benformatics 3.9k

Most standard stranded Illumina RNA-Seq protocols (e.g. TruSeq) sequence the first strand of the cDNA, which is generated by reverse transcribing the mRNA. This means that most of your "fragments" (i.e. reads) from a given feature are on the reverse strand relative to that feature.

In case that is not clear from the above: the counts in the "reverse" column are your actual feature counts.

See #6-#7 (the fragment being sequenced matches the DNA molecule generated in #2) in this image:

If you use something like Y-shaped adapters then your reads are generally not anti-sense (e.g. smallRNA kit).


Reply · Bastien Hervé 5.3k · 5.1 years ago

It does not bother me to have my counts in the antisense column, but I was expecting the counts for two genes on opposite strands to end up in different columns.

Unless a read is counted relative to the feature's strand? As in, a reverse-strand read in a minus-strand gene is counted as forward?

Reply · benformatics 3.9k · 5.1 years ago

Since not everyone does paired end, the first read is what is used for strand determination. Your fragment corresponds to the complement of your actual mRNA, so your first read will be anti-sense and your second read will be sense.

Making the call based on R1 is logical because for Illumina single-end sequencing you are just sequencing read #1 - so in all possible cases you have read #1.
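To make the rule above concrete, here is a minimal Python sketch of how a fragment ends up in column 3 or column 4, assuming the htseq-count convention quoted from the STAR documentation (-s yes: read 1 on the same strand as the gene; -s reverse: read 1 on the opposite strand). The function name `counted_column` is hypothetical, and ambiguous overlaps and multi-mappers are ignored.

```python
def counted_column(read1_strand, gene_strand):
    """Return the ReadsPerGene column (3 or 4) that counts this fragment.

    Column 3: read 1 aligns to the same strand as the annotated gene.
    Column 4: read 1 aligns to the opposite strand (so read 2 matches the gene).
    """
    for strand in (read1_strand, gene_strand):
        if strand not in ("+", "-"):
            raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return 3 if read1_strand == gene_strand else 4


# TruSeq Stranded: read 1 is antisense to the transcript.
print(counted_column("-", "+"))  # Hmgb2: plus-strand gene, read 1 on minus
print(counted_column("+", "-"))  # Ighm: minus-strand gene, read 1 on plus
```

Both calls return 4, because the comparison is always made relative to the gene's own strand. That is why a plus-strand gene and a minus-strand gene land in the same column.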

Reply · Bastien Hervé 5.3k · 5.1 years ago

I cannot see your pictures; please see: How to add images to a Biostars post

Reply · swbarnes2 14k · 5.1 years ago

Something is off about Biostars and images... I can see the images fine on my phone, but not on my work desktop.

Reply · benformatics 3.9k · 5.1 years ago

What about now?

Reply · GenoMax 141k · 5.1 years ago

All good now 👍

Reply · swbarnes2 14k · 5.1 years ago

Still no on my desktop...

Reply · Bastien Hervé 5.3k · 5.1 years ago

Happened to me yesterday

Bastien Hervé · Accepted Answer · 2019-03-06

5.1 years ago

Devon Ryan 104k

Most stranded libraries produced in the past ~7 years should correspond to the counts in the last column, wherein read 2's orientation matches that of the transcript. The description you linked to is wrong. Column 2 is for unstranded libraries, so the strand of a feature on the genome is ignored. Column 3 assumes that the orientation of the first read in a pair matches that of the originating transcript (this is rarely the case), and column 4 assumes the same for read 2. TruSeq kits use the standard method, so the last column is correct.
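As a rough illustration of the column choice described above, the following Python sketch splits one ReadsPerGene.out.tab row into sense and antisense counts. It assumes a library where read 2 matches the transcript (column 4 is sense), and relies on the STAR documentation's note that the other stranded column then holds the antisense reads. The function name and the `library` argument are hypothetical.

```python
def sense_antisense(row, library="reverse"):
    """Split one ReadsPerGene.out.tab gene row into (gene, sense, antisense)."""
    gene_id, _unstranded, col3, col4 = row.split()
    col3, col4 = int(col3), int(col4)
    if library == "reverse":   # read 2 matches the transcript
        return gene_id, col4, col3
    if library == "forward":   # read 1 matches the transcript
        return gene_id, col3, col4
    raise ValueError("library must be 'reverse' or 'forward'")


# Rows taken from the question.
for row in ("ENSMUSG00000054717.7\t3400\t31\t3369",
            "ENSMUSG00000076617.9\t16063\t11\t16052"):
    print(sense_antisense(row))
```

Antisense counts obtained this way are simply reads on the strand opposite the annotated gene; where genes overlap on opposite strands, some of those reads may belong to the other gene.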


Thanks Devon. If I'm interested in antisense transcripts, I need the 3rd column then?