I have got some RNAseq data from Mus Musculus using TruSeq Stranded Total RNA in paired-end. I want to count my reads over my genes regarding to the gene orientation. As if a gene is plus strand, reads on forward strand falling into my plus gene area should be counted and if a gene is minus strand, reads on reverse strand falling into my minus gene area should be counted.
Let's take two examples from Ensembl :
I created my index with a gencode reference genome and I did my alignments using STAR with the
--quantMode and an annotation file from gencode. I've got my
Check the strand of my 2 genes into my annotation file :
zgrep -i --color "Hmgb2" gencode.vM18.chr_patch_hapl_scaff_and_23_custom_igh_gff3sort.annotation.gtf.gz #chr8 HAVANA gene 57511907 57515999 . + . gene_id "ENSMUSG00000054717.7"; gene_type "protein_coding"; gene_name "Hmgb2"; level 2; havana_gene "OTTMUSG00000060717.1"; zgrep -i --color "Ighm" gencode.vM18.chr_patch_hapl_scaff_and_23_custom_igh_gff3sort.annotation.gtf.gz #chr12 HAVANA gene 113418558 113422730 . - . gene_id "ENSMUSG00000076617.9"; gene_type "IG_C_gene"; gene_name "Ighm"; level 2; havana_gene "OTTMUSG00000051485.2";
Ok, good I have
Hmgb2 plus strand and
Ighm minus strand.
Check the read strand under IGV for these two genes :
Hmgb2 gets a lot of paired-read F2R1 (blue reads) and vice-versa
Ighm gets a lot of paired-read F1R2 (red reads)
Note : As paired-end Illumina sequencing is R1 forward and R2 reverse, I was expecting R1 to lead the forward strand, which is not my case but anyway, it is not the point, R2 is leading the forward strand.
Reading the documentation of STAR, part
7.Counting number of reads per gene.
STAR outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options:
- column 1: gene ID
- column 2: counts for unstranded RNA-seq
- column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
- column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
Select the output according to the strandedness of your data. Note, that if you have stranded data and choose one of the columns 3 or 4, the other column (4 or 3) will give you the count of antisense reads.
So, what I get is that, either the 3rd column give me my sense count or antisense, and the 4th will give me the opposite strand result.
Fine. But I want to know if STAR take the gene strand into account. I've found this threa thread (with $2 = column2, $3 = column3, $4 = column4)
$2 is for unstranded hits, but those overlapping on opposite strand of features are considered ambiguous. $3 reports hits based on the strand you have given in your gff annotation, and $4 in the reverse direction of your features in gff (for PE-data the 5'3'-direction is also considered). Refer to -s option of htseq-count
So, I was expecting to get high count of
Hmgb2 in either column 3 or 4 and high count of
Ighm in the other column
grep "ENSMUSG00000054717.7" file_ReadsPerGene.out.tab #ENSMUSG00000054717.7 3400 31 3369
grep "ENSMUSG00000076617.9" file_ReadsPerGene.out.tab #ENSMUSG00000076617.9 16063 11 16052
All my high counts are in the 4th column... Did I forgot to tweak some options ?