Question

featureCounts in predicting strand specificity

2

Entering edit mode

7.3 years ago

vrehaman ▴ 30

Dear Team,

I am working on RNASeq data which is generated using Truseq straded HT kit.

I have done removing adapters and aligned reads using STAR 2 pass mode.

I am running feature counts (subread 1.5.1) using -s option three times 0,1 and 2. More reads were mapped to features when I specified -s as "0". But rseqc infer it as reverse stranded. I ran htseq-count (with default parameters) and it is also giving the strand as reverse. Here are the commands which I used:

featureCounts -t exon -s 1 -T 12 -g gene_id -a genes.gtf -o featurecounts.txt sample.starAligned.sortedByCoord.out.bam sum of counts : 34133722

featureCounts -t exon -s 2 -T 12 -g gene_id -a genes.gtf -o featurecounts.txt sample.starAligned.sortedByCoord.out.bam sum of counts : 34029447

featureCounts -t exon -s 0 -T 12 -g gene_id -a genes.gtf -o featurecounts.txt sample.starAligned.sortedByCoord.out.bam sum of counts : 63157326

Am I using parameters correctly? And summing up counts on 7th column of featurecounts output.Is this right way to predict strand?

Please let me know your suggestions.

Thanks In Advance Fazulur Rehaman

rna-seq • 5.5k views

ADD COMMENT • link updated 7.3 years ago by igor 13k • written 7.3 years ago by vrehaman ▴ 30

GenoMax · Answer 1 · 2018-04-04

6

Entering edit mode

7.3 years ago

igor 13k

Often, not using strand info will yield slightly more counts than using it since you are also counting reads on the wrong strand. Thus, it's not surprising that you are seeing more with 0 than 2.

In your particular case, it looks like you are seeing roughly the same number of reads on the forward and reverse strands. This may not be a stranded library or you have high DNA contamination.

You can also refer to this strandness cheatsheet which may help you to understand the situation better: https://github.com/igordot/genomics/blob/master/notes/rna-seq-strand.md

ADD COMMENT • link 7.3 years ago by igor 13k

0

Entering edit mode

Hi igor,

Thanks a lot for your quick response.

You are right. Even I am sure that if it is stranded library, we should not get same count on both strands. In htseq-count also I got the numbers like below -s reverse 31842773 -s no 31610810 -s yes 2515562

But, I enquire our lab people who did sequencing and they have used dUTP method which is also mentioned in the link you shared.

Here is the result from rseqc infer_experiment.py on the sample which I performed featureCounts.

Fraction of reads explained by "1++,1--,2+-,2-+": 0.0568 Fraction of reads explained by "1+-,1-+,2++,2--": 0.9432 Fraction of reads explained by other combinations: 0.0000

And I observed preQC as well as in postQC, %GC is also little high (53).

If there is DNA contamination, what steps do I perform. Please help me to resolve this.

Thanks In Advance Fazulur Rehaman

ADD REPLY • link 7.3 years ago by vrehaman ▴ 30

0

Entering edit mode

Your htseq-count numbers are very different from your featureCounts. That is odd. They should be very similar (within 1%).

I am not familiar with rseqc, but it should be able to calculate how mapped reads were distributed over genome feature (like CDS exon, 5’UTR exon, 3’ UTR exon, Intron, Intergenic regions). If you have DNA contamination, those metrics would be similar.

ADD REPLY • link 7.3 years ago by igor 13k

0

Entering edit mode

Hi igor,

Thanks for your suggestions.

I am not sure why both tools were not giving similar results. I am using defaults. I will have a look into the same again. And sure, I will check rseqc metrics for any DNA contamination.

And also, I observed in featureCounts results, 20-30% of reads were not assigned due to mapped to multiple loci.

Please suggest me to resolve this.

Thanks & Regards Fazulur Rehaman

ADD REPLY • link 7.3 years ago by vrehaman ▴ 30

0

Entering edit mode

Hi igor,

As per your suggestion, I ran read_distribution.py on the same bam I am getting the output as below.

Total Reads                   42284852
Total Tags                    56689812
Total Assigned Tags           55703211
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb             
CDS_Exons           132048396           43289678            327.83            
5'UTR_Exons         0                   0                   0.00              
3'UTR_Exons         0                   0                   0.00              
Introns             1506563580          11389165            7.56              
TSS_up_1kb          33920391            35678               1.05              
TSS_up_5kb          150689832           123314              0.82              
TSS_up_10kb         268882891           209181              0.78              
TES_down_1kb        35777780            269422              7.53              
TES_down_5kb        154599566           682186              4.41              
TES_down_10kb       271774632           815187              3.00

=====================================================================

From above stats, I observed I am not getting similar metrics.

And also I ran bam_stat.py to know the spliced reads.

Total records:                          51540061

QC failed:                              0
Optical/PCR duplicate:                  0
Non primary hits                        7200705
Unmapped reads:                         2054504
mapq < mapq_cut (non-unique):           4465105

mapq >= mapq_cut (unique):              37819747
Read-1:                                 18909947
Read-2:                                 18909800
Reads map to '+':                       18909880
Reads map to '-':                       18909867
Non-splice reads:                       28164649
Splice reads:                           9655098
Reads mapped in proper pairs:           37819596
Proper-paired reads map to different chrom:0

I observed reads map to both '+' and '-" are almost same.

Could you please help me to resolve strand issue.

Thanks In Advance Fazulur Rehaman

ADD REPLY • link updated 6.7 years ago by GenoMax 152k • written 7.3 years ago by vrehaman ▴ 30

1

Entering edit mode

Hi vrehaman,

I am in a similar situation like you. I see the issue is with strands setting where featurecounts with -s 1 and -s 2 gave similar number of reads while I get double the number of reads assigned with -s 0. With RSeQC, I see most reads fraction (>90%) assigned to strand

Fraction of reads explained by "1++,1--,2+-,2-+": 0.9535
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0142

with HTSeq I get:

HtSeq Sum (stranded - s reverse)    =   2,192,168
HtSeq Sum (stranded - s no)         = 15,087,460
HtSeq Sum (stranded - s yes)            = 15,859,302

with featurecounts:

-s 0 = 32,092,726
-s 1 = 17,225,140
-s 2 = 17,228,515

Sequencing provider mentioned libraries are 'stranded' prepared with NuGen Universal Plus RNA-seq kit.

I wonder if stranded protocol did not work well in this case and why the counts are very different with HTSeq and featurecounts. Ideally they should be similar with HTSeq - s reverse and featurecount -s 2.

Do you know what kit your data was sequenced with and please post if you have figured out anything further on this.

ADD REPLY • link 6.7 years ago by sutturka ▴ 190

0

Entering edit mode

Are you sure you are properly accounting for paired reads? You have about 16M unstranded reads with HTSeq, but 32M with featureCounts. It looks like you ran featureCounts in single-read mode, which would explain twice as many reads and half of them switching strand.

ADD REPLY • link 6.7 years ago by igor 13k

0

Entering edit mode

I observed reads map to both '+' and '-" are almost same.

Sound like you do not have a stranded library.

ADD REPLY • link 7.3 years ago by igor 13k