Question

RNA-seq analysis

0

Entering edit mode

5 months ago

sk_24 ▴ 10

I am analyzing RNA-seq data and I used HiSat2 for alignment of paired-end files. After getting the bam files (sorted by name for Htseq and by position for featurecounts), I want to know the strand level information for counting the reads mapped to genes as both these tools require strand specific information. How can I get to know about this?

Htseq-count HiSat2 featurecounts STAR • 1.5k views

ADD COMMENT • link updated 5 months ago by GenoMax 154k • written 5 months ago by sk_24 ▴ 10

GenoMax · Answer 1 · 2025-05-13

4

Entering edit mode

5 months ago

Gordon Smyth ★ 8.4k

featureCounts is so fast that you simply try both strands on one of the BAM files. From R, just try:

library(Rsubread)
fc1 <- featureCounts(yourbamfile, annot.ext = whatever, isGTFAnnotationFile = whatever, 
    strandSpecific = 1, isPairedEnd = TRUE, nthreads = whatever)
fc2 <- featureCounts(yourbamfile, annot.ext = whatever, isGTFAnnotationFile = whatever, 
    strandSpecific = 2, isPairedEnd = TRUE, nthreads = whatever)

If both choices give roughly the same number of counts, then the sequencing was unstranded, in which case you should go back to strandSpecific=0. If fc1 gives far more reads than fc2, then the sequencing was stranded. If fc2 gives far more reads, then the sequencing was reverse-stranded.

ADD COMMENT • link 5 months ago by Gordon Smyth ★ 8.4k

0

Entering edit mode

Will it work on BAM files that I obtained from HISAT2 with default option of strandedness, i.e. unstranded? Or STAR aligned BAM files (which does not consider the strandedness) can also be used?

ADD REPLY • link 5 months ago by sk_24 ▴ 10

1

Entering edit mode

Yes, the code I gave you will work on any BAM file. In general, strand information is used only when assigning read alignments to genes, not when aligning the reads to the reference genome in the first place. The BAM file stores only the alignment.

ADD REPLY • link 5 months ago by Gordon Smyth ★ 8.4k

0

Entering edit mode

I tried with 2 paired-end BAM files from different GSE IDs. One of them gave 35149287 fc0 reads, 18809574 fc1 reads and 18645129 fc2 reads. The other BAM file gave 29345428 fc0 reads, 27236575 fc1 reads and 27239921 fc2 reads. For the second file, fc1 and fc2 give roughly the same reads while for the first file, although fc1 and fc2 reads are close enough, fc0 is approximately double of fc1 or fc2. Should I consider both of these files as unstranded?

ADD REPLY • link 5 months ago by sk_24 ▴ 10

1

Entering edit mode

Please post the complete featureCounts assignments rather than just read numbers. Which can be found in the .summary file.

ADD REPLY • link 5 months ago by GenoMax 154k

1

Entering edit mode

It is the assigned reads that are important, not total reads. I will assume that you are reporting the number of assigned reads, although your comment does not specify.

The first file is clearly unstranded.

The results that you report from the other file are seemingly impossible if you have run the commands correctly and have provided a GTF file that contains strand information. Something is very wrong, but I cannot diagnose what that might be from the very limited information you have provided.

ADD REPLY • link 5 months ago by Gordon Smyth ★ 8.4k

0

Entering edit mode

Yes, I mentioned the assigned reads from the featurecounts output. There was a mistake in the Hisat2 alignment command. The R2 pair files were not specified. So, it reported most of the reads as chimera. I corrected it and below is the complete featurecounts output:

**Ran with unstranded parameter:**
Assigned    23576090
Unassigned_Unmapped 327686
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  165242
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 4082297
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   516646
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    1131590

**Ran with strandedness = 1** 
Assigned    12691428
Unassigned_Unmapped 327686
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  165242
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 4082297
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   12263387
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    269511

**Ran with strandedness = 2**
Assigned    12579729
Unassigned_Unmapped 327686
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  165242
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 4082297
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   12377707
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    266890

ADD REPLY • link updated 5 months ago by GenoMax 154k • written 5 months ago by sk_24 ▴ 10

0

Entering edit mode

OK now. Clearly the sequencing was unstranded. Just use featureCounts with default strandSpecific=0.

ADD REPLY • link 5 months ago by Gordon Smyth ★ 8.4k

score 3 · Answer 2 · 2025-05-13

require strand specific information

See the answers in --> How to find rna strand direction before alignment? for tools that you can use with your aligned data files. https://github.com/NBISweden/GUESSmyLT and https://rseqc.sourceforge.net/#infer-experiment-py are direct tools.

You can also use salmon to figure out this information: https://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype