Question

Infering strand specificity of bam files using rseqc?

2

Entering edit mode

3.1 years ago

DareDevil ★ 4.3k

I have run infer_experiment.py of rseqc package to identify the strandedness of the aligned bam file so that I can feed -s option of featureCounts . I used following command to generate the output: infer_experiment.py -r hg38.bed -i xxy2.sort.bam

The output was:

This is PairEnd Data
Fraction of reads failed to determine: 0.0020
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0906
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9073

But I am confused with the result. Here, they state Pair-end non strand specific as:

This is PairEnd Data
Fraction of reads failed to determine: 0.0172
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4925

and Pair-end strand specific as:

This is PairEnd Data
Fraction of reads failed to determine: 0.0072
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9441
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0487

In Pair-end non strand specific case output is explained by both "1++,1--,2+-,2-+" and "1+-,1-+,2++,2--" since they have similar fractions.

In Pair-end strand specific case output is explained by “1++,1–,2+-,2-+” as it has the major fraction.

But my output is explained by "1+-,1-+,2++,2--".

Here what is the strand specificity which I can feed to --strandedness option in featureCounts. Any help appreciated.

RSeQC bam featureCounts • 1.7k views

ADD COMMENT • link updated 3.1 years ago by Istvan Albert 100k • written 3.1 years ago by DareDevil ★ 4.3k

1

Entering edit mode

I understand that this data is strand specific ("1+-,1-+,2++,2--") as:

read1 mapped to '+' strand indicates parental gene on '-' strand
read1 mapped to '-' strand indicates parental gene on '+' strand
read2 mapped to '+' strand indicates parental gene on '+' strand
read2 mapped to '-' strand indicates parental gene on '-' strand

ADD REPLY • link 3.1 years ago by DareDevil ★ 4.3k

score 2 · Answer 1 · 2021-03-30

Ironically the output of the tool supposed to help you decide only adds to the confusion.

Here is a better way to do it, that does not require running tools, instead asks you to understand your own data.

Go to the documentation for GUESSmyLT you don't have to run the tool, instead study and understand what the possible orientations are as described in the docs:

https://github.com/NBISweden/GUESSmyLT

Now open up your BAM and transcript GTG files in IGV and select "view as pairs" options (right-click the panel). Now all you need is to look at your data relative to a gene and visually evaluate which case do you have. The patterns are usually absolutely evident at first glance.

featurecounts has only three options 0, 1 and 2, where

0 is unstranded,
1 is where first in pair has the same direction as the gene,
2 is where first in pair has the opposite direction as the gene,