Question

Best practices for unstranded sequences in featureCounts

5 months ago

Enrique • 0

Hi everyone, I'm using featureCounts to analyze some RNA-Seq data, but I have several questions about using it with unstranded libraries.

First, when I analyze SRA sequences or otherwise don't know the library type, I use Salmon to detect it with the following command: salmon quant -p 32 -i index_salmon -l A -1 seq1.fastq.gz -2 seq2.fastq.gz --skipQuant -o results/.
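A minimal sketch of reading the detected library type back from the Salmon output directory. It assumes that Salmon writes a lib_format_counts.json file with an "expected_format" field into the -o directory (also with --skipQuant), and that the usual mapping to the featureCounts -s value applies (unstranded codes to 0, forward-stranded to 1, reverse-stranded to 2). The function names are hypothetical.

```python
import json
from pathlib import Path


def detected_library_type(salmon_dir: str) -> str:
    """Return the library type Salmon inferred, e.g. 'IU' or 'ISR'.

    Assumption: the output directory contains lib_format_counts.json
    with an 'expected_format' key.
    """
    path = Path(salmon_dir) / "lib_format_counts.json"
    with path.open() as handle:
        return json.load(handle)["expected_format"]


def featurecounts_strand_flag(expected_format: str) -> int:
    """Map a Salmon library-type code to a featureCounts -s value.

    Assumed mapping: codes without 'S' (IU, OU, MU, U) are unstranded (0),
    codes ending in 'F' are forward-stranded (1), the rest reverse (2).
    """
    if "S" not in expected_format:
        return 0
    return 1 if expected_format.endswith("F") else 2


# Example with a hard-coded code; in practice pass
# detected_library_type("results/") instead.
print(featurecounts_strand_flag("IU"))
```

If the reported type contains no "S" (for example IU), the library is treated as unstranded and featureCounts can be left at its default of -s 0.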

Then, when the library is unstranded, I run hisat2 without setting any strandedness option (the default is unstranded, so that is fine). Finally, I run featureCounts (2.0.6), but I have trouble with overlapping genes. My default options are: featureCounts -p --countReadPairs -d 0 -D 1000 -C -T 32 -a gff_file.gff -t exon -g Parent -o output_file.tsv input_files.bam.

Given that we have overlapping genes and unstranded reads, what are the best options? featureCounts has the following arguments:

Overlap between reads and features

-O Assign reads to all their overlapping meta-features (or features if -f is specified).

--minOverlap <int> Minimum number of overlapping bases in a read that is required for read assignment. 1 by default. Number of overlapping bases is counted from both reads if paired end. If a negative value is provided, then a gap of up to specified size will be allowed between read and the feature that the read is assigned to.

--fracOverlap <float> Minimum fraction of overlapping bases in a read that is required for read assignment. Value should be within range [0,1]. 0 by default. Number of overlapping bases is counted from both reads if paired end. Both this option and '--minOverlap' option need to be satisfied for read assignment.

--fracOverlapFeature <float> Minimum fraction of overlapping bases in a feature that is required for read assignment. Value should be within range [0,1]. 0 by default.

--largestOverlap Assign reads to a meta-feature/feature that has the largest number of overlapping bases.

--nonOverlap <int> Maximum number of non-overlapping bases in a read (or a read pair) that is allowed when being assigned to a feature. No limit is set by default.

--nonOverlapFeature <int> Maximum number of non-overlapping bases in a feature that is allowed in read assignment. No limit is set by default.

--readExtension5 <int> Reads are extended upstream by <int> bases from their 5' end.

--readExtension3 <int> Reads are extended upstream by <int> bases from their 3' end.

--read2pos 5:3 Reduce reads to their 5' most base or 3' most base. Read counting is then performed based on the single base the read is reduced to.

For example, I have pollen samples, and I know that one gene is expressed only in that tissue: the small antisense one in the image below. I want the best approach for that case: an option that assigns the count to that feature and not to the bigger one (i.e. an option that compares what percentage of each feature is covered and assigns the count to the feature with the higher relative percentage). For example, if the reads cover 100% of one feature but only 25% of the other, assign the count to the feature covered at 100%, not to the other one.

[Image: jBrowse olea]
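The rule described above (prefer the feature whose length is covered to the greatest extent) could be sketched roughly as follows. This is only an illustration under simplifying assumptions: each feature is a single half-open interval (exon structure is ignored), the coordinates are made up, and the function names are hypothetical. It is not a featureCounts option.

```python
from typing import Dict, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def assign_by_feature_fraction(
    read: Interval, features: Dict[str, Interval]
) -> Optional[str]:
    """Pick the feature with the largest covered fraction of its own length.

    Returns None if the read overlaps no feature. On ties, the feature
    seen first in `features` is kept.
    """
    best_name, best_fraction = None, 0.0
    for name, feature in features.items():
        length = feature[1] - feature[0]
        if length <= 0:
            continue  # skip malformed intervals
        fraction = overlap_length(read, feature) / length
        if fraction > best_fraction:
            best_name, best_fraction = name, fraction
    return best_name


# Made-up coordinates: the read covers 100% of the small gene
# but only 25% of the big one.
features = {"big_gene": (900, 1300), "small_antisense": (1000, 1100)}
print(assign_by_feature_fraction((1000, 1100), features))  # small_antisense
```

Normalizing by feature length is what separates this from --largestOverlap, which compares the raw number of overlapping bases; here both genes share 100 bases with the read, so the raw overlap alone would not decide between them.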

featurecounts overlapping unstranded • 523 views


score 2 · Accepted Answer · 2023-11-27

5 months ago

Istvan Albert 100k

What you seem to be after is to create a sort of expectation maximization algorithm out of featureCounts flags.

But the tool was not designed for that so I think you will have a hard time getting it to do what you want.

I would recommend that you count all reads for all overlapping features with -O, then write a simple program in, say, Python to resolve these cases with the custom decision rule you describe above.
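As a rough starting point for such a program, the sketch below tallies which sets of features share reads. It assumes featureCounts was run with -O together with -R CORE, and that the resulting per-read file is tab-separated with four columns (read name, assignment status, number of targets, comma-separated target list); please check that layout against your own output before relying on it. The function name is hypothetical.

```python
import csv
from collections import Counter
from typing import Iterable


def ambiguous_target_sets(lines: Iterable[str]) -> Counter:
    """Count reads per set of overlapping targets.

    Only reads with status 'Assigned' and more than one distinct target
    are tallied; keys are sorted tuples of target IDs.
    """
    counts: Counter = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 4 or row[1] != "Assigned":
            continue  # unassigned reads or unexpected rows
        targets = tuple(sorted(set(row[3].split(","))))
        if len(targets) > 1:
            counts[targets] += 1
    return counts


# Made-up rows in the assumed CORE layout.
example = [
    "read1\tAssigned\t2\tgeneA,geneB",
    "read2\tAssigned\t1\tgeneA",
    "read3\tUnassigned_NoFeatures\t-1\tNA",
    "read4\tAssigned\t2\tgeneB,geneA",
]
for targets, n in ambiguous_target_sets(example).items():
    print(targets, n)
```

Once the ambiguous sets are known, each one can be resolved with whatever rule fits the biology (for example, giving the reads to the feature with the higher covered fraction) and the -O count table adjusted accordingly.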

Salmon does have a builtin EM algorithm, so maybe you should use that - it might just work directly.


Thank you so much. It's not my intention to solve my problem with featureCounts alone, but I want to know the best arguments for this case. I have several samples with unstranded sequences, and I want to avoid this kind of problem with overlapping genes. Also, featureCounts is commonly used in my group's workflows.

ADD REPLY • link 5 months ago by Enrique • 0