Hi,
I found some rather strange gene naming in my work, and I can't figure it out myself. I hope one of you can enlighten me or help me correct it.
I use the following software to map raw FASTQ files (bulk RNA sequencing data) to the reference genome and to count reads:
- hisat2 2.2.1
- featureCounts 1.5.3
- human reference genome GRCh38.p14 from NCBI.
What I see is that some gene names (e.g., STING1 and IL3RA) appear as STING1_1 and IL3RA_1 (notice the _1) in my count matrix. I am extremely confused as to why this would occur, and it makes searching for specific genes very challenging because I have no clue whether the software added _1 or something else did. This does not happen for all gene names; _1 is added to less than 1% of the gene names in the count matrix.
You can see an example of the counts found for STING1, STING1_ and STING1_1 here:
A quick solution I thought of is to write code that removes _1 from any gene name. However, I'm unsure whether _1 is ever part of a real gene name that I'm not yet aware of, so I'm worried that I might accidentally alter a legitimate gene name. Do any of you know whether I can apply this solution, or do you have any other suggestions?
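Before stripping anything, it may be safer to list which names carry the suffix and whether the unsuffixed symbol is also present in the matrix, since removing _1 in that case would create duplicate row names. The following is only a minimal sketch: it assumes the suffix is an underscore followed by digits at the very end of the name, and the function names and the example list are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: the suffix is "_" followed by digits at the very end of the name.
SUFFIX_RE = re.compile(r"^(?P<base>.+)_(?P<n>\d+)$")


def group_suffixed_names(gene_names):
    """Map each base symbol to the suffixed variants found in gene_names."""
    groups = defaultdict(list)
    for name in gene_names:
        match = SUFFIX_RE.match(name)
        if match:
            groups[match.group("base")].append(name)
    return dict(groups)


def collisions(gene_names):
    """Base symbols that occur in the list alongside a suffixed variant."""
    present = set(gene_names)
    return sorted(base for base in group_suffixed_names(gene_names) if base in present)


# Hypothetical row names, standing in for the first column of a count matrix.
names = ["STING1", "STING1_1", "IL3RA_1", "TP53"]
print(group_suffixed_names(names))
print(collisions(names))
```

Names reported by `collisions` would need a decision (for example keeping the rows separate or summing their counts) rather than a plain rename.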
Thank you!
_1 is not part of any gene symbol as far as I'm aware. It is possible that some piece of software along the way chose to handle duplicates in this manner. Where did you run the software you listed: on a local cluster or on a cloud platform?
Thank you for replying.
I am working on a local cluster (remote access to an HPC system), but would that have anything to do with the naming of genes? The _1 is added only for some genes. The majority of the gene names are as expected, but occasionally I run into genes with _1 appended.
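One way to narrow down where the suffix comes from is to check whether it is already present in the annotation file given to featureCounts, rather than being added during counting. The sketch below assumes the annotation is a GTF file whose ninth column contains a `gene_id "..."` attribute; the function name and the example lines are invented for illustration and are not taken from any real annotation.

```python
import re

# Assumption: GTF-style attributes of the form gene_id "VALUE";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
SUFFIX_RE = re.compile(r"_\d+$")


def suffixed_gene_ids(gtf_lines):
    """Return the sorted gene_id values that end in an underscore plus digits."""
    found = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = GENE_ID_RE.search(fields[8])
        if match and SUFFIX_RE.search(match.group(1)):
            found.add(match.group(1))
    return sorted(found)


# Made-up records, only to show the expected input shape.
example = [
    "#gtf-version 2.2\n",
    'chrA\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "GENEA"; \n',
    'chrB\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "GENEA_1"; \n',
]
print(suffixed_gene_ids(example))
```

With a real file, the same function could be called on an open file handle, e.g. `with open("annotation.gtf") as fh: suffixed_gene_ids(fh)`, where the path is a placeholder. If the suffixed IDs show up there, the naming originates in the annotation and not in the mapping or counting step.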
The reason I asked that question was that you can access the resources used in more depth when you're on a cluster and also get help from a sysadmin if you need to.
Can you show us your commands here? Also, please use a gist and do not use screenshots.
Please do not paste screenshots of plain text content, it is counterproductive. You can copy paste the content directly here (using the code formatting option shown below), or use a GitHub Gist if the content volume exceeds allowed length here.