Incorrect gene naming (SYMBOL)
4 months ago
Biomed-jeh ▴ 70

Hi,

I found some rather strange gene naming in my work, and I can't figure it out myself. I hope one of you can enlighten me or help me correct it.

I use this software to map raw FASTQ files (bulk RNA sequencing data) to a reference genome and to count reads:

  • hisat2 2.2.1
  • featureCounts 1.5.3
  • human reference genome GRCh38.p14 from NCBI.

What I see is that some gene names (e.g., STING1 and IL3RA) are named STING1_1 and IL3RA_1 (notice the _1) in my count matrix. I am extremely confused as to why this would occur, and it makes searching for specific genes extremely challenging because I have no clue whether the software added _1 or something else. This does not happen for all gene names; _1 is added to less than 1% of all gene names in the count matrix.

You can see an example of the counts for STING1, STING1_ and STING1_1 here:

enter image description here

A quick solution I thought of is to write code that removes _1 from any gene name. However, I’m unsure whether _1 is ever part of a real gene name that I’m not yet aware of. Because of this, I’m worried that I might accidentally modify a legitimate gene name incorrectly. Do any of you know if I can apply this solution, or do you have any other solutions?

Thank you!

Gene-annotation • 1.1k views

_1 is not a part of any gene symbol as far as I'm aware. It is possible that some piece of software along the way chose to address duplicates in this manner. Where did you run your software list - was it on a local cluster or on a cloud platform?
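
One way to narrow down where the suffix comes from is to check whether the suffixed IDs are already present in the annotation file itself. The sketch below is only an illustration: it assumes a GTF-style annotation with a `gene_id "..."` attribute in the ninth column and a suffix of the form `_<digits>`. The function name and the example lines are made up for the demonstration and are not taken from the poster's files.

```python
import re
from collections import defaultdict

# Assumption: the annotation is GTF-like, with gene_id "..." in column 9.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
# Assumption: the suffix of interest is an underscore followed by digits.
SUFFIX_RE = re.compile(r"^(.+)_(\d+)$")


def suffixed_gene_ids(gtf_lines):
    """Return {suffixed gene_id: sorted sequence names} for every suffixed
    gene_id whose unsuffixed form is also present in the annotation."""
    seen = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = GENE_ID_RE.search(fields[8])
        if match:
            seen[match.group(1)].add(fields[0])

    result = {}
    for gene_id, seqs in seen.items():
        suffix = SUFFIX_RE.match(gene_id)
        if suffix and suffix.group(1) in seen:
            result[gene_id] = sorted(seqs)
    return result


# Made-up records, only to show the shape of the input and output.
example_lines = [
    "#comment line",
    'seqA\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "GENEX"; gbkey "Gene";',
    'seqB\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "GENEX_1"; gbkey "Gene";',
    'seqA\tsrc\tgene\t200\t300\t.\t+\t.\tgene_id "GENEY_2";',
]
report = suffixed_gene_ids(example_lines)
print(report)
```

If suffixed IDs such as STING1_1 turn up in the annotation, and on a different sequence than the unsuffixed gene, the suffix was probably inherited from the annotation rather than added by the aligner or the counting step. For a real file, the lines could be passed in from an open file handle instead of a list.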


Thank you for replying.

I am working on a local cluster (remote access to an HPC), but would that have anything to do with the naming of genes? The _1 is added only for some genes. The majority of the gene names are as expected, but occasionally I run into genes with _1 appended.


The reason I asked that question was that you can access the resources used in more depth when you're on a cluster and also get help from a sysadmin if you need to.

Can you show us your commands here? Also, please use a gist and do not use screenshots.


Please do not paste screenshots of plain text content; it is counterproductive. You can copy and paste the content directly here (using the code formatting option), or use a GitHub Gist if the content exceeds the length allowed here.

4 months ago

A quick solution I thought of is to write code that removes _1 from any gene name. However, I’m unsure whether _1 is ever part of a real gene name that I’m not yet aware of. Because of this, I’m worried that I might accidentally modify a legitimate gene name incorrectly

A lot of bioinformatics involves munging data that has picked up custom or undocumented prefixes and suffixes along the way, before it gets to you.

You basically have to do the best job you can to clean things up, and deal rationally with edge cases. A flowchart-like approach that applies some logic to the problem cases may help:

  1. Search for a gene name/symbol.
  2. If there is a match, look for a suffix.
  3. Strip the suffix and repeat step 1 with the stripped symbol.
  4. If the stripped symbol also matches, put both matches into a special list you can investigate by hand later on.
  5. If the stripped symbol does not match, keep the first match, but perhaps file the stripped symbol as one that may need searching in another database.

Separately:

  1. Search for a gene name/symbol.
  2. If there is no match, look for a suffix.
  3. If no suffix is available, file it as a symbol that may need searching in another database.
  4. If a suffix is available, strip it and repeat step 1.
  5. If there is a match, store it in a list of hits.
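
The two procedures above can be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation: it assumes the suffix always looks like `_<digits>`, and that you have a set of accepted symbols (for example, exported from a gene nomenclature database). The function name, the bucket names, and the placeholder symbols DEMO_1, PAIR and NOPE are invented for the example.

```python
import re

# Assumption: a "suffix" is an underscore followed by digits at the end.
SUFFIX_RE = re.compile(r"^(.+)_(\d+)$")


def classify_symbols(symbols, known):
    """Sort symbols into buckets following the two flowcharts.

    kept     - symbols that match as they are
    hits     - suffixed symbols that only match once the suffix is stripped
    review   - (symbol, stripped) pairs where both forms match
    other_db - symbols to look up in another database
    """
    buckets = {"kept": [], "hits": {}, "review": [], "other_db": []}
    for symbol in symbols:
        match = SUFFIX_RE.match(symbol)
        stripped = match.group(1) if match else None

        if symbol in known:
            if stripped is None:
                buckets["kept"].append(symbol)
            elif stripped in known:
                # Both forms are real symbols: investigate by hand.
                buckets["review"].append((symbol, stripped))
            else:
                # Keep the first match, file the stripped form for later.
                buckets["kept"].append(symbol)
                buckets["other_db"].append(stripped)
        elif stripped is not None and stripped in known:
            buckets["hits"][symbol] = stripped
        else:
            # No match with or without the suffix.
            buckets["other_db"].append(symbol)
    return buckets


# Placeholder reference set; DEMO_1, PAIR and PAIR_1 are not real symbols.
known_symbols = {"TP53", "STING1", "IL3RA", "DEMO_1", "PAIR", "PAIR_1"}
matrix_symbols = ["TP53", "STING1_1", "DEMO_1", "PAIR_1", "NOPE", "NOPE_2"]
result = classify_symbols(matrix_symbols, known_symbols)
print(result)
```

Keeping the ambiguous cases in a separate review list, instead of stripping every suffix blindly, is what guards against accidentally renaming a legitimate symbol that happens to end in `_1`.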

A related problem comes from datasets processed with Excel, which can silently convert some gene symbols into dates. Some of the affected genes have since been renamed, but it is perhaps as much a tooling problem as a documentation problem.


Thank you for the insight. I was not aware that an article had been released, related to this topic.
