HT-Seq and count matrix
1
0
Entering edit mode
5 months ago
qudrat.nii ▴ 10

Hello,

I am trying to count reads using HT-Seq. In my gtf file, there are 60603 genes when I use the gene_id option I get all the genes in the count file with reads, but when I use the gene_name option I get only 59055 genes reads in the count file. Please look into this. Thank you

HT-Seq BAM • 581 views
ADD COMMENT
2
Entering edit mode
5 months ago

Generally a gene ID will uniquely map to only one gene name, but a gene name can map to multiple gene IDs. You can also have genes that have only a gene ID and no gene name. Both of these means there are generally less gene names than gene IDs in a GTF/annotation file.

Here's an example from the ensembl v110 human GTF file.

curl -O 'https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.chr.gtf.gz'

# The number of unique gene names.
gzip -dc Homo_sapiens.GRCh38.110.chr.gtf.gz | grep -o 'gene_name \"[^\"]*\"' | sort -u | wc -l
40871

# The number of unique gene IDs.
gzip -dc Homo_sapiens.GRCh38.110.chr.gtf.gz | grep -o 'gene_id \"[^\"]*\"' | sort -u | wc -l
62700
ADD COMMENT
0
Entering edit mode

what should be the correct way as mapping to gene ID will require to change it to gene name?

ADD REPLY
1
Entering edit mode

I usually keep them as gene IDs until some visualization or analysis requires a gene name. When this occurs I will either: 1) discard genes without a gene name, or 2) use the gene name unless the gene name is missing or there are more than one gene IDs mapping to the same gene name, in which case I will use the gene ID instead.

ADD REPLY
1
Entering edit mode

I usually do geneID_geneName. If then for visualization I need geneName, it is a simple gsub regex that trims the "_" and everything before it.

ADD REPLY

Login before adding your answer.

Traffic: 1748 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6