Is there a way to change gene_id in a GTF file where gene_id is actually a gene_name (and multiple genes have the same gene names) ?
0
1
Entering edit mode
17 months ago
nlehmann ▴ 130

I downloaded a GTF file from UCSC and I observed that all gene_id are identical to gene_names. Considering some genes have the same name but not the same ID and that it causes troubles with some tools, what is the best way to fix this ? I know how to solve the reverse problem (where gene names are gene IDs) by using a Biomart correspondence table with some awk statements, but this case seems trickier as we need to take into account the genes coordinates to define when there are multiple genes with the same name.

# extract of 2 lines of the file galGal6.ncbiRefSeq.gtf.gz - 2 genes with same name (different orientation and coordinates)
chr19   ncbiRefSeq  transcript  569498  570362  .   -   .   gene_id "CCL4"; transcript_id "NM_001030360.2"; gene_name "CCL4"; ref_gene_id "CCL4";
chr19   ncbiRefSeq  transcript  576493  578185  .   +   .   gene_id "CCL4"; transcript_id "XM_015295666.2"; gene_name "CCL4"; ref_gene_id "CCL4";


ENSGALG00000032717 CCL4
ENSGALG00000034478 CCL4


When looking in the UCSC genome browser, I know that ENSGALG00000032717 is the first one (coord: 569498-570362) and ENSGALG00000034478 is the second one (576493-578185), so I can correct it to this:

# corrected
chr19   ncbiRefSeq  transcript  569498  570362  .   -   .   gene_id "ENSGALG00000032717"; transcript_id "NM_001030360.2"; gene_name "CCL4"; ref_gene_id "CCL4";
chr19   ncbiRefSeq  transcript  576493  578185  .   +   .   gene_id "ENSGALG00000034478"; transcript_id "XM_015295666.2"; gene_name "CCL4"; ref_gene_id "CCL4";


Now my question is: is there an automatic way of doing this ?

Thanks for the help !

RNA-Seq assembly UCSC • 794 views
1
Entering edit mode

Is there a reason why you definitely want to use the UCSC annotations? Seems like this would be most easily solved by using the Ensembl annotations.

If you definitely want to use the refseq annotations, as there is no other gene in between the two CCL4 genes, I can't see any automatic way of doing this, without creating code that separates out adjacent transcripts into different genes by looking for clusters of overlapping transcripts.

0
Entering edit mode

I am using both annotations as they show some differences and wanted to compare them. So I counted how many gene names that had multiple gene IDs, and we have 56 genes in this situation. Just as a matter of reproducibility, here is how I found these 56 genes:

Data:

# mart_export.txt example
ENSGALG00000032717 CCL4
ENSGALG00000034478 CCL4


Script cmd.awk:

# to run the script:
# awk -F'\t' -v OFS="\t" -v colGeneNameBiomart="2" -f cmd.awk mart_export.txt | sort
{
biomartTable[$1] =$colGeneNameBiomart
}
END {
for (x in biomartTable){
geneCounter[biomartTable[x]] += 1
}
for (x in geneCounter){
if (geneCounter[x] > 1){
print x, geneCounter[x]
}
}
}


Output :

# col1:  gene_name
# col2: number of different Ensembl geneIDs for this gene name
5S_rRNA 53
7SK 9
ABCB1   2
CCL4    2
CD8A    3
FK21    6
FK27    2
gga-mir-12222   2
gga-mir-6566    3
gga-mir-6608-1  3
gga-mir-6637-1  2
gga-mir-7450    2
gga-mir-7458-1  2
gga-mir-7461    4
gga-mir-7470    2
gga-mir-7478    18
gga-mir-7479-1  16
gga-mir-7479-2  2
gga-mir-7479-3  8
gga-mir-7482-1  3
gga-mir-7482-6  2
H2A-VIII    7
IFN-A   5
IL1R1   2
LDB3    2
LMNA    2
Metazoa_SRP 5
MGAT4C  2
NEBL    2
NPPC    2
POU4F3  2
PPP1R9B 2
SNORA50A    2
SNORA62 2
SNORA66 2
SNORD14 3
SNORD15 3
SNORD16 2
SNORD18 2
SNORD24 2
SNORD36 2
SNORD58 8
SNORD63 2
SNORD83 2
TGM2    2
U1  20
U2  5
U3  3
U4  5
U5  5
U6  28
Vault   2
Y_RNA   3


With colGeneNameBiomart the column number where there is the gene name. But still, changing these 56 genes by hand is not really a solution.

There is also the original NCBI GTF file that I could use. It looks like it (same CCL4 example):

NC_006106.5 BestRefSeq  gene    569498  570362  .   -   .   ID=gene-CCL4;Dbxref=GeneID:395551;Name=CCL4;description=chemokine (C-C motif) ligand 4;gbkey=Gene;gene=CCL4;gene_biotype=protein_coding;gene_synonym=MIP-1beta,SCYA4
NC_006106.5 BestRefSeq%2CGnomon gene    576493  578185  .   +   .   ID=gene-CCL4-2;Dbxref=CGNC:49334,GeneID:395468;Name=CCL4;description=C-C motif chemokine ligand 4;gbkey=Gene;gene=CCL4;gene_biotype=protein_coding;gene_synonym=K203


And then I can find the correspondances in the biomart table...

# mart_export.txt with Entrez ID
Gene stable ID  Gene name   NCBI gene (formerly Entrezgene) ID
ENSGALG00000032717  CCL4    395551
ENSGALG00000034478  CCL4    395468


That's the only thing that would make sense for me, but I thought it was a more common problem. But still, this works if we use original NCBI gtf instead of UCSC ncbi RefSeq GTF, so we didn't really answer the question in the end. I can't understand why UCSC doesn't use geneID (Ensembl, Entrez, whichever) though...

0
Entering edit mode

The UCSC datasets were never really design to be used in analyses I don't think, just for visualization.

It is sort of a problem in Human as well, in that there are genes with the same name in different locations, but at least in that case they are on different chromosomes.

0
Entering edit mode

Do you make such a difference with datasets from NCBI vs Ensembl ? Which database do you consider is best ?

0
Entering edit mode

The databases are moving more towards each other, at least for humans. But in general, as a rule of thumb, Ensembl is more comprehensive and Refseq is more selective. We usually use Ensembl, but use RefSeq occasionally for particular things where the presence of suspect transcripts in Ensembl is a problem.

0
Entering edit mode
gunzip -c galGal6.ncbiRefSeq.gtf.gz |  awk '{gsub("gene_name \"",sprintf("gene_name \"id.%s.",NR));print}'


?

0
Entering edit mode

It doesn't answer the question, I'll edit it to make it clearer