Question

Is there a way to safely remove the underscore suffixes from gene IDs in the GENCODE hg19 GTF?

2.3 years ago

simplitia ▴ 130

Hi, I noticed that the GENCODE hg19 GTF files (https://www.gencodegenes.org/human/release_33lift37.html) have a strange annotation: an underscore followed by a number is appended to the end of each gene ID. Is there a way to safely remove all of these underscore suffixes? The file annotations look something like this.

chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5_4"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2_4"; remap_status "full_contig"; remap_num_mappings 1; remap_target_status "overlap";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level 1; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000362751.1_1"; remap_num_mappings 1; remap_status "full_contig"; remap_target_status "overlap";
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1_1"; level 2; transcript_support_level 1; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000362751.1_1"; remap_original_location "chr1:+:11869-12227"; remap_status "full_contig";
chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1_1"; level 2; transcript_support_level 1; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000362751.1_1"; remap_original_location "chr1:+:12613-12721"; remap_status "full_contig";
chr1    HAVANA  exon    13221   14409   .       +       .       gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1_1"; level 2; transcript_support_level 1; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000362751.1_1"; remap_original_location "chr1:+:13221-14409"; remap_status "full_contig";
chr1    HAVANA  transcript      12010   13670   .       +       .       gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000450305.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37102"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000002844.2_1"; remap_num_mappings 1; remap_status "full_contig"; remap_target_status "overlap";
chr1    HAVANA  exon    12010   12057   .       +       .       gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000450305.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; exon_number 1; exon_id "ENSE00001948541.1_1"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37102"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000002844.2_1"; remap_original_location "chr1:+:12010-12057"; remap_status "full_contig";
chr1    HAVANA  exon    12179   12227   .       +       .       gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000450305.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; exon_number 2; exon_id "ENSE00001671638.2_1"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37102"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000002844.2_1"; remap_original_location "chr1:+:12179-12227"; remap_status "full_contig";

rnaseq genecode • 1.6k views

score 0 · Answer 1 · 2021-12-20

2.3 years ago

cpad0112 21k

$ awk '{sub("_","",$10)}1' test.txt

test.txt contains the text from the original post.

Thanks; it didn't really work right, so I ended up using the sed command below. The important switch here was /g, which replaces every occurrence. What I'm still a bit worried about is that some obscure gene name with an underscore followed by a number, such as GENENAME_2, will get mangled by this.

sed 's/_[0-9]//g' gencode.v37lift37.basic.annotation.gtf > gencode.v37.FIXED.gtf

ADD REPLY • link 2.3 years ago by simplitia ▴ 130
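
One way to check the GENENAME_2 worry before running a global replacement is to list which attribute keys actually hold a quoted value containing an underscore followed by a digit. This is only a sketch: the function name `list_suffixed_keys` is made up for illustration, and it looks only at quoted attribute values, not at the first eight tab-separated columns. Note also that `_[0-9]` matches a single digit, so a suffix such as `_10` would leave a stray `0` behind after the sed above.

```shell
# List the distinct attribute keys whose quoted value contains "_<digit>",
# i.e. the values that a global 's/_[0-9]//g' would modify.
# Reads the files given as arguments, or stdin when none are given.
list_suffixed_keys() {
  grep -ohE '[A-Za-z_]+ "[^"]*_[0-9][^"]*"' "$@" | cut -d' ' -f1 | sort -u
}

# Small self-contained demonstration on one made-up attribute string.
printf 'gene_id "ENSG00000223972.5_4"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "GENENAME_2";\n' \
  | list_suffixed_keys
```

If a key such as gene_name shows up in the list, a blanket replacement would alter it, and a more targeted pattern is the safer choice.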

To remove only the "_", try this (from ENSG00000223972.5_4 to ENSG00000223972.54 in the gene_id column):

head gencode.v37lift37.basic.annotation.gtf | awk -v FS="\"" -v OFS="\"" '{sub("_","",$2)}1'

To remove only the "_number", try this (from ENSG00000223972.5_4 to ENSG00000223972.5 in the gene_id column):

head gencode.v37lift37.annotation.gtf | awk -F "gene_id" -v OFS="gene_id" '{sub("_[0-9]+","",$2)}1'

ADD REPLY • link 2.3 years ago by cpad0112 21k

Thanks again for your help. Yes, what I want is to eliminate any number after an underscore, so that ENSG00000223972.5_4 becomes ENSG00000223972.5. sed does this correctly, but the new awk command you sent collapses the ID and gives ENSG00000223972.54 instead. It's actually a bit weird that no one else seems to be bothered by this, since the underscore messes up a lot of downstream programs; maybe it is the lack of hg19 use? What would be really useful is a way to replace only when the term starts with a space followed by ^ENS, ^OTT or ^ENST. I think that would be a safer route in case some other important nomenclature uses this pattern.

ADD REPLY • link 2.3 years ago by simplitia ▴ 130
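
The prefix-based restriction asked for here could be sketched roughly as follows. It assumes that every ID of interest is a quoted value starting with ENS or OTT, followed by optional capital letters, digits, a dot and a version number; the function name `strip_lift_suffix` is hypothetical. Anything that does not fit that shape, such as a gene name like GENENAME_2, is left untouched.

```shell
# Strip the trailing "_<number>" only from quoted, versioned IDs that
# start with ENS or OTT, e.g. "ENSG00000223972.5_4" -> "ENSG00000223972.5".
# Reads the files given as arguments, or stdin when none are given.
strip_lift_suffix() {
  sed -E 's/"((ENS|OTT)[A-Z]*[0-9]+\.[0-9]+)_[0-9]+"/"\1"/g' "$@"
}

# Demonstration on one made-up attribute string.
printf 'gene_id "ENSG00000223972.5_4"; gene_name "GENENAME_2"; havana_gene "OTTHUMG00000000961.2_4";\n' \
  | strip_lift_suffix
# prints: gene_id "ENSG00000223972.5"; gene_name "GENENAME_2"; havana_gene "OTTHUMG00000000961.2";
```

Because the pattern is anchored on both surrounding quotes, it only rewrites the attribute text and leaves the tab-separated GTF columns as they are.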

Updated the code. Please try the second one. The problem is not with the editing but with restoring the original GTF format: without going through multiple replacements, it is difficult to restore the original format with generic tools. Please post the expected output next time.

ADD REPLY • link 2.3 years ago by cpad0112 21k

Thanks, that is great. Do you know if it's possible to use OR statements to include the havana_gene and hgnc_id fields as well? That way I don't have to run it multiple times, once for each annotation.

ADD REPLY • link 2.3 years ago by simplitia ▴ 130

Can you post a single-line input and output example? I haven't seen versioning for hgnc_id. Try the following and post if there are any issues:

$ sed -r 's/(gene_id\s\W\w+\.[0-9]+)_[0-9]+/\1/;s/(havana_gene\s\W\w+\.[0-9]+)_[0-9]+/\1/' gencode.v37lift37.annotation.gtf
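
The OR asked about above can also be written as a single alternation over attribute keys, so one pass covers all of them. The sketch below assumes the five versioned keys that appear in the posted sample (gene_id, transcript_id, exon_id, havana_gene, havana_transcript). hgnc_id is left out on purpose because its values in the sample carry no suffix. The function name `strip_by_key` is made up for illustration.

```shell
# Strip the trailing "_<number>" from the versioned value of selected keys,
# e.g. gene_id "ENSG00000223972.5_4" -> gene_id "ENSG00000223972.5".
# Reads the files given as arguments, or stdin when none are given.
strip_by_key() {
  sed -E 's/((gene_id|transcript_id|exon_id|havana_gene|havana_transcript) "[^"]*\.[0-9]+)_[0-9]+"/\1"/g' "$@"
}

# Demonstration on one attribute string taken from the sample above.
printf 'gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000456328.2_1"; gene_name "DDX11L1"; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2_4";\n' \
  | strip_by_key
# prints: gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_name "DDX11L1"; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
```

To apply it to a whole file, something like `strip_by_key gencode.v37lift37.annotation.gtf > fixed.gtf` should work, where `fixed.gtf` is just a placeholder output name.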