How to change gene coordinates in a GTF file?
7 months ago
Info.shi ▴ 10

Hi, I have a GTF file in which I need to change the gene coordinates, on both the + and - strands, so that the UTR regions are excluded and each gene starts and ends at its CDS start and end coordinates.

My original GTF file:

Chr_3a transdecoder gene 26355 34213 . - . ID=MSTRG.7.5

Chr_3a transdecoder cds 33198 33363 . - 0 ID=MSTRG.7.5

Chr_3a transdecoder cds 30850 31322 . - 2 ID=MSTRG.7.5

Chr_3a transdecoder cds 29756 30785 . - 0 ID=MSTRG.7.5

Chr_3a transdecoder cds 29426 29679 . - 2 ID=MSTRG.7.5

Chr_3a transdecoder gene 13108235 13128245 . + . ID=MSTRG.1

Chr_3a transdecoder cds 13113822 13113951 . + 0 ID=MSTRG.1

Chr_3a transdecoder cds 13114050 13114146 . + 2 ID=MSTRG.1

Chr_3a transdecoder cds 13114259 13114432 . + 1 ID=MSTRG.1

Chr_3a transdecoder cds 13116046 13116286 . + 1 ID=MSTRG.1

Chr_3a transdecoder cds 13117096 13120860 . + 0 ID=MSTRG.1

Expected format

On the - strand:

Chr_3a transdecoder gene 29426 33363 . - . ID=MSTRG.7.5

Chr_3a transdecoder cds 33198 33363 . - 0 ID=MSTRG.7.5

Chr_3a transdecoder cds 30850 31322 . - 2 ID=MSTRG.7.5

Chr_3a transdecoder cds 29756 30785 . - 0 ID=MSTRG.7.5

Chr_3a transdecoder cds 29426 29679 . - 2 ID=MSTRG.7.5

On the + strand:

Chr_3a transdecoder gene 13113822 13120860 . + . ID=MSTRG.1

Chr_3a transdecoder cds 13113822 13113951 . + 0 ID=MSTRG.1

Chr_3a transdecoder cds 13114050 13114146 . + 2 ID=MSTRG.1

Chr_3a transdecoder cds 13114259 13114432 . + 1 ID=MSTRG.1

Chr_3a transdecoder cds 13116046 13116286 . + 1 ID=MSTRG.1

Chr_3a transdecoder cds 13117096 13120860 . + 0 ID=MSTRG.1

Kindly suggest how I can get the desired output. I am not good at programming, and changing the coordinates manually for all genes is very tedious.

Thank you

perl python R • 401 views

This problem seems to be less about removing UTR regions and more about finding the two ends of the CDS regions, i.e. a kind of merging of the CDS regions that belong to the same transcript.

Look at posts like these:

7 months ago
Shred ▴ 870

If you're not good at programming, this may be a hard task. I've worked on something similar, so I can suggest a way to do it.

Split your GTF file into strand-specific files (input.gtf below is a placeholder for your own file name):

awk -F'\t' '{if ($7=="+") print $0}' input.gtf > forward.gtf

awk -F'\t' '{if ($7=="-") print $0}' input.gtf > reverse.gtf


Then write a parser (Python would be easiest) in which you define a class to store each GTF feature. Something like:

Gene x/
├─ Transcript X.1/
├─ Transcript X.2/
│  ├─ 5'UTR
│  ├─ CDS
│  ├─ 3' UTR
├─ Transcript X.n/
│  ├─ ..
│  ├─ ..


Then iterate over each gene to access each transcript: here you'll subtract the UTR coordinates from the gene coordinates and rewrite the record. If you use dictionaries in Python to store the gene/transcript features, you can preserve insertion order and edit only the first/last CDS according to the UTR coordinates.

I wrote a parser some time ago that was intended to do the opposite thing: add a 3' UTR when it is missing from a GTF file. There you could find this data structure implemented, which is basically a nested OrderedDict in Python 3. But as you've said that your programming skills are not that good, maybe a better idea would be to pass this concept on to someone able to implement it.
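
For a file shaped like the example in the question (only gene and cds rows, sharing an ID), a much smaller sketch may be enough: set each gene's start and end to the smallest CDS start and largest CDS end with the same ID. Because start is always less than or equal to end in GTF regardless of strand, no strand split is needed for this. This is not the parser mentioned above, just a minimal illustration. It assumes tab-separated columns, an `ID=...` attribute in column 9, and IDs that match exactly between a gene and its CDS rows; the function names `feature_id` and `trim_genes_to_cds` are made up for the example.

```python
import re


def feature_id(attributes):
    """Return the value of ID=... from column 9, or None if absent.

    Assumes the 'ID=MSTRG.7.5' attribute style shown in the question.
    """
    match = re.search(r"ID=([^;\s]+)", attributes)
    return match.group(1) if match else None


def trim_genes_to_cds(lines):
    """Return the input rows with each gene's start/end replaced by the
    span (min start, max end) of the CDS rows that carry the same ID.

    Comment lines and rows with fewer than 9 tab-separated columns are
    passed through unchanged; genes without any CDS are left as they are.
    """
    rows = [line.rstrip("\n") for line in lines]

    # First pass: collect the CDS span for every ID.
    spans = {}
    for row in rows:
        cols = row.split("\t")
        if row.startswith("#") or len(cols) < 9:
            continue
        if cols[2].lower() == "cds":
            key = feature_id(cols[8])
            start, end = int(cols[3]), int(cols[4])
            low, high = spans.get(key, (start, end))
            spans[key] = (min(low, start), max(high, end))

    # Second pass: rewrite the gene rows, keep everything else.
    out = []
    for row in rows:
        cols = row.split("\t")
        if not row.startswith("#") and len(cols) >= 9 and cols[2].lower() == "gene":
            span = spans.get(feature_id(cols[8]))
            if span is not None:
                cols[3], cols[4] = str(span[0]), str(span[1])
                row = "\t".join(cols)
        out.append(row)
    return out


# Small demonstration using rows from the question (tab-separated here).
example = [
    "Chr_3a\ttransdecoder\tgene\t26355\t34213\t.\t-\t.\tID=MSTRG.7.5",
    "Chr_3a\ttransdecoder\tcds\t33198\t33363\t.\t-\t0\tID=MSTRG.7.5",
    "Chr_3a\ttransdecoder\tcds\t29426\t29679\t.\t-\t2\tID=MSTRG.7.5",
]
for row in trim_genes_to_cds(example):
    print(row)
```

To run it on a real file, you could pass an open file handle instead of the `example` list and write the returned rows to a new file. If your real GTF has several transcripts per gene, or separate UTR rows, the grouping key would need to change accordingly.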