Question

Making Zero based coordinate system in gff file

0

Entering edit mode

5.8 years ago

1234anjalianjali1234 ▴ 50

Hellow,

I am doing gene wise analysis of intron and exon distribution, for this i have to convert my gff file information (which is based on chromosome) according to genes. I have 1 base coordinate file but Gene Structure Display Server input file should contain 0-based bed file.

##gff-version               
##sequence-region   ch00    1   20852292    
##sequence-region   ch01    1   98455869    
##sequence-region   ch02    1   55977580    
##sequence-region   ch03    1   72290146    
##sequence-region   ch04    1   66557038    


ch01    15504164    15510115    Gene1   gene
ch01    15504164    15510115    Gene1   mRNA
ch01    15504164    15504532    Gene1   three_prime_UTR
ch01    15504164    15505513    Gene1   exon
ch01    15504533    15505513    Gene1   CDS
ch01    15505592    15506740    Gene1   exon
ch01    15505592    15506740    Gene1   CDS
ch01    15507158    15507454    Gene1   exon
ch01    15507158    15507454    Gene1   CDS
ch01    15507599    15508670    Gene1   exon
ch01    15507599    15508670    Gene1   CDS
ch01    15509634    15510115    Gene1   exon
ch01    15509634    15510115    Gene1   CDS
ch01    72699527    72702973    Gene2   gene
ch01    72699527    72702973    Gene2   mRNA
ch01    72699527    72699756    Gene2   exon
ch01    72699527    72699756    Gene2   CDS
ch01    72699765    72699869    Gene2   exon
ch01    72699765    72699869    Gene2   CDS
ch01    72699915    72700248    Gene2   exon
ch01    72699915    72700248    Gene2   CDS
ch01    72700436    72700771    Gene2   exon
ch01    72700436    72700771    Gene2   CDS
ch01    72701150    72702213    Gene2   exon
ch01    72701150    72702213    Gene2   CDS
ch01    72702472    72702973    Gene2   exon
ch01    72702472    72702973    Gene2   CDS
ch04    54476287    54481244    Gene3   gene
ch04    54476287    54481244    Gene3   mRNA
ch04    54476287    54477248    Gene3   three_prime_UTR
ch04    54476287    54477746    Gene3   exon
ch04    54477249    54477746    Gene3   CDS
ch04    54477873    54479243    Gene3   exon
ch04    54477873    54479243    Gene3   CDS
ch04    54479340    54480432    Gene3   exon
ch04    54479340    54480432    Gene3   CDS
ch04    54480535    54481037    Gene3   CDS
ch04    54480535    54481244    Gene3   exon
ch04    54481038    54481244    Gene3   five_prime_UTR

I want to treat each gene individually i.e. for Gene1:

ch01    0    5951    Gene1   gene

and so on for cds and exons.

Is there any way to do it?

Thankyou

gff gene • 1.9k views

ADD COMMENT • link 5.8 years ago by 1234anjalianjali1234 ▴ 50

0

Entering edit mode

What is the input file format using for conversion? Is it .gff or .bed (1-based). [Clarification is needed because you have pasted bed file as an example.]

If it is .gff you can do it using gff2bed program implemented in BEDOPS

ADD REPLY • link 5.8 years ago by Nitin Narwade ★ 1.6k

0

Entering edit mode

i have gff file, but my problem is not converting .gff into .bed. my problem is to treat each gene individually apart from its position on chromosome, i want to start each gene with 0 and end with its corresponding value using gff file. For example, if my gene1 is on chromosome 4 and it starts from 568912 and ends at 569812, what i want is to start it from 0 and ends at 900 and do the same for its exon/CDS/intron respectively. I do not care about its chromosome position right know. I mentioned earlier that i want to use Gene Structure Display Server for my exon/intron visualization.

I hope you will get my point. Thankyou

ADD REPLY • link 5.8 years ago by 1234anjalianjali1234 ▴ 50

0

Entering edit mode

Thank you for the clarification 1234anjalianjali1234.

Can you please post sample data from your real gff file, because it would be easy to write or test the code. OR do you want to work on the data which you have posted in the question?

ADD REPLY • link 5.8 years ago by Nitin Narwade ★ 1.6k

0

Entering edit mode

Thankyou Nitin, I want to work on this data only, which I have extracted according to my genes of interest from my original gff file.

as per your request my sample gff is:

SL3.0ch00   maker_ITAG  gene    16480   17940   .   +   .   ID=gene:Solyc00g005000.3;Alias=Solyc00g005000;Name=Solyc00g005000.3;length=1460
SL3.0ch00   maker_ITAG  mRNA    16480   17940   .   +   .   ID=mRNA:Solyc00g005000.3.1;Parent=gene:Solyc00g005000.3;Name=Solyc00g005000.3.1;_AED=0.08;Note=Eukaryotic aspartyl protease family protein (AHRD V3.3 *** AT3G20015.1);Dbxref=InterPro:IPR001461;Ontology_term=GO:0004190,GO:0006508
SL3.0ch00   maker_ITAG  exon    16480   16794   .   +   .   ID=exon:Solyc00g005000.3.1.1;Parent=mRNA:Solyc00g005000.3.1
SL3.0ch00   maker_ITAG  CDS 16480   16794   .   +   0   ID=CDS:Solyc00g005000.3.1.1;Parent=mRNA:Solyc00g005000.3.1
SL3.0ch00   maker_ITAG  exon    16879   17940   .   +   .   ID=exon:Solyc00g005000.3.1.2;Parent=mRNA:Solyc00g005000.3.1
SL3.0ch00   maker_ITAG  CDS 16879   17940   .   +   0   ID=CDS:Solyc00g005000.3.1.2;Parent=mRNA:Solyc00g005000.3.1
###
SL3.0ch00   maker_ITAG  gene    548344  551581  .   +   .   ID=gene:Solyc00g005040.3;Alias=Solyc00g005040;Name=Solyc00g005040.3;length=3237
SL3.0ch00   maker_ITAG  mRNA    548344  551581  .   +   .   ID=mRNA:Solyc00g005040.3.1;Parent=gene:Solyc00g005040.3;Name=Solyc00g005040.3.1;_AED=0.20;Note=Potassium channel (AHRD V3.3 *-* D0EM91_9ROSI);Dbxref=InterPro:IPR000595,Pfam:PF00027
SL3.0ch00   maker_ITAG  exon    548344  548703  .   +   .   ID=exon:Solyc00g005040.3.1.1;Parent=mRNA:Solyc00g005040.3.1
SL3.0ch00   maker_ITAG  five_prime_UTR  548344  548703  .   +   .   ID=five_prime_UTR:Solyc00g005040.3.1.0;Parent=mRNA:Solyc00g005040.3.1
SL3.0ch00   maker_ITAG  exon    549226  549319  .   +   .   ID=exon:Solyc00g005040.3.1.2;Parent=mRNA:Solyc00g005040.3.1
SL3.0ch00   maker_ITAG  five_prime_UTR  549226  549319  .   +   .   ID=five_prime_UTR:Solyc00g005040.3.1.1;Parent=mRNA:Solyc00g005040.3.1
SL3.0ch00   maker_ITAG  exon    550857  550944  .   +   .   ID=exon:Solyc00g005040.3.1.3;Parent=mRNA:Solyc00g005040.3.1
SL3.0ch00   maker_ITAG  five_prime_UTR  550857  550944  .   +   .   ID=five_prime_UTR:Solyc00g005040.3.1.2;Parent=mRNA:Solyc00g005040.3.1
SL3.0ch00   maker_ITAG  five_prime_UTR  551033  551041  .   +   .   ID=five_prime_UTR:Solyc00g005040.3.1.3;Parent=mRNA:Solyc00g005040.3.1
SL3.0ch00   maker_ITAG  exon    551033  551131  .   +   .   ID=exon:Solyc00g005040.3.1.4;Parent=mRNA:Solyc00g005040.3.1
SL3.0ch00   maker_ITAG  CDS 551042  551131  .   +   0   ID=CDS:Solyc00g005040.3.1.1;Parent=mRNA:Solyc00g005040.3.1
SL3.0ch00   maker_ITAG  exon    551217  551249  .   +   .   ID=exon:Solyc00g005040.3.1.5;Parent=mRNA:Solyc00g005040.3.1
SL3.0ch00   maker_ITAG  CDS 551217  551249  .   +   0   ID=CDS:Solyc00g005040.3.1.2;Parent=mRNA:Solyc00g005040.3.1
SL3.0ch00   maker_ITAG  CDS 551342  551575  .   +   0   ID=CDS:Solyc00g005040.3.1.3;Parent=mRNA:Solyc00g005040.3.1
SL3.0ch00   maker_ITAG  exon    551342  551581  .   +   .   ID=exon:Solyc00g005040.3.1.6;Parent=mRNA:Solyc00g005040.3.1
SL3.0ch00   maker_ITAG  three_prime_UTR 551576  551581  .   +   .   ID=three_prime_UTR:Solyc00g005040.3.1.0;Parent=mRNA:Solyc00g005040.3.1

ADD REPLY • link 5.8 years ago by 1234anjalianjali1234 ▴ 50

0

Entering edit mode

Sorry for the late, I have written simple python code which may help you. I have tested the code on the data which you have commented above (.gff format).

This code will take .gff file as command line argument and convert the same into zero-based .bed format (as you were expecting).

#!/usr/bin/env python

## USAGE   : python prgram.py inputfileName.gff <outputFileName.bed>
## EXAMPLE : python prgram.py inputfile.gff
##                        OR
##           python prgram.py inputfile.gff output.bed

import re
import sys
import os.path

try:
    inputFileName = sys.argv[1]
    if(not os.path.isfile(inputFileName)):
        exit(0)
except:
    print("\n **ERROR: Please provide valid input file as commandline argument....!!\n\n **USAGE: python prgram.py inputfileName.gff <outputFileName.bed>\n")
    exit(0)

try:
    outputFileName = sys.argv[2]
except:
    outputFileName = inputFileName + "_zeroBased.bed"

dictGene = {}

geneName = ""
fread = open(inputFileName, "r")

for line in fread:
    line = line.strip()
    if(line == "" or line[0] == "#"):
        continue

    tempList = line.split("\t")

    if(re.match("gene", tempList[2], re.IGNORECASE)):
        regxSearch = re.match(r'ID\=gene\:(\S+)\;', tempList[8], re.IGNORECASE)
        geneName = regxSearch.group(1).split(";")[0]

    if(geneName in dictGene):
        dictGene[geneName].append(tempList[0] + "\t" + tempList[3] + "\t" + tempList[4] + "\t" + geneName + "\t" + tempList[2])
    else:
        dictGene[geneName] = []
        dictGene[geneName].append(tempList[0] + "\t" + tempList[3] + "\t" + tempList[4] + "\t" + geneName + "\t" + tempList[2])

fread.close();

fwrite = open(outputFileName, "w")

for key in sorted(dictGene):
    geneStart = int(dictGene[key][0].split("\t")[1])

    for e in dictGene[key]:
        if(re.search("UTR|CDS", e)):
            fwrite.write(e.split("\t")[0] + "\t" + str(abs(int(e.split("\t")[1]) - geneStart)) + "\t" + str(abs(int(e.split("\t")[2]) - geneStart)) + "\t" + e.split("\t")[4] + "\n")

fwrite.close()

ADD REPLY • link 5.8 years ago by Nitin Narwade ★ 1.6k

0

Entering edit mode

Thank you for your reply Nitin,

I tried this code but its giving me the output without the gene id.

SL3.0ch00   0   314 CDS
SL3.0ch00   399 1460    CDS
SL3.0ch00   0   359 five_prime_UTR
SL3.0ch00   882 975 five_prime_UTR
SL3.0ch00   2513    2600    five_prime_UTR
SL3.0ch00   2689    2697    five_prime_UTR
SL3.0ch00   2698    2787    CDS
SL3.0ch00   2873    2905    CDS
SL3.0ch00   2998    3231    CDS
SL3.0ch00   3232    3237    three_prime_UTR
SL3.0ch00   0   254 three_prime_UTR

ADD REPLY • link 5.8 years ago by 1234anjalianjali1234 ▴ 50

1

Entering edit mode

Just replace,

fwrite.write(e.split("\t")[0] + "\t" + str(abs(int(e.split("\t")[1]) - geneStart)) + "\t" + str(abs(int(e.split("\t")[2]) - geneStart)) + "\t" + e.split("\t")[4] + "\n")

This line in the code by this,

fwrite.write(e.split("\t")[0] + "\t" + str(abs(int(e.split("\t")[1]) - geneStart)) + "\t" + str(abs(int(e.split("\t")[2]) - geneStart)) + "\t" + e.split("\t")[4] + "\t" + key + "\n")

I thought gene names are not required for the Gene Structure Display Server so I haven't included the same in the output.

OR you can just append + "\t" + key + "\n" these contents to fwrite.write() statement.