Question

Given gene ID and genomic coordinates, how can I create a GFF formatted file?


9.8 years ago

Jason ▴ 920

I have downloaded a list of coordinates of yeast genes from Xu et al., 2009 (see table S3). Unfortunately, it is not in a standard format, so it does not appear to be compatible with the programs I would like to use, e.g. HOMER, bedops or bedtools. I was wondering if anyone could help me get it into GFF format using unix tools or R (other languages are also welcome if the code is just copy and paste)? I tried to recreate what I saw on the Ensembl website, but those programs still did not recognize it as GFF. Here is the beginning of the file (there are actually ~7K lines):

ID    chr    strand    start    end    type    name    commonName    endConfidence    source
ST0001    1    +    9369    9601    SUTs    SUT001    SUT001    bothEndsMapped    Manual
ST0002    1    +    30073    30905    CUTs    CUT001    CUT001    bothEndsMapped    Automatic
ST0003    1    +    31153    32985    ORF-T    YAL062W    GDH3    bothEndsMapped    Manual
ST0004    1    +    33361    34897    ORF-T    YAL061W    BDH2    bothEndsMapped    Manual
ST0005    1    +    35097    36393    ORF-T    YAL060W    BDH1    bothEndsMapped    Manual
ST0006    1    +    36545    37329    ORF-T    YAL059W    ECM1    bothEndsMapped    Manual
ST0007    1    +    37409    39033    ORF-T    YAL058W    CNE1    bothEndsMapped    Manual
ST0008    1    +    39217    41969    ORF-T    YAL056W    GPB2    bothEndsMapped    Manual
ST0009    1    +    42161    42833    ORF-T    YAL055W    PEX22    bothEndsMapped    Manual

HOMER bedtools R GFF unix



I am very new to bioinformatics, but it looks to me like this task could be done simply by using regular expressions to extract the information you need and reformat it.

Separate the chromosome number (chrX), start and end with tabs (\t).

Comment by muhe1985 · 9.8 years ago

Ram · Answer 1 · 2014-06-19

Many file formats in bioinformatics - "standard" or otherwise - are simply delimited files which contain a subset of commonly-used columns, in a different order. Learning to recognise and rearrange these columns is, I suppose, a core bioinformatics skill. So rather than tell you how to do it, let's try and get you to think about how to approach the task more generally.

You would like GFF. So first, you consult the document which describes GFF3. It tells you that there are 9 columns: seqid, source, type, start, end, score, strand, phase and attributes. Read carefully and make sure you understand what can go into each one.

Now, look at what you have and consider how best to map that to GFF3. Straight away you can see:

chr    => seqid
source => source
start  => start
end    => end
strand => strand

Looks like your features are genes, so type = "gene". You have neither score nor phase, so both of those can be ".". ID, name and commonName can all be expressed in column 9 as key/value attributes.
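
As a rough illustration of the column 9 idea, here is a minimal Python sketch that builds the attribute string from ID, name and commonName. The function names are made up for this example, and mapping commonName to the GFF3 `Alias` tag is my assumption, not something the table or the paper dictates. The escaping covers only the characters that GFF3 reserves in column 9.

```python
def gff3_escape(value):
    """Percent-encode characters that are reserved in GFF3 column 9."""
    # '%' is replaced first so the escapes added afterwards are not re-escaped.
    for ch in "%;=&,\t\n":
        value = value.replace(ch, "%{:02X}".format(ord(ch)))
    return value


def make_attributes(feature_id, name, common_name):
    """Build a GFF3 attribute string such as 'ID=...;Name=...;Alias=...'."""
    pairs = [("ID", feature_id), ("Name", name)]
    # Assumption: only add an Alias when it says something Name does not.
    if common_name and common_name != name:
        pairs.append(("Alias", common_name))
    return ";".join(f"{key}={gff3_escape(val)}" for key, val in pairs)


print(make_attributes("ST0003", "YAL062W", "GDH3"))
# ID=ST0003;Name=YAL062W;Alias=GDH3
```

For rows like ST0001, where name and commonName are identical, this sketch simply omits the Alias.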

Now you just have to rearrange and create the column content. This is quite easy using e.g. awk, Perl or R, so that's an exercise for you or another answer.
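
For anyone who wants a starting point, the rearrangement could be sketched in Python roughly as follows. This is only a sketch under some assumptions: the input is whitespace-delimited with the header shown in the question, no field contains a space, every feature is written with type "gene", and `table_to_gff3` is a name invented for this example. Whether the chromosome names ("1", "2", ...) match the ones your genome files use is something you would need to check separately.

```python
import io


def table_to_gff3(in_handle, out_handle, feature_type="gene"):
    """Rewrite the whitespace-delimited table as tab-delimited GFF3 lines."""
    out_handle.write("##gff-version 3\n")
    header = in_handle.readline().split()
    for line in in_handle:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        row = dict(zip(header, fields))
        # Assumption: commonName goes into the Alias tag; no escaping is done here.
        attributes = "ID={};Name={};Alias={}".format(
            row["ID"], row["name"], row["commonName"]
        )
        gff = [
            row["chr"],     # seqid
            row["source"],  # source
            feature_type,   # type
            row["start"],   # start
            row["end"],     # end
            ".",            # score (not available)
            row["strand"],  # strand
            ".",            # phase (not available)
            attributes,     # attributes
        ]
        out_handle.write("\t".join(gff) + "\n")


# Small demonstration using two lines from the question.
demo = io.StringIO(
    "ID chr strand start end type name commonName endConfidence source\n"
    "ST0003 1 + 31153 32985 ORF-T YAL062W GDH3 bothEndsMapped Manual\n"
)
result = io.StringIO()
table_to_gff3(demo, result)
print(result.getvalue(), end="")
```

To convert the real file, pass opened file handles instead of the `io.StringIO` objects. Note that the output columns are separated by real tab characters, which GFF requires.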