Question: How to group features based on ID in gff
1
gravatar for markcharder
3.7 years ago by
markcharder10
Australia
markcharder10 wrote:

Hi all,

What I want to do seems like it should be relatively simple but I lack the programming know-how to do it and have not found any examples anywhere on the internet, so I really hope someone can help!

I currently have a GFF file that is missing parent lines, with examples like so:

sequence  source  EST_match  1  20  .  -  0  ID=ID1;

sequence  source  EST_match  25  30  .  -  0  ID=ID2;

sequence  source  EST_match  35  50  .  -  0  ID=ID2;

sequence  source  EST_match  90  110  .  -  0  ID=ID3;

In this case, the two features labelled ID=2 should be a part of the same parent feature. So what I want to do is remove IDs from features with the same IDs and create a parent feature that appears before them in the gff with coordinates spanning the child feature coordinates, like so:

sequence  source  EST_match  1  20  .  -  0  ID=ID1;

sequence  source  EST_match  25  50  .  -  0  ID=2;

sequence  source  EST_match  25  30  .  -  0  Parent=2

sequence  source  EST_match  35  50  .  -  0  Parent=2

sequence  source  EST_match  90  110  .  -  0  ID=ID3;

I have written something in perl that compares the current line with the previous and next lines and skipps the ID field if lines contain the same IDs but I cannot seem to work out how to create parent IDs.

Looking forward to any suggestions!

Thank you.

Mark

id gff format gff3 • 1.4k views
ADD COMMENTlink modified 3.7 years ago by a.zielezinski8.7k • written 3.7 years ago by markcharder10

I've seen a few different conventions for encoding alignments in GFF3. The first example you gave is perfectly valid (as far as the GFF3 specification is concerned: individual tools may have specific requirements).

sequence  source  EST_match  1   20   .  -  .  ID=ID1;
sequence  source  EST_match  25  30   .  -  .  ID=ID2;
sequence  source  EST_match  35  50   .  -  .  ID=ID2;
sequence  source  EST_match  90  110  .  -  .  ID=ID3;

The two lines with ID=ID2 represent a single alignment. In GFF3, this is called a "multi-feature", a feature that is discontiguous and thus requires multiple entries to specify its structure. You see this frequently with CDS features in protein-coding genes.

Another common convention is as follows.

sequence  source  EST_match  1   20   .  -  .  ID=ID1;
sequence  source  match_part 1   20   .  -  .  Parent=ID1;
sequence  source  EST_match  25  50   .  -  .  ID=ID2;
sequence  source  match_part 25  30   .  -  .  Parent=ID2;
sequence  source  match_part 35  50   .  -  .  Parent=ID2;
sequence  source  EST_match  90  110  .  -  .  ID=ID3;
sequence  source  match_part 90  110  .  -  .  Parent=ID3;

I don't think it's valid (with respect to the Sequence Ontology) to have an EST_match feature be a child of another EST_match feature. Also, the 8th column (phase) is only relevant to CDS features, and should be a period/full stop for all other feature types, not 0.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Daniel Standage3.9k
4
gravatar for a.zielezinski
3.7 years ago by
a.zielezinski8.7k
a.zielezinski8.7k wrote:

Save the following Python script as parentsGFF.py

from itertools import groupby
import re
import sys

fh = open(sys.argv[1])
oh = open(sys.argv[1]+'.out', 'w')

for id, ls in groupby(fh, lambda x: re.split('\s+', x)[8]):
   ls = [re.split('\s+', l.strip()) for l in ls]
   if len(ls)==1:
       oh.write("\t".join(ls[0])+'\n')
   else:
       st = min([int(l[3]) for l in ls])
       en = max([int(l[4]) for l in ls])
       fl = ls[0][:]
       fl[3] = str(st)
       fl[4] = str(en)
       id = fl[8].split('=')[1]
       oh.write("\t".join(fl)+'\n')
       for l in ls:
           l[8] = 'Parent='+id
           oh.write("\t".join(l)+'\n')
fh.close()
oh.close()

 

Input GFF file: test.gff

sequence  source  EST_match  1  20  .  -  0  ID=ID1;
sequence  source  EST_match  25  30  .  -  0  ID=ID2;
sequence  source  EST_match  35  50  .  -  0  ID=ID2;
sequence  source  EST_match  90  110  .  -  0  ID=ID3;

 

Run it:

python parentsGFF.py test.gff

 

Output GFF file: test.gff.out

sequence    source    EST_match    1    20    .    -    0    ID=ID1;
sequence    source    EST_match    25    50    .    -    0    ID=ID2;
sequence    source    EST_match    25    30    .    -    0    Parent=ID2;
sequence    source    EST_match    35    50    .    -    0    Parent=ID2;
sequence    source    EST_match    90    110    .    -    0    ID=ID3;
ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by a.zielezinski8.7k
I dont know what is the purpose of this file, but some tools will complain if not all the features (lines) have an ID. So it would be good to add that. Moreover, in similar cases of yours, usualy the childs are called match_part (second row). But I don't think it will affect dowstream analysis.
ADD REPLYlink written 3.7 years ago by Juke-342.2k

Hi, thanks a lot for that, it's almost there! Unfortunately however, it writes the coordinates of the first child sequence as "25 50" in the example, when it should be "25 30". I have no idea how to understand Python as have just started with Perl, sorry.

ADD REPLYlink written 3.7 years ago by markcharder10

Sorry about that. I just modified my answer.

ADD REPLYlink written 3.7 years ago by a.zielezinski8.7k

Hi, thanks again, managed to fix the bug (below). This is a huge problem-solver for me. Cheers,

Mark

ADD REPLYlink written 3.7 years ago by markcharder10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1830 users visited in the last hour