Repeat Relationships in GFF3
13.2 years ago

I just ran repseek on a roughly 100 kb BAC sequence and got this output.

Distant.dir     27938   36273   1434    1433    6901    28288-36623-148-2.00    94.491  1240.45 2.00    2       1.00
Distant.dir     47964   55552   2765    2771    4823    48291-55879-127-2.00    97.367  2483.76 2.00    2       1.00

There is a script available from DAWGPAWS to convert repseek output to GFF3 format, so I ran it and got the following.

$ ./cnv_repseek2gff.pl < repseek.out 
Expecting input from STDIN
seq    repseek    direct_repeat    27938    29372    1240.45    +    .    repseek1:dir
seq    repseek    direct_repeat    36273    37706    1240.45    +    .    repseek1:dir
seq    repseek    direct_repeat    47964    50729    2483.76    +    .    repseek2:dir
seq    repseek    direct_repeat    55552    58323    2483.76    +    .    repseek2:dir

Ok, this format is much more familiar. However, the last column (attributes) is not valid GFF3. This script creates two lines of GFF3 for each line in the repseek output. How are these pairs of features related and what is the proper way to represent that relationship in GFF3?

repeats gff • 2.2k views
13.2 years ago

I would try coding this using a parent-child relationship, e.g.

seq    repseek    direct_repeat    27938    37706    .    +    .    ID=repseek1
seq    repseek    repeat_unit    27938    29372    1240.45    +    .    ID=repseek1.1;Parent=repseek1
seq    repseek    repeat_unit    36273    37706    1240.45    +    .    ID=repseek1.2;Parent=repseek1
seq    repseek    direct_repeat    47964    58323    .    +    .    ID=repseek2
seq    repseek    repeat_unit    47964    50729    2483.76    +    .    ID=repseek2.1;Parent=repseek2
seq    repseek    repeat_unit    55552    58323    2483.76    +    .    ID=repseek2.2;Parent=repseek2

It looks like the GFF3 validator is currently down (not loading the so.obo file), so I can't tell if this is "valid" GFF3. I'll send them a mail and update this post when it is back online.
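
For anyone who wants to generate records in this shape, here is a minimal Python sketch, not a tested converter. The function name `repeat_pair_to_gff3` and its arguments are hypothetical, the feature types `direct_repeat` and `repeat_unit` are simply the ones used in the example above, and the output has not been checked against the validator.

```python
def repeat_pair_to_gff3(seqid, index, copy1, copy2, score, source="repseek"):
    """Return three tab-delimited GFF3 lines: one parent and its two children.

    copy1 and copy2 are (start, end) tuples; score is kept as a string so
    its formatting is preserved. All names here are illustrative.
    """
    parent_id = f"repseek{index}"
    # The parent spans from the start of the first copy to the end of the last.
    parent_start = min(copy1[0], copy2[0])
    parent_end = max(copy1[1], copy2[1])
    rows = [(seqid, source, "direct_repeat", parent_start, parent_end,
             ".", "+", ".", f"ID={parent_id}")]
    # Each copy gets its own unique ID and points back to the parent.
    for n, (start, end) in enumerate((copy1, copy2), start=1):
        rows.append((seqid, source, "repeat_unit", start, end, score,
                     "+", ".", f"ID={parent_id}.{n};Parent={parent_id}"))
    return ["\t".join(str(value) for value in row) for row in rows]


lines = repeat_pair_to_gff3("seq", 1, (27938, 29372), (36273, 37706), "1240.45")
print("\n".join(lines))
```

The child IDs stay unique while the shared `Parent` value carries the pairing, which is the point of the layout above.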

That's a nice way to indicate the relationship, given what's available in GFF.

13.2 years ago
brentp 24k

Looks like that script was last edited in 2007, so its output is probably GFF2. Just replace the : with = in the last column and it may be valid. Regarding the format, this part of the code in cnv_repseek2gff.pl shows what's going on:

my $copy1_start = $rep_parts[1];
my $copy2_start = $rep_parts[2];
my $copy1_end = $copy1_start + $rep_parts[3];
my $copy2_end = $copy2_start + $rep_parts[4];

So the columns are start1, start2, len1, len2. That's how you get two GFF rows per repseek row.
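
The same arithmetic can be sketched in Python to check it against the output in the question. The function name `repseek_row_to_copies` is made up for illustration, and the column layout is only what the Perl snippet above implies. Note that the script computes end as start + length with no -1 adjustment; whether that is right for GFF3's inclusive end coordinate depends on how repseek defines its positions, which this thread does not settle.

```python
def repseek_row_to_copies(line):
    """Split one repseek output row into its two repeat copies.

    Returns ((start1, end1), (start2, end2)).
    """
    fields = line.split()
    # Layout implied by the Perl snippet: fields[1] and fields[2] are the
    # starts of copy 1 and copy 2, fields[3] and fields[4] their lengths.
    start1, start2, len1, len2 = (int(f) for f in fields[1:5])
    # Mirrors the script exactly: end = start + length.
    return (start1, start1 + len1), (start2, start2 + len2)


row = ("Distant.dir     27938   36273   1434    1433    6901    "
       "28288-36623-148-2.00    94.491  1240.45 2.00    2       1.00")
print(repseek_row_to_copies(row))  # ((27938, 29372), (36273, 37706))
```

These are the same start and end values as the first two GFF rows in the question.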

EDIT: in response to comment

The relationship is determined by the shared value in the attribute column.

@brentp I understand attribute syntax for GFF3. My question is more focused on how to maintain the relationship between the two features.

Doesn't the repseek# indicate the relationship?

@brentp Yes, I know the relationship is determined by the shared value in the attribute column, but (back to my original question)...how do I represent that in GFF3 format? These features need to have their own unique ID attributes, so some other attribute is required to maintain the relationship between them. Is there an attribute key (in terms of ontology) that is appropriate here?

@brentp I guess I could just give them all unique ID attributes and then give matching pairs the same Name attribute to maintain the relationship. I'm just wondering if there is a better way of doing this.
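
A rough Python sketch of that fallback, assuming unique IDs plus a shared Name, shows how the pairs could be recovered downstream. The helper names and the IDs (`rep1a`, `rep1b`, and so on) are hypothetical, and the parser only handles simple `key=value;key=value` attribute columns.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse a simple GFF3 attribute column such as 'ID=a;Name=b' into a dict."""
    return dict(pair.split("=", 1) for pair in column9.strip().split(";") if pair)


def group_by_attribute(gff3_lines, key="Name"):
    """Map each value of `key` to the list of feature IDs that carry it."""
    groups = defaultdict(list)
    for line in gff3_lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        attributes = parse_attributes(columns[8])
        if key in attributes:
            groups[attributes[key]].append(attributes.get("ID"))
    return dict(groups)


# Hypothetical records: IDs are unique, Name is shared within a pair.
example = [
    "seq\trepseek\tdirect_repeat\t27938\t29372\t1240.45\t+\t.\tID=rep1a;Name=repseek1",
    "seq\trepseek\tdirect_repeat\t36273\t37706\t1240.45\t+\t.\tID=rep1b;Name=repseek1",
    "seq\trepseek\tdirect_repeat\t47964\t50729\t2483.76\t+\t.\tID=rep2a;Name=repseek2",
    "seq\trepseek\tdirect_repeat\t55552\t58323\t2483.76\t+\t.\tID=rep2b;Name=repseek2",
]
print(group_by_attribute(example))
```

This works, but the pairing lives only in a naming convention that the reader of the file has to know about.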
