Correcting coordinates in gff3 file
1
0
Entering edit mode
3 months ago
hulu • 0

I have gff3 files for several genomes where for some annotation features, the coordinates are one base more than the sequence length. When the file is uploaded to some tools, its giving error saying that the length of the contig is more than sequence length.

bgc_001 GN  contig  1   2606    .   .   .   ID=AQ2860336_000071;Name=AQ2860336_000071
bgc_002 GN  contig  1   26678   .   .   .   ID=AQ2860336_000072;Name=AQ2860336_000072
bgc_003 GN  contig  1   22812   .   .   .   ID=AQ2860336_000073;Name=AQ2860336_000073

bgc_001 .   cluster 1   2606    .   .   .   id=bgc_001.region001
bgc_001 .   element 1   2607    .   +   .   contig_edge='True'
bgc_001 .   region  1   2607    .   +   .   candidate_cluster_numbers='1'

bgc_002 .   cluster 1   26678   .   .   .   id=002.region001
bgc_002 .   element 1   26679   .   +   .   contig_edge='True';
bgc_002 .   region  1   26679   .   +   .   candidate_cluster_numbers='1'

bgc_003 .   cluster 643 22812   .   .   .   id=bgc_003.region001
bgc_003 .   element 643 22812   .   +   .   contig_edge='False'
bgc_003 .   region  643 22812   .   +   .   candidate_cluster_numbers='1'

In the above sample table, there are three contigs - bgc_001, bgc_002 and bgc_003. for bgc_001, the sequence length is 2606. But, for annotation feature "element" and "region" , the end is 2707. Similarly for bgc_002, its the same where the sequence length is 26678 but for "element" and "region" , the end is 26679. For bgc_003, everything looks good.

How can i read sequence length from top rows and check end for features "element" and "region" . If its more than the sequence length, correct it to sequence maximum length.

for example, the output will be as

bgc_001 GN  contig  1   2606    .   .   .   ID=AQ2860336_000071;Name=AQ2860336_000071
bgc_002 GN  contig  1   26678   .   .   .   ID=AQ2860336_000072;Name=AQ2860336_000072
bgc_003 GN  contig  1   22812   .   .   .   ID=AQ2860336_000073;Name=AQ2860336_000073

bgc_001 .   cluster 1   2606    .   .   .   id=bgc_001.region001
bgc_001 .   element 1   2606    .   +   .   contig_edge='True'
bgc_001 .   region  1   2606    .   +   .   candidate_cluster_numbers='1'

bgc_002 .   cluster 1   26678   .   .   .   id=002.region001
bgc_002 .   element 1   26678   .   +   .   contig_edge='True';
bgc_002 .   region  1   26678   .   +   .   candidate_cluster_numbers='1'

bgc_003 .   cluster 643 22812   .   .   .   id=bgc_003.region001
bgc_003 .   element 643 22812   .   +   .   contig_edge='False'
bgc_003 .   region  643 22812   .   +   .   candidate_cluster_numbers='1'
awk unix gff3 python • 671 views
ADD COMMENT
0
Entering edit mode

Should be doable with an array, but before overthinking this: To me, it looks like an issue with 0 and 1-based coordinates if the delta is always just 1? In that case, I personally would aim at shifting all entries in the GTF by -1, rather than selectively shortening those that are too long?

ADD REPLY
0
Entering edit mode

Not all features are incorrect. Other features represent correctly except for these two. I have only given few features and short file as example. Can you point me how to use array to solve this.

ADD REPLY
0
Entering edit mode

That is the point. You don't notice, because only some exceed the maximum length, but if you used different tools, one of them might work with 0 and one with 1-based coordinates. But either way, I have posted a possible solution for the approach you envisioned.

ADD REPLY
2
Entering edit mode
3 months ago

Suppose your file is called example.gff, you can use

awk 'NR==FNR {if($3=="contig") a[$1]=$5; next} {if($5>a[$1]){$5=a[$1]}; print}' example.gff example.gff

The logic is as follows:

  • NR==FNR is true for the first pass of the file. Therefore, on the first pass if($3=="contig") a[$1]=$5; next is evaluated. If column $3 equals "contig", it stores the value of $5 (the maximum length) in an array a with the key $1 - the name of the contig. next skips to the next record, so the second action is not performed.
  • On the second pass, it is checked if any feature is longer than the length of the contig with {if($5>a[$1]){$5=a[$1]}. If so, the value gets updated with the stored maximum length from the array. In either case, the line is printed with print.
ADD COMMENT
0
Entering edit mode

Thank you for the code and explanation. I understood the logic of the code. what is the reason or logic for providing input file twice?

ADD REPLY
0
Entering edit mode

To account for entries that may appear before the respective contig record. In your example, all contig records were located directly at the top, but you didn't explicitly state that this applies to all of your files. Hence, it seemed less error-prone to read the whole file first to gather all contig information and then start replacing malformed entries on a second pass.

Usually, awk is so fast that it doesn't really matter if a file is read twice, but you are right, the second pass is not strictly required.

ADD REPLY

Login before adding your answer.

Traffic: 3131 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6