Question

Converting Abricate output (.tsv) to gff3 format

13 months ago

ghataksnd ▴ 20

Hello Everyone

I have a TSV file generated by Abricate (https://github.com/tseemann/abricate). I need to convert it to GFF3 format, retaining some columns, reordering others, and deleting the rest.

We want to use these GFF3 files in downstream applications and pipe them into other tools, but we have not been able to work out the conversion.

Below are an example of my TSV file, the steps we think are needed, and the desired output in GFF3 format.

Any help would be much appreciated.

"Petr Ponomarenko" could you please help?

Input tsv file:

#FILE   SEQUENCE    START   END GENE    COVERAGE    COVERAGE_MAP    GAPS    %COVERAGE   %IDENTITY   DATABASE    ACCESSION   PRODUCT
UBird_Cyou_D3.fna   BJCZ01000001.1  1866608 1867417 cdtB    1-810/810   =============== 0/0 100 90  vfdb    CAD48850    (cdtB) cytolethal distending toxin B [CDT (VF0185)] [Escherichia coli O157:H str. 493/89]
UBird_Cyou_D3.fna   BJCZ01000001.1  1867414 1868190 cdtA    1-777/777   =============== 0/0 100 90.61   vfdb    CAD48849    (cdtA) cytolethal distending toxin A [CDT (VF0185)] [Escherichia coli O157:H str. 493/89]
UBird_Cyou_D3.fna   BJCZ01000001.1  2245186 2246238 ompA    1-1041/1041 ========/====== 1/12    100 94.11   vfdb    AAF37887    (ompA) outer membrane protein A [OmpA (VF0236)] [Escherichia coli O18:K1:H7 str. RS218]

What we think needs to be done (there may be other ways; I am not sure):

Row 1 (always starts with "#") - Need to replace with the string "##gff-version 3"
Col 1 - get rid of ".fna" and retain other data
Insert new Col - print the string from Col 11 for all rows
Col 2 - get rid of entire column
Insert new Col - print "CDS" for all rows
Col 3 - retain data
Col 4 - retain data
Insert new Col and print "." for all rows
Insert new Col and print "+" for all rows
Insert new Col and print "0" for all rows
Col 5 to Col 10 - get rid of all these columns and data
Col 11 - delete column
Col 13 - retain data except "(", ")", "[", "]"
Add new Col - starts with "ID=" followed by the string from Col 1 plus an underscore (for the example data, "UBird_Cyou_D3_") and a number that starts at 1 and increments by 1 for each row. Append ";" and then "product=" followed by the modified Col 13 value from the same row. The finished column should look like "ID=UBird_Cyou_D3_1;product=cdtB cytolethal distending toxin B CDT VF0185 Escherichia coli O157:H str. 493/89"
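
The steps above can be sketched in Python as follows. This is only a minimal illustration, not a tested pipeline: the function name `abricate_to_gff3` is made up for this example, and it assumes the input is tab-separated with the 13 columns shown above, that the header is the only line starting with "#", and that one running counter for the whole file is what is wanted for the ID. START and END are copied unchanged from Col 3 and Col 4.

```python
import re


def abricate_to_gff3(lines):
    """Turn Abricate TSV lines into a list of GFF3 lines.

    `lines` is any iterable of strings (an open file works).
    Hypothetical helper written for this question only.
    """
    out = ["##gff-version 3"]  # replaces the "#FILE ..." header row
    counter = 0
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and the header row that starts with "#".
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        counter += 1
        seqid = re.sub(r"\.fna$", "", cols[0])       # Col 1 without ".fna"
        source = cols[10]                            # Col 11 (DATABASE)
        start, end = cols[2], cols[3]                # Col 3 and Col 4, unchanged
        product = re.sub(r"[()\[\]]", "", cols[12])  # Col 13 without ( ) [ ]
        attributes = f"ID={seqid}_{counter};product={product}"
        # Fixed values requested in the steps: type "CDS", score ".",
        # strand "+", phase "0".
        out.append("\t".join(
            [seqid, source, "CDS", start, end, ".", "+", "0", attributes]
        ))
    return out
```

Passing an open file handle for the TSV to `abricate_to_gff3` and writing the returned lines joined with newlines would give the .gff3 file. The sketch does not escape characters that GFF3 reserves inside attribute values (";", "=", ","), so products containing them would need extra handling.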

Desired final output (*.gff3) considering the example data:

##gff-version 3
UBird_Cyou_D3   vfdb    CDS 187 756 .   +   0   ID=UBird_Cyou_D3_1;product=cdtB cytolethal distending toxin B CDT VF0185 Escherichia coli O157:H str. 493/89

Abricate gff3 • 458 views

updated 9 months ago by Ram 43k • written 13 months ago by ghataksnd ▴ 20