I have a TSV file generated by abricate (https://github.com/tseemann/abricate). I need to convert it to GFF3 format, keeping some columns, reordering others, and deleting the rest.
We want to use these GFF3 files in downstream applications and pipe them into other tools, but we have not been able to work out the conversion.
Below are an example of my TSV file, the steps we think may be needed, and the desired output in GFF3 format.
Any help will be much appreciated.
"Petr Ponomarenko" could you please help?
Input tsv file:
FILE SEQUENCE START END GENE COVERAGE COVERAGE_MAP GAPS %COVERAGE %IDENTITY DATABASE ACCESSION PRODUCT
UBird_Cyou_D3.fna BJCZ01000001.1 1866608 1867417 cdtB 1-810/810 =============== 0/0 100 90 vfdb CAD48850 (cdtB) cytolethal distending toxin B [CDT (VF0185)] [Escherichia coli O157:H str. 493/89]
UBird_Cyou_D3.fna BJCZ01000001.1 1867414 1868190 cdtA 1-777/777 =============== 0/0 100 90.61 vfdb CAD48849 (cdtA) cytolethal distending toxin A [CDT (VF0185)] [Escherichia coli O157:H str. 493/89]
UBird_Cyou_D3.fna BJCZ01000001.1 2245186 2246238 ompA 1-1041/1041 ========/====== 1/12 100 94.11 vfdb AAF37887 (ompA) outer membrane protein A [OmpA (VF0236)] [Escherichia coli O18:K1:H7 str. RS218]
What we may need to do (there may be other ways too, I am not sure):
- Row 1 (always starts with "#") - Need to replace with the string "##gff-version 3"
- Col 1 - get rid of ".fna" and retain other data
- Insert new Col - print the string from Col 11 for all rows
- Col 2 - get rid of entire column
- Insert new Col - print "CDS" for all rows
- Col 3 - retain data
- Col 4 - retain data
- Insert new Col and print "." for all rows
- Insert new Col and print "+" for all rows
- Insert new Col and print "0" for all rows
- Col 5 to Col 10 - get rid of all these columns and data
- Col 11 - delete column
- Col 13 - retain data except "(", ")", "[", "]"
- Add new Col - starts with "ID=" followed by the string from Col 1 with an underscore appended (for the example data, "UBird_Cyou_D3_") and a number that starts at 1 and increments by 1 for each row. This is followed by the separator ";" and then "product=" plus the data from the same row of the modified Col 13. When complete, this column should look like "ID=UBird_Cyou_D3_1;product=cdtB cytolethal distending toxin B CDT VF0185 Escherichia coli O157:H str. 493/89"
Desired final output (*.gff3) considering the example data:
UBird_Cyou_D3 vfdb CDS 187 756 . + 0 ID=UBird_Cyou_D3_1;product=cdtB cytolethal distending toxin B CDT VF0185 Escherichia coli O157:H str. 493/89
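For reference, the steps listed above could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions: the input is tab-separated, the first non-empty line is the header (with or without a leading "#"), and all rows come from a single input file so that one running counter is enough. The function name `abricate_to_gff3` and the `EXAMPLE` data are made up for illustration. Columns are looked up by header name (FILE, START, END, DATABASE, PRODUCT) rather than by position, so an extra column in the header would not shift them.

```python
import re


def abricate_to_gff3(tsv_text):
    """Turn abricate TSV text into GFF3 text, following the steps listed in the question."""
    lines = [ln for ln in tsv_text.splitlines() if ln.strip()]
    gff = ["##gff-version 3"]  # replaces the header row
    if not lines:
        return "\n".join(gff) + "\n"

    # Assumption: the first non-empty line is the header; drop a leading "#" if present.
    header = [name.lstrip("#") for name in lines[0].split("\t")]

    for n, line in enumerate(lines[1:], start=1):
        rec = dict(zip(header, line.split("\t")))

        # Col 1: drop the ".fna" suffix, keep the rest.
        sample = rec["FILE"]
        if sample.endswith(".fna"):
            sample = sample[: -len(".fna")]

        # Col 13: remove "(", ")", "[" and "]" from the product text.
        product = re.sub(r"[()\[\]]", "", rec["PRODUCT"]).strip()

        attrs = f"ID={sample}_{n};product={product}"
        gff.append("\t".join([
            sample,            # file name without ".fna"
            rec["DATABASE"],   # string from Col 11
            "CDS",             # fixed feature type
            rec["START"],
            rec["END"],
            ".",               # score
            "+",               # strand, fixed as requested
            "0",               # phase, fixed as requested
            attrs,
        ]))
    return "\n".join(gff) + "\n"


# Two rows from the example data, rebuilt with explicit tabs.
EXAMPLE = "\n".join("\t".join(fields) for fields in [
    ["#FILE", "SEQUENCE", "START", "END", "GENE", "COVERAGE", "COVERAGE_MAP",
     "GAPS", "%COVERAGE", "%IDENTITY", "DATABASE", "ACCESSION", "PRODUCT"],
    ["UBird_Cyou_D3.fna", "BJCZ01000001.1", "1866608", "1867417", "cdtB",
     "1-810/810", "===============", "0/0", "100", "90", "vfdb", "CAD48850",
     "(cdtB) cytolethal distending toxin B [CDT (VF0185)] "
     "[Escherichia coli O157:H str. 493/89]"],
    ["UBird_Cyou_D3.fna", "BJCZ01000001.1", "1867414", "1868190", "cdtA",
     "1-777/777", "===============", "0/0", "100", "90.61", "vfdb", "CAD48849",
     "(cdtA) cytolethal distending toxin A [CDT (VF0185)] "
     "[Escherichia coli O157:H str. 493/89]"],
])

print(abricate_to_gff3(EXAMPLE), end="")
```

To run it on a real file, the text could be read with `open("input.tsv").read()` and the returned string written to a `.gff3` file. Note that the sketch puts the file name in the first GFF3 column, as requested, rather than the contig name from Col 2, and that it does not escape characters such as ";" or "=" inside the product text, which some GFF3 parsers may expect.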