Extract information from prokka gff files
2.5 years ago
Space_Life ▴ 50

Hi, I have hundreds of Prokka-annotated GFF files. I want to extract the ID, product, UniProt ID and gene name from every GFF file. All of this information is in the last column of the file when I open it in Excel. I tried converting the files to CSV and then extracting the information, but saving them one by one in CSV format takes a long time. Also, the CSV file only splits the data into 9 columns, so the last column still holds all the required information in a single cell. I am new to using R or Python.

  1. Extract the above-mentioned information from each GFF file (great if I can do this as a bulk operation on all the files together)
  2. Create a CSV file with the extracted information
  3. Bind all the files into one long CSV file

Kindly suggest code that could be used to do this. Thank you.


csv Prokka annotation gff • 1.6k views

If you are on Unix, you can use the awk command to extract specific columns. In your case I guess you want to extract the information for the genes only (a GFF file also holds other feature types, such as exon and intron). The feature type is in the third column.

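awk -F'\t' '$3=="gene"{print $9}' your_file.gff | sed 's/;/\t/g' > new.txt

here,

-F'\t' : splits each line on tabs only, so spaces inside the ninth column (for example in product names) do not break it apart
$3=="gene" : keeps only the rows that have gene in the third column
print $9 : prints the ninth column
sed 's/;/\t/g' : replaces every ; with a tab (\t); you can use , instead for a CSV file.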
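
Since the question asks for a bulk operation ending in one CSV, here is a rough Python sketch of the same idea. It rests on a few assumptions that should be checked against your own files: that the attributes of interest sit on CDS rows (as far as I know, Prokka only writes gene rows when run with --addgenes), that the keys are named ID, gene and product, and that the UniProt accession appears inside the inference attribute as UniProtKB:<accession>. The function names and the output file name are made up for illustration, and URL-encoded characters in attribute values are not decoded.

```python
import csv
import glob
import os
import re

# Column order of the combined CSV (names chosen for this sketch).
FIELDS = ["file", "ID", "gene", "product", "uniprot"]


def parse_attributes(col9):
    """Turn 'key=value;key=value' (GFF column 9) into a dict."""
    attrs = {}
    for part in col9.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key] = value
    return attrs


def parse_gff(path, feature="CDS"):
    """Return one dict per row whose third column equals `feature`."""
    rows = []
    with open(path) as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # the sequence section follows; no more annotations
            if line.startswith("#"):
                continue  # header or comment line
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 9 or cols[2] != feature:
                continue
            attrs = parse_attributes(cols[8])
            # Assumption: the accession is embedded as 'UniProtKB:XXXXXX'.
            match = re.search(r"UniProtKB:([^,;]+)", attrs.get("inference", ""))
            rows.append({
                "file": os.path.basename(path),
                "ID": attrs.get("ID", ""),
                "gene": attrs.get("gene", ""),
                "product": attrs.get("product", ""),
                "uniprot": match.group(1) if match else "",
            })
    return rows


def gffs_to_csv(pattern, out_path):
    """Parse every file matching `pattern` and write one combined CSV."""
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=FIELDS)
        writer.writeheader()
        for path in sorted(glob.glob(pattern)):
            writer.writerows(parse_gff(path))


if __name__ == "__main__":
    # Run from the folder that holds the .gff files.
    gffs_to_csv("*.gff", "all_annotations.csv")
```

Rows without a gene name or a UniProt hit (for example hypothetical proteins) get empty cells rather than being dropped. To read gene rows instead, pass feature="gene" to parse_gff.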