Question

Extract information from prokka gff files

2.5 years ago

Space_Life ▴ 50

Hi, I have hundreds of Prokka-annotated GFF files. I want to extract the ID, product, UniProt ID and gene name from every file. All of this information is in the last column when I open a file in Excel. I tried converting the files to CSV and then extracting the information, but saving them one by one takes a long time. Also, the CSV file still has only 9 columns, with all the required information in a single cell of the last column. I am new to R and Python.

Extract the above-mentioned information from each GFF file (ideally as a bulk operation on all the files at once).
Create a CSV file with the extracted information.
Bind all the files into one long CSV file.

Could you suggest code that I could use to do this? Thank you.

csv Prokka annotation gff • 1.6k views

ADD COMMENT • link updated 2.5 years ago by kashiff007 ★ 1.9k • written 2.5 years ago by Space_Life ▴ 50

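If you are on Unix, you can use awk to extract specific columns. In your case I guess you want the information for the protein-coding features only (a GFF file also has rows for other feature types, such as tRNA and rRNA). The feature type is in the third column. Prokka labels protein-coding rows as CDS by default; gene rows are only present if Prokka was run with --addgenes.

awk -F'\t' '$3=="CDS"{print $9}' your_file.gff | sed 's/;/\t/g' > new.txt

Here:

-F'\t' : splits each row on tabs, so that spaces inside product names do not cut the ninth column short
$3=="CDS" : keeps only the rows that have CDS in the third column
print $9 : prints the ninth column (the attributes)
sed 's/;/\t/g' : replaces every ; with a tab (\t, as understood by GNU sed); you can also use , for a CSV file.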
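
To run this on all the files at once and get one combined CSV, something like the following might work. It is only a sketch under a few assumptions: the function name prokka_gff_to_csv is made up for illustration, the rows of interest are taken to be the CDS rows, and the UniProt accession is assumed to sit inside the inference attribute as UniProtKB:<accession>. Check a few rows of your own files before relying on it. Fields are quoted because product names can contain commas.

```shell
#!/bin/sh
# Hypothetical helper: print one CSV row per CDS feature for every GFF file given.
prokka_gff_to_csv() {
  awk -F'\t' '
    # Quote a value so that commas inside product names stay in one CSV field.
    function q(s) { gsub(/"/, "\"\"", s); return "\"" s "\"" }
    BEGIN { print "file,ID,gene,uniprot,product" }
    FNR == 1 { in_fasta = 0 }       # first line of a new file: reset the flag
    /^##FASTA/ { in_fasta = 1 }     # sequence section follows this line
    in_fasta || /^#/ || $3 != "CDS" { next }
    {
      id = gene = product = uniprot = ""
      n = split($9, attrs, ";")     # attributes look like key=value;key=value
      for (i = 1; i <= n; i++) {
        eq = index(attrs[i], "=")
        key = substr(attrs[i], 1, eq - 1)
        val = substr(attrs[i], eq + 1)
        if (key == "ID") id = val
        else if (key == "gene") gene = val
        else if (key == "product") product = val
        # Assumption: the accession appears as UniProtKB:XXXX in inference.
        else if (key == "inference" && match(val, /UniProtKB:[^,]+/))
          uniprot = substr(val, RSTART + 10, RLENGTH - 10)
      }
      print q(FILENAME) "," q(id) "," q(gene) "," q(uniprot) "," q(product)
    }
  ' "$@"
}
```

You would call it as prokka_gff_to_csv *.gff > all_annotations.csv. The first column records which file each row came from, so there is no need to write one CSV per file and bind them afterwards. Rows without a gene name or a UniProt match simply get empty fields.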

ADD REPLY • link 2.5 years ago by kashiff007 ★ 1.9k