Question

Extracting subset from GFF3 file based on sequence headers from a FASTA file

1

Entering edit mode

5.2 years ago

adityabandla ▴ 30

I have a FASTA file of gene's like this

>CS1RZP_12510_67 # 64357 # 65316 # -1 # ID=1311_67;partial=00;start_type=ATG;rbs_motif=3Base/5BMM;rbs_spacer=13-15bp;gc_cont=0.675
ATGGCCGTCGTCACCATGAAGCAGATGCTTGATTCCGGCGTGCATTTCGG
GCATCAGACCCGCCGCTGGAACCCGAAGATGAAGCGCTACATCCTGACCG
AGCGCAACGGGATCTACATCATCGACCTGCGGCAGACGCTCACCTACGTC

I would like to extract the corresponding lines for these specific genes from a GFF3 file which contains info on other genes as well which aren't in the above FASTA.

CS1RZP_12510    Prodigal_v2.6.3 CDS 3   617 134.5   -   0   ID=1311_67;partial=00;start_type=ATG;rbs_motif=3Base/5BMM;rbs_spacer=13-15bp;gc_cont=0.675;conf=100.00;score=133.83;cscore=137.59;sscore=-3.75;rscore=-3.31;uscore=-4.34;tscore=4.54;
CS1RZPR4D_68    Prodigal_v2.6.3 CDS 1000    1218    56.4    +   0   ID=1_2;partial=01;start_type=ATG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.612;conf=100.00;score=56.40;cscore=44.46;sscore=11.94;rscore=7.30;uscore=0.10;tscore=4.54;

I am looking for an AWK solution. Grep using -wf is extremely slow

AWK FASTA GFF3 • 2.3k views

ADD COMMENT • link updated 5.2 years ago by lieven.sterck 15k • written 5.2 years ago by adityabandla ▴ 30

0

Entering edit mode

can you give an example how you are going about it when using grep ?

the IDs in your fasta file seems to be a combination of the geneID in the gff and the contig they are on, correct?

Can we assume that the "# 64357 # 65316 # -1 # ID=1311_67;partial=00;start_type=ATG;rbs_motif " part is present in all fasta headers?

ADD REPLY • link 5.2 years ago by lieven.sterck 15k

0

Entering edit mode

Can we assume that the "# 64357 # 65316 # -1 # ID=1311_67;partial=00;start_type=ATG;rbs_motif " part is present in all fasta headers? Yes, that's right, these are the nucleotide files from prodigal using the -d option

grep -oP '(?<=ID=).*?(?=;)' sample.fa | grep -wf - sample.gff

ADD REPLY • link 5.2 years ago by adityabandla ▴ 30

0

Entering edit mode

Does it have to be in a single cmdline?

first creating the list of IDs you need and then query them from your gff will be more efficient I think

ADD REPLY • link 5.2 years ago by lieven.sterck 15k

0

Entering edit mode

Ideally, looking for a one-liner as this will go into a bigger script. Looking to avoid creating temp files or variables as much as possible

ADD REPLY • link 5.2 years ago by adityabandla ▴ 30

0

Entering edit mode

fair enough, but not making it any easier ;-)

ADD REPLY • link 5.2 years ago by lieven.sterck 15k

score 2 · Accepted Answer · 2020-05-11

2

Entering edit mode

5.2 years ago

adityabandla ▴ 30

Figured it out with a little help from a friend, if anyone finds this useful

awk '/^>/,NR==FNR{split(substr($1,2),a,"_");split($9,b,";");HEADER[a[1]"_"a[2]]=b[1]} \
{split($9,c,";");if($1 in HEADER && c[1] == HEADER[$1]) print $0}' sample.fa sample.gff

ADD COMMENT • link 5.2 years ago by adityabandla ▴ 30