Question

How can I download sequences from the CAZy database?

3.9 years ago

claudia.d • 0

I'm trying to download all GH29 sequences from the CAZy database. It was easy to do manually for the archaea (just 41 sequences), but there are more than 4k bacterial ones. How can I do that? My goal is to get all the sequences, build a tree, and study the gene annotation. I also read about dbCAN2, but I'm not sure I fully understood how it works. Can anyone help me?

sequence • 1.3k views

score 0 · Answer 1 · 2020-05-19

Download the CAZyDB FASTA file from dbCAN2 (the commands below use CAZyDB.07312019.fa). The link was provided in an answer to an earlier question: Download CAZy database

Once you have downloaded the file, pull out the sequences for the GH29 family with the following command (FASTA linearization code by @Pierre). The awk step puts each record on a single line (header, tab, sequence), grep keeps the lines that mention GH29, and tr restores the header/sequence layout:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < CAZyDB.07312019.fa | grep GH29 | tr "\t" "\n" > GH29_seq.fa
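A quick sanity check after the extraction is to count the header lines in the output and compare the number with the family size you expect (the question mentions more than 4k bacterial entries; the dbCAN2 snapshot may differ from the current CAZy site). A minimal sketch, where count_records is a hypothetical helper name:

```shell
# Sketch: count the sequences in a FASTA file.
# count_records is a hypothetical helper name, not part of dbCAN2 or CAZy.
count_records() {
    # $1 = FASTA file; every record starts with exactly one ">" header line
    grep -c '^>' "$1"
}

# Example (run it once GH29_seq.fa exists):
# count_records GH29_seq.fa
```

If the count looks far too high or too low, inspect a few headers with grep '^>' GH29_seq.fa | head before building the tree.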

If you want the sequences folded at 60 characters per line (note that fold also wraps any header line longer than 60 characters):

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < CAZyDB.07312019.fa | grep GH29 | tr "\t" "\n" | fold -w 60 > GH29_seq.fa
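Because grep matches GH29 anywhere on the line, a record whose accession happens to contain that string (a made-up ID such as AGH29001.1) would be kept as well. The sketch below is a stricter variant. It assumes the headers are pipe-delimited (">ACCESSION|FAM1|FAM2|...") and that subfamilies are written like GH29_1; both are assumptions to check against your copy of the file. extract_family is a hypothetical helper name. It wraps only the sequence lines, so headers stay intact.

```shell
# Sketch: keep records whose header lists the family as a whole field.
# Assumed header layout: ">ACCESSION|FAM1|FAM2|..." (verify on your file).
# extract_family is a hypothetical helper, not part of dbCAN2.
extract_family() {
    # $1 = family (e.g. GH29), $2 = input FASTA, $3 = line width (default 60)
    awk -v fam="$1" -v width="${3:-60}" '
        # Print the buffered record if its header matched the family.
        function flush_record(   i) {
            if (keep && header != "") {
                print header
                for (i = 1; i <= length(seq); i += width)
                    print substr(seq, i, width)
            }
        }
        /^>/ {
            flush_record()
            header = $0; seq = ""; keep = 0
            n = split(substr($0, 2), field, "|")
            # Start at field 2 so the accession itself is never matched.
            for (i = 2; i <= n; i++)
                if (field[i] == fam || index(field[i], fam "_") == 1)
                    keep = 1
            next
        }
        { seq = seq $0 }    # join wrapped sequence lines
        END { flush_record() }
    ' "$2"
}

# Example:
# extract_family GH29 CAZyDB.07312019.fa > GH29_seq.fa
```

Matching on whole fields also keeps a family such as GH29 from matching any longer family name that merely starts with the same characters.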