Extract Domain Sequences From Multiple Sequences
10.5 years ago
Palu ▴ 290

Hi, I have 100 protein sequences with some conserved domains. I want to extract the domain sequences in one go. Is that possible? CDD gives the boundaries of the domains but not the domain sequences themselves. I am a Windows user.

What do you have as input? Sequences (FASTA or which other format) or a list of accession numbers (Uniprot or which other database)?

Also: do you want the consensus sequence of the conserved domain or the one in your sequences?

Are you and @Moon from Finding The Sequence Of A Domain working on the same assignment?

No, we are not working on the same project :).

OK, Thanks! I will trust you on this. By the way, welcome to Biostars.org!

10.5 years ago
Rm 8.1k

If you know the domain boundary coordinates, then it is very simple using a multi-sequence FASTA file as input.

1. Format your FASTA file with BLAST's "formatdb".
2. Use fastacmd with -s (sequence name) and -L (start,end):

Example: fastacmd -d refseq_protein -s NP_112245 -L 100,160

To extract many regions at once, prepare an input file "list_file" with three columns ("seq_id", "start", "end") and run:

   awk '{system("fastacmd -d input_fasta.fa -s " $1 " -L " $2 "," $3);}' list_file
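For readers on Windows without awk or the legacy BLAST tools, the same extraction can be sketched in plain Python. This is only a minimal illustration, not the method from the answer above: the function names (`read_fasta`, `read_regions`, `extract_domains`) and the demo sequences are made up for the example, and it assumes coordinates are 1-based and inclusive, as with `fastacmd -L`.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {seq_id: sequence}; seq_id is the first word after '>'."""
    parts: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            parts[current] = []
        elif current is not None:
            parts[current].append(line)
    return {seq_id: "".join(chunks) for seq_id, chunks in parts.items()}


def read_regions(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Parse whitespace-separated 'seq_id start end' lines; skip headers and malformed lines."""
    regions = []
    for raw in lines:
        fields = raw.split()
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


def extract_domains(
    sequences: Dict[str, str], regions: Iterable[Tuple[str, int, int]]
) -> List[Tuple[str, str]]:
    """Return (header, subsequence) pairs; start and end are 1-based and inclusive."""
    extracted = []
    for seq_id, start, end in regions:
        seq = sequences[seq_id]
        if not 1 <= start <= end <= len(seq):
            raise ValueError(f"{seq_id}: region {start}-{end} is outside 1-{len(seq)}")
        extracted.append((f"{seq_id}/{start}-{end}", seq[start - 1:end]))
    return extracted


# Made-up demo data standing in for input_fasta.fa and list_file.
fasta_text = ">seq1 example protein\nMKTAYIAKQR\nQISFVKSHFS\n>seq2\nGSHMLEDPVA\n"
list_text = "seq_id start end\nseq1 3 8\nseq2 1 4\n"

sequences = read_fasta(fasta_text.splitlines())
regions = read_regions(list_text.splitlines())
for header, domain in extract_domains(sequences, regions):
    print(f">{header}\n{domain}")
```

To run it on real data, the two demo strings could be replaced by the lines of the actual files, for example `open("input_fasta.fa")` and `open("list_file")`.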


10.5 years ago

Have you tried the Batch CDD search option?

To expand on that: if you want the exact hit positions, use the rpsblast command-line tool.
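As a rough sketch of how the hit positions could be collected: if rpsblast from BLAST+ is run with tabular output (`-outfmt 6`), for example `rpsblast -query input_fasta.fa -db Cdd -outfmt 6 -out hits.tsv`, each line carries the query ID in column 1 and the query start and end in columns 7 and 8. The database name `Cdd`, the output file name, the function name `hits_to_regions`, and the E-value cutoff are all assumptions made for illustration, and the hit lines below are invented, not real results.

```python
from typing import Iterable, List, Tuple


def hits_to_regions(
    tabular_lines: Iterable[str], max_evalue: float = 0.01
) -> List[Tuple[str, int, int, str]]:
    """Turn BLAST tabular (-outfmt 6) lines into (query_id, start, end, domain_id) tuples.

    Default outfmt 6 columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    regions = []
    for raw in tabular_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column hit line
        query_id, domain_id = fields[0], fields[1]
        q_start, q_end = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        if evalue <= max_evalue:
            regions.append((query_id, min(q_start, q_end), max(q_start, q_end), domain_id))
    return regions


# Invented hit lines for illustration only (the domain IDs are placeholders).
demo_hits = [
    "seq1\tCDD:000001\t45.0\t60\t30\t1\t100\t160\t1\t58\t2e-20\t85.1",
    "seq2\tCDD:000002\t30.0\t40\t25\t2\t12\t50\t3\t41\t5.0\t20.3",
]

for query_id, start, end, domain_id in hits_to_regions(demo_hits):
    # Print in the three-column "seq_id start end" layout, with the domain ID appended.
    print(query_id, start, end, domain_id)
```

The resulting seq_id, start, and end values are the three columns needed to cut the domain sequences out of the FASTA file.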

10.5 years ago

Actually, I have a problem with the R script. Do you know of any Perl solution for that?

No, but it's very easy to install R (http://cran.cnr.berkeley.edu). You will also like R's IDE (http://rstudio.org). Both are available for Linux, Mac, and Windows.