Searching For Specific Protein Domains In RNA-Seq
10.8 years ago


I'm trying to find specific protein domains in RNA-Seq contigs to identify previously unknown isotypes. The most promising tool for this task seems to be rpsblast.

So my questions are:

  1. Do you know of an already established pipeline for this task?
  2. If not, my idea is to reduce the amount of sequence data by first running a BLAST search against known homologues with loose settings, and then running CD-Search on the remaining contigs. Is that a good approach, or are there alternatives?
  3. Are there any (automatic) pre- or postprocessing steps you would recommend?
  4. Are there any ideas on how to extend contigs with broken domains? BLASTing against the reads would be an option, but maybe there is a tool available.

Can you give details on approximately how many contigs you have, whether you have a reference sequence, and what kind of protein domains you are looking for? That might help to give a better answer.

10.7 years ago

You are looking for protein domains, so this sounds like a job for a PFAM search or, better still, InterProScan.

I assume you have no reference sequence. RNA-seq reads should be assembled into contigs to reduce the query size. The programs need protein sequences, so you have to translate your sequences in all six reading frames.

You can use transeq from the EMBOSS suite for the translation. Both search tools are available as web services and for local installation.
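
To make the translation step concrete, here is a minimal pure-Python sketch of a six-frame translation using the standard genetic code. It is not a replacement for transeq; the function names and the frame labels (`+1` to `-3`) are my own and may not match transeq's frame numbering, and any codon containing an ambiguous base is simply reported as `X`.

```python
# Standard genetic code (NCBI translation table 1), built from the
# conventional TCAG ordering of the three codon positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq, strand_specific=False):
    """Return a dict mapping a frame label to its protein translation.

    With strand_specific=True only the three forward frames are produced
    (an assumption: the contig is already in sense orientation).
    """
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
    if not strand_specific:
        rc = reverse_complement(seq)
        for offset in range(3):
            frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

Each translated frame can then be written out as a protein FASTA record and passed to the domain search.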

Some more or less vague ideas to help reduce compute load:

  • Align the reads to a reference of a closely related organism if you don't have a reference genome
  • BLAST against EST databases
  • If you get a good hit by this method, you can remove those contigs from further analysis
  • Apply RepeatMasker and dust (this has most likely been done already)
  • If you have longer contigs, you can probably restrict the search to the ORFs instead of a full 6-frame translation
  • Restrict the InterPro search to those tools and models that are relevant
  • If your reads are strand-specific, you could maybe go with the 3 sense reading frames instead of all 6
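
As a rough illustration of the ORF idea from the list above, the following sketch takes one already translated reading frame (with `*` marking stop codons) and keeps only the stop-to-stop stretches above a minimum length. The function name and the default cutoff of 30 residues are arbitrary assumptions on my part; pick a cutoff that suits the shortest domain you care about.

```python
def stop_to_stop_orfs(protein, min_length=30):
    """Split a translated frame at stop codons ('*') and keep long segments.

    Returns a list of (start, segment) tuples, where start is the 0-based
    amino-acid position of the segment within the translated frame.
    min_length is a hypothetical default, not a recommended value.
    """
    orfs = []
    start = 0
    for segment in protein.split("*"):
        if len(segment) >= min_length:
            orfs.append((start, segment))
        # Skip past this segment and the stop codon that ended it.
        start += len(segment) + 1
    return orfs


if __name__ == "__main__":
    frame = "MKT*AAAAAA*GG"
    for start, segment in stop_to_stop_orfs(frame, min_length=3):
        print(start, segment)
```

Searching only these segments instead of every full frame should cut the query size noticeably for long contigs, at the cost of possibly losing domains that are split by a frameshift.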

Thank you for your answer. As mentioned, I already have the contigs. My plan is to use a conserved domain search (against CDD, PFAM, SMART, or any of the 10 other databases). InterProScan seems to be a good alternative for a small number of potentially matching contigs, but is hardly applicable in my case.

Hi Michael, I don't completely understand why not. Maybe you have a large number of contigs? If the web server has restrictions, you can still install the tool locally; the download link is on the same site. Of course, this will require in-house compute resources of some sort. You can start with a fraction of your data to estimate how long it will take. Using a BLAST pre-filter to sort out the low-hanging fruit is possibly a good idea.
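
A minimal sketch of such a pre-filter, assuming the BLAST search was written in the standard 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function names and both thresholds are hypothetical and need tuning for your data.

```python
def contigs_with_good_hits(blast_lines, max_evalue=1e-10, min_identity=90.0):
    """Collect query IDs that have at least one hit passing both thresholds.

    blast_lines: iterable of tab-separated lines in BLAST tabular format.
    The default thresholds are illustrative assumptions only.
    """
    good = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid = fields[0]
        pident = float(fields[2])
        evalue = float(fields[10])
        if evalue <= max_evalue and pident >= min_identity:
            good.add(qseqid)
    return good


def split_contigs(all_ids, blast_lines, **thresholds):
    """Partition contig IDs into (already explained, still to be searched)."""
    good = contigs_with_good_hits(blast_lines, **thresholds)
    explained = [c for c in all_ids if c in good]
    remaining = [c for c in all_ids if c not in good]
    return explained, remaining


if __name__ == "__main__":
    hits = [
        "c1\tsubjA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-50\t350",
        "c2\tsubjB\t70.0\t50\t15\t0\t1\t50\t1\t50\t0.01\t30",
    ]
    explained, remaining = split_contigs(["c1", "c2", "c3"], hits)
    print("explained:", explained)
    print("remaining:", remaining)
```

Only the remaining contigs would then go into the slower domain search.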

Good point, I thought it was a web-only application. I'll look into it.


