Question

Three questions about an RNA-seq and protein domains data analysis.

7.5 years ago

utsafar ▴ 80

I am working on saffron, and my goal is to find candidate resistance genes. Since the saffron genome has not been sequenced, I used RNA-seq data instead. I assembled the reads de novo with Trinity in Galaxy. Then, again in Galaxy, I ran tblastn with an E-value of 0.00000000001 and a minimum query coverage per HSP of 70%, and found contigs similar to 112 reference plant resistance proteins.

First question: What do you think of my approach? Do you have better ideas for finding these genes in saffron RNA-seq data?

I extracted the longest ORFs from the hit contigs and used Pfam to compare the domains in those ORFs with the domains in the reference resistance genes. Some ORFs have more domains than their most similar reference genes.

Second question: How can I search for domains in all 700 ORFs and 112 genes in one step rather than one by one?

Third question: How can I be confident in my annotations when some ORFs have additional domains that the similar reference proteins lack?

Thank you all.

RNA-Seq protein-domains Plant-Resistance-Genes • 1.9k views

updated 7.5 years ago by cschu181 ★ 2.8k • written 7.5 years ago by utsafar ▴ 80

score 1 · Answer 1 · 2016-11-06

1.) I think your approach is valid, but I would try the following to possibly improve the outcome. Translate your extracted ORFs into proteins and run blastp against the plant resistance proteins. In addition, scan your sequences with pfam_scan.pl or hmmscan (HMMER software) using Pfam-A as the database, then check the output for NB-ARC, LRR, and TIR/CC domains (you might have to check the exact spelling in the output). You can also run NLR-Parser (Steuernagel et al., 2015, Bioinformatics (http://bioinformatics.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=25586514)).
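To make the "check the output" step concrete, here is a minimal Python sketch that reads an hmmscan per-domain table (the file written by `--domtblout`) and keeps the sequences carrying at least one domain whose name starts with NB-ARC, LRR, or TIR. The name prefixes, the i-Evalue cutoff of 1e-5, and the function names are my assumptions for illustration, so compare them with the names that actually appear in your own output before relying on them.

```python
from collections import defaultdict

# Assumed Pfam name prefixes for resistance-gene domains; verify the exact
# names in your own hmmscan output before trusting this filter.
R_GENE_PREFIXES = ("NB-ARC", "LRR", "TIR")


def parse_domtblout(lines, max_ievalue=1e-5):
    """Parse hmmscan --domtblout lines.

    Returns {sequence_id: [(ali_from, domain_name), ...]} with each list
    sorted by start position on the sequence. max_ievalue is an assumed
    cutoff on the per-domain independent E-value.
    """
    hits = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        fields = line.split()
        if len(fields) < 22:
            continue  # not a complete per-domain row
        domain = fields[0]          # target name (the Pfam model for hmmscan)
        seq_id = fields[3]          # query name (your ORF or reference protein)
        ievalue = float(fields[12])  # independent E-value of this domain
        ali_from = int(fields[17])   # alignment start on the sequence
        if ievalue <= max_ievalue:
            hits[seq_id].append((ali_from, domain))
    return {seq_id: sorted(doms) for seq_id, doms in hits.items()}


def r_gene_candidates(hits):
    """Keep sequences with at least one domain matching R_GENE_PREFIXES.

    Returns {sequence_id: [domain names in N- to C-terminal order]}.
    """
    candidates = {}
    for seq_id, doms in hits.items():
        names = [name for _, name in doms]
        if any(name.startswith(R_GENE_PREFIXES) for name in names):
            candidates[seq_id] = names
    return candidates
```

You would produce the table with something like `hmmscan --domtblout domains.tbl Pfam-A.hmm proteins.faa` (file names are placeholders, and Pfam-A.hmm has to be prepared with hmmpress first), then call `r_gene_candidates(parse_domtblout(open("domains.tbl")))`.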

2.) Merge your sequence files (ORFs and gene sequences) into a single file, or use Galaxy's multiple input files option (most tools should have it).
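If you merge outside Galaxy, a plain `cat` of the FASTA files is enough, but tagging each header with its source makes it easier to tell ORFs from reference proteins in the domain output later. A rough Python sketch of that idea follows; the labels `orf` and `ref`, the `|` separator, and the function name are hypothetical choices of mine, not anything Galaxy or Pfam requires.

```python
def merge_fasta(sources):
    """Merge several FASTA inputs into one stream of lines.

    sources maps a label (e.g. "orf", "ref") to an iterable of FASTA lines,
    such as an open file handle. Each header is rewritten as ">label|old_id"
    so the origin of every sequence stays visible downstream.
    """
    for label, lines in sources.items():
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # drop empty lines between records
            if line.startswith(">"):
                yield f">{label}|{line[1:]}"
            else:
                yield line
```

For example, with hypothetical file names: `open("all.faa", "w").write("\n".join(merge_fasta({"orf": open("orfs.faa"), "ref": open("refs.faa")})) + "\n")`, then scan `all.faa` once.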

3.) If you find additional domains, especially at the 3' end (C-terminal side) of the (CC|TIR)_NB-ARC_LRR domain group, this could mean that a domain-fusion event is present (see also Sarris et al., 2016, BMC Biology (https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-016-0228-7)). In addition, it is theoretically possible that some other domain is present in place of the CC|TIR domain (I don't have a reference for this right now).
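One way to triage these cases is to list, for each ORF, the domains its matched reference lacks and where they sit relative to the core resistance domains. The Python sketch below does that for ordered lists of domain names; the core prefixes and the function name are assumptions of mine, and a flagged domain is only a hint to inspect the ORF by hand (it could equally be a misassembled or chimeric contig), not proof of a fusion.

```python
# Assumed name prefixes for the core resistance-gene domains; adjust them to
# the names that appear in your own Pfam output.
CORE_PREFIXES = ("NB-ARC", "LRR", "TIR")


def extra_domains(orf_domains, ref_domains):
    """List domains found in the ORF but not in its reference protein.

    Both arguments are domain names in N- to C-terminal order. Returns
    (domain, location) tuples, where location describes the position of
    the extra domain relative to the core domains of the ORF.
    """
    ref_set = set(ref_domains)
    core_idx = [i for i, name in enumerate(orf_domains)
                if name.startswith(CORE_PREFIXES)]
    extras = []
    for i, name in enumerate(orf_domains):
        if name in ref_set:
            continue  # the reference has this domain too
        if not core_idx:
            where = "no core"
        elif i < core_idx[0]:
            where = "before core"
        elif i > core_idx[-1]:
            where = "after core"
        else:
            where = "within core"
        extras.append((name, where))
    return extras
```

For instance, an ORF with TIR, NB-ARC, LRR_8, WRKY compared against a reference with TIR, NB-ARC, LRR_8 would report WRKY as lying after the core, which is the arrangement worth checking against the fusion cases described in the paper above.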