Dear expert BIOSTARS friends, Hi (I am not native in English/be ready for mistakes)
I have done a de novo RNA-seq on a non-model fish gonad (in both sexes)
Then I have performed de novo assembly (Trinity) and DEG analysis and now I have the up-regulated genes in males and in-females.
Now, I want to find out that these genes are representative of which pathways in each sexes ? other to say :
may DE Genes ---> GO term -----> KEGG -------> (and may be some) statistically significantly enriched pathways | strategy ?
but the problem is this that this fish is a non-model animal and I can not use the genes of it as "background" for example in KOBAS (maybe I can use Zebrafish data?)
and the second problem is this that when I perform blastx against NCBI nr I gain the gene ID as below that it seems it is useless in most pathway recognition platforms (same as DAVID), :
the blast output:
one of my DE transcript name ................................................... the blastx result gene/protein
TRINITY_GG_4_c0_g1_i1 -------------------------------------------------------- gi|1020474214|ref|XP_016142334.1|
Thank you in advance!
Dear Carlo Yague, first I must appreciate your answer,
Secondly, about "first blast all the genes against the Zebrafish and create a table of homology"
I think you mean that I must download the zebrafih genes (from where ?) and make it a blastable database and used my
600000 de novo assembled transcripts of my non-model species as a query and run the blastn (is it right to use blastn?)
And dou you have any prefered software for GO, KEGG and Pathway enrichment in this this case ?
Thank you again and long to hear from you!
Yes, this is what I mean. You can download zebrafish fasta files on Ensembl for instance. Then you can use blastn but it might be better to use blastx since conservation is usually stronger at the protein level. For GO analysis, I like amigo2. There are other interesting suggestions in this recent thread.
Thanks. I really appreciate the time you have spent.
Which files do you suggest from the Zebrafish section in the Ensembl website you have provided for bastn and blastX ?
below is the files that are available in FASTA format for zebrafish in Ensembl:
DNA (FASTA) / cDNA (FASTA) / CDS (FASTA) / ncRNA (FASTA) / Protein sequence (FASTA)
I think it must be a protein database for blastX, right ?
And do you have any preferred blast e-value for these relatively small datasets (zebrafish proteins and my transcriptomes) ?
I think so too.
You could rather use the e-value because it takes into account the size of the dataset. An e-value threshold of 5% seems like a good starting point to me.
Sorry I mean e-value and I wrote p-value mistakenly.
So I am going to download "Protein sequence (FASTA)" from Ensemble and then make it blastable database and
then blastX my assmbly.fasta as query against it.
1- is it correct ?
2- does the ID that I gain from the blast results is acceptable for software such as KOBAS ?
3- in this manner that you have learned me, which one I must use as "background :
the Zebrafish Protein sequence
or Zebrafish DNA ?
Have you thought about how you are going to reconcile the ~600K results? You not only need to worry about the e-value but also the coverage (how long are the HSP's) in the hits. Ideally you want fuller length the better.
In a way it is good that you have gotten around to doing this since it will tell you how many unique entities there are out of those 600K.
Hi Dear genomax2,
Yes you are right but I intend to find the enriched Go terms and Pathway in my ~300 DEG, first.
As an expert are you agree with this strategy we have spoken above? blasting the DEGs sequence against Zebrafish Ensemble protein instead of NCBI nr ?
The problem is this that the Gene/Protein ID which NCBI nr blastX is provided (TRINITY_GG_4_c0_g1_i1 ----- gi|1020474214|ref|XP_016142334.1|) for my de novo transcripts is not accepted in DAVID and KOBAS,
And I do not know how to take my steps after the blasting of my DEGs to rich to enriched pathways.
I will appreciate any suggestions!
I see.
How many hits did you collect from your nr blastx? Perhaps there were other hits that had real gene names you could use.
If you are only doing 300 you could do them in batches at NCBI. Choose to limit your search to Danio rerio using "Organism" box.
Doing only the 300 DEG is ok if you just want to have an idea of the processes affected. But IMO, for genuine GO enrichment analysis, you need to do them all in order to have the right background set.
I agree. How many of those 600K are mappable/real is something I have tried to get @Farbod to think about/work on across many threads :-)
Dear @Carlo and @Genomax, Hi
I have run blastX of my whole transcripts against Zefrafish Ensembl protein database with the evalue = 1e-6 and outfmt=6.
There is about 129337 hit for my dataset.
the head of results is same as below :
TRINITY_DN212724_c0_g1_i1 ... ENSDARP00000061738 ... 6e-66 209
TRINITY_DN212791_c0_g1_i1 ... ENSDARP00000015862 ... 4e-59 189
TRINITY_DN212791_c0_g2_i1 ... ENSDARP00000015862 ... 6e-65 205
TRINITY_DN212749_c0_g2_i1 .... ENSDARP00000137663 ... 3e-12 64.7
TRINITY_DN211975_c0_g1_i1 ... ENSDARP00000029733 ... 1e-34 131
So, I think these whole 129337 Ensembl IDs is my "background" and WHAT IS THE NEXT STEP ?
Do I must use the DEGs that have Zebrafish Ensemble protein hits as query and use some tools same as KOBAS ?
(may be any other strategy/tools that you prefer?)
You should make a non-redundant set of those 129337 ID's which should be easy enough to do with
sort | uniq
command in unix.On a different note you can see that (since I don't see the complete ID's I will use this as an example)
These two different transcripts seem to hit the same gene so they are subsets of each other and should be condensed into one (if possible). This is what I have meant all along that your 600K set may boil down to 1/10th the size if you do this exercise.
Dear genomax,
Thanks for all your helps
The "600K set may boil down to 1/10th the size" could be done for ~100,000 transcripts that **have hits
with zebrafish proteins**, but How can I try the same practice for ~500,000 other transcripts ?
Interesting.
If ~500K transcripts have no hits to zebrafish proteins then many of those may be artifacts of assembly (seems unlikely that your fish has a vastly different/large array of transcripts/genes than zebrafish).
I recall you saying that you had already tried CD-HIT-EST analysis on this set elsewhere. Is that correct? If not, that would be one thing to try on all 600K.
Hi, and yes,
I have used CD-hit but it reduced only ~100,000 of those transcripts if i remember correctly. So if
about 100,000 of them have hit with Zebrafish proteins, I will have about 300,000 more
transcripts. If I use NCBI nr for annotating them, then I can not use them in any KEGG or
Pathway analysis.
The matter of using Zebrafish database is this that this fish is same as mouse among the fishes and it is very well annotated.
I even think about using other fishes and even human protein database from ensemble and run a blastX against them. but for example the Gar fish has some thousand hits more than zebrafish and human has some hundred less than zebrafish.
So, what I must do with these ~300.000 !
How many of your DEG's are in the "unknown" pool at the moment? You could put some additional effort towards annotating that set first.
If you are truly interested in annotating the rest of "unknowns" then it would take several rounds of careful blasting/domain searching/alignments etc to make sense of them. This would be much more a manual intensive, than an automated process. If you have the time/energy to do it then by all means.
They should also be verified in independent samples (by RT-PCR etc) that they are real and present.
Yes, you should try something like that :)
Hi,
For sites like David, the Ensemble Gene ID but not Protein IDs is acceptable, So I have used BioDBnet to
convert my Ensebmle Protein IDs to Ensenml Gene IDs but it seems that there is a limitation for the number on entry for this converting process.
How I can convert my ~100,000 IDs in BioDBnet ?
Thank you in advance
I don't know BioDBnet but you can do that easily in Ensembl biomart. Select dataset danio rerio genes, then go to filter -> Input external references ID list -> Ensembl protein ID(s), then to results.
But the number of genes might be a limitation here too so you can try to install biomart-R and do the querry locally.
Dear Carlo, Hi.
That works!
Now I have about 22178 unique Ensemble Gene Ids that are gained from converting Ensembl Pr IDs using Ensebml biomart and I have used them in Gorilla as "background list".
Then I will find the Ensemble Gene IDs of my DEGs transcripts and use them as "target list" to check for any pathway enrichment.
You can get this file from ensembl and parse out the gene ID's that correspond to your ENSDA#.
Dear genomax2, Hi
Now I have my background list (about 20,000 unique Ensembl protein/gene IDs that has resulted from blastx-ing my all de dovo transcripts against Ensembl zebrafish protein),
but I do not know how I must use my "target list" for pathway enrichment purposes:
Imagine I have about 60 Ensemble unique proteins that are upregulated in female samples and
about 40 Ensemble unique proteins that are upregulated in male samples (these are my DEG
transcriptomes that I have discovere using edgeR/DESeq2 from my de novo assembly and then I
have blastx-ed them against zebrafish Ensembl protein database).
.
Now which one I must use as target list in GOrilla or David ?
1- both males and females DEGs (100 IDs) ? (if yes, how I can conclude about the male and female separately?)
2- one time for males (40 IDs) and one time for females (60 IDs) ? (if No. 2 is correct) How I could merge the results of two pathway or GO enrichment ?
Thank you for all your supports