Forum: From DEGs of a non-model animals to statistically significantly enriched pathways
1
gravatar for Farbod
2.8 years ago by
Farbod3.3k
Toronto
Farbod3.3k wrote:

Dear expert BIOSTARS friends, Hi (I am not native in English/be ready for mistakes)

I have done a de novo RNA-seq on a non-model fish gonad (in both sexes)

Then I have performed de novo assembly (Trinity) and DEG analysis and now I have the up-regulated genes in males and in-females.

Now, I want to find out that these genes are representative of which pathways in each sexes ? other to say :

may DE Genes ---> GO term -----> KEGG -------> (and may be some) statistically significantly enriched pathways | strategy ?

but the problem is this that this fish is a non-model animal and I can not use the genes of it as "background" for example in KOBAS (maybe I can use Zebrafish data?)

and the second problem is this that when I perform blastx against NCBI nr I gain the gene ID as below that it seems it is useless in most pathway recognition platforms (same as DAVID), :

the blast output:

one of my DE transcript name ................................................... the blastx result gene/protein

TRINITY_GG_4_c0_g1_i1 -------------------------------------------------------- gi|1020474214|ref|XP_016142334.1|

Thank you in advance!

ADD COMMENTlink modified 2.8 years ago by Carlo Yague4.5k • written 2.8 years ago by Farbod3.3k
1
gravatar for Carlo Yague
2.8 years ago by
Carlo Yague4.5k
Belgium
Carlo Yague4.5k wrote:

but the problem is this that this fish is a non-model animal and I can not use the genes of it as "background" for example in KOBAS (maybe I can use Zebrafish data?)

I think you should first blast all the genes against the Zebrafish and create a table of homology between the two species. Only the Zebrafish genes that have an homolog in your fish should be considered as background.

and the second problem is this that when I perform blastx against NCBI nr I gain the gene ID as below that it seems it is useless in most pathway recognition platforms (same as DAVID).

First you probably should blast on the zebrafish (taxid:7955) and not nr. The ID in the output is not a gene ID but a protein ID... I'm not sure on how to convert it however.

ADD COMMENTlink written 2.8 years ago by Carlo Yague4.5k
1

Dear Carlo Yague, first I must appreciate your answer,

Secondly, about "first blast all the genes against the Zebrafish and create a table of homology"

I think you mean that I must download the zebrafih genes (from where ?) and make it a blastable database and used my

600000 de novo assembled transcripts of my non-model species as a query and run the blastn (is it right to use blastn?)

And dou you have any prefered software for GO, KEGG and Pathway enrichment in this this case ?

Thank you again and long to hear from you!

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Farbod3.3k
1

Yes, this is what I mean. You can download zebrafish fasta files on Ensembl for instance. Then you can use blastn but it might be better to use blastx since conservation is usually stronger at the protein level. For GO analysis, I like amigo2. There are other interesting suggestions in this recent thread.

ADD REPLYlink written 2.8 years ago by Carlo Yague4.5k
1

Thanks. I really appreciate the time you have spent.

Which files do you suggest from the Zebrafish section in the Ensembl website you have provided for bastn and blastX ?

below is the files that are available in FASTA format for zebrafish in Ensembl:

DNA (FASTA) / cDNA (FASTA) / CDS (FASTA) / ncRNA (FASTA) / Protein sequence (FASTA)

I think it must be a protein database for blastX, right ?

And do you have any preferred blast e-value for these relatively small datasets (zebrafish proteins and my transcriptomes) ?

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Farbod3.3k

I think it must be a protein database for blastX, right ?

I think so too.

And do you have any preferred blast p-value for these relatively small datasets (zebrafish proteins and my transcriptomes) ?

You could rather use the e-value because it takes into account the size of the dataset. An e-value threshold of 5% seems like a good starting point to me.

ADD REPLYlink written 2.8 years ago by Carlo Yague4.5k
1

Sorry I mean e-value and I wrote p-value mistakenly.

So I am going to download "Protein sequence (FASTA)" from Ensemble and then make it blastable database and

then blastX my assmbly.fasta as query against it.

1- is it correct ?

2- does the ID that I gain from the blast results is acceptable for software such as KOBAS ?

3- in this manner that you have learned me, which one I must use as "background :

the Zebrafish Protein sequence
or Zebrafish DNA ?

ADD REPLYlink written 2.8 years ago by Farbod3.3k
2

Have you thought about how you are going to reconcile the ~600K results? You not only need to worry about the e-value but also the coverage (how long are the HSP's) in the hits. Ideally you want fuller length the better.

In a way it is good that you have gotten around to doing this since it will tell you how many unique entities there are out of those 600K.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by genomax68k
1

Hi Dear genomax2,

Yes you are right but I intend to find the enriched Go terms and Pathway in my ~300 DEG, first.

As an expert are you agree with this strategy we have spoken above? blasting the DEGs sequence against Zebrafish Ensemble protein instead of NCBI nr ?

The problem is this that the Gene/Protein ID which NCBI nr blastX is provided (TRINITY_GG_4_c0_g1_i1 ----- gi|1020474214|ref|XP_016142334.1|) for my de novo transcripts is not accepted in DAVID and KOBAS,

And I do not know how to take my steps after the blasting of my DEGs to rich to enriched pathways.

I will appreciate any suggestions!

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Farbod3.3k
1

I see.

How many hits did you collect from your nr blastx? Perhaps there were other hits that had real gene names you could use.

If you are only doing 300 you could do them in batches at NCBI. Choose to limit your search to Danio rerio using "Organism" box.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by genomax68k
1

Doing only the 300 DEG is ok if you just want to have an idea of the processes affected. But IMO, for genuine GO enrichment analysis, you need to do them all in order to have the right background set.

ADD REPLYlink written 2.8 years ago by Carlo Yague4.5k

I agree. How many of those 600K are mappable/real is something I have tried to get @Farbod to think about/work on across many threads :-)

ADD REPLYlink written 2.8 years ago by genomax68k
1
  1. Yes this is the idea, but as genomax pointed out, finding the right filters might be tricky and require some testing.
  2. I don't know but you should be able to convert the ID anyway (Ensembl biomart is great for that).
  3. If you blast on the protein sequences, your background will be the proteins IDs of the Zebrafish that have homologs in your fish of interest. Of course you will have to define what is a homolog (see 1-).
ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Carlo Yague4.5k
1

Dear @Carlo and @Genomax, Hi

I have run blastX of my whole transcripts against Zefrafish Ensembl protein database with the evalue = 1e-6 and outfmt=6.

There is about 129337 hit for my dataset.

the head of results is same as below :

TRINITY_DN212724_c0_g1_i1 ... ENSDARP00000061738 ... 6e-66 209

TRINITY_DN212791_c0_g1_i1 ... ENSDARP00000015862 ... 4e-59 189

TRINITY_DN212791_c0_g2_i1 ... ENSDARP00000015862 ... 6e-65 205

TRINITY_DN212749_c0_g2_i1 .... ENSDARP00000137663 ... 3e-12 64.7

TRINITY_DN211975_c0_g1_i1 ... ENSDARP00000029733 ... 1e-34 131

So, I think these whole 129337 Ensembl IDs is my "background" and WHAT IS THE NEXT STEP ?

Do I must use the DEGs that have Zebrafish Ensemble protein hits as query and use some tools same as KOBAS ?

(may be any other strategy/tools that you prefer?)

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Farbod3.3k
1

You should make a non-redundant set of those 129337 ID's which should be easy enough to do with sort | uniq command in unix.

On a different note you can see that (since I don't see the complete ID's I will use this as an example)

TRINITY_DN212791_c0_g1_i1 ... ENSDARP00000015862 ... 4e-59 189
TRINITY_DN212791_c0_g2_i1 ... ENSDARP00000015862 ... 6e-65 205

These two different transcripts seem to hit the same gene so they are subsets of each other and should be condensed into one (if possible). This is what I have meant all along that your 600K set may boil down to 1/10th the size if you do this exercise.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by genomax68k
1

Dear genomax,

Thanks for all your helps

The "600K set may boil down to 1/10th the size" could be done for ~100,000 transcripts that **have hits

with zebrafish proteins**, but How can I try the same practice for ~500,000 other transcripts ?

ADD REPLYlink written 2.8 years ago by Farbod3.3k

Interesting.

If ~500K transcripts have no hits to zebrafish proteins then many of those may be artifacts of assembly (seems unlikely that your fish has a vastly different/large array of transcripts/genes than zebrafish).

I recall you saying that you had already tried CD-HIT-EST analysis on this set elsewhere. Is that correct? If not, that would be one thing to try on all 600K.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by genomax68k
1

Hi, and yes,

I have used CD-hit but it reduced only ~100,000 of those transcripts if i remember correctly. So if

about 100,000 of them have hit with Zebrafish proteins, I will have about 300,000 more

transcripts. If I use NCBI nr for annotating them, then I can not use them in any KEGG or

Pathway analysis.

The matter of using Zebrafish database is this that this fish is same as mouse among the fishes and it is very well annotated.

I even think about using other fishes and even human protein database from ensemble and run a blastX against them. but for example the Gar fish has some thousand hits more than zebrafish and human has some hundred less than zebrafish.

So, what I must do with these ~300.000 !

ADD REPLYlink written 2.8 years ago by Farbod3.3k
1

How many of your DEG's are in the "unknown" pool at the moment? You could put some additional effort towards annotating that set first.

If you are truly interested in annotating the rest of "unknowns" then it would take several rounds of careful blasting/domain searching/alignments etc to make sense of them. This would be much more a manual intensive, than an automated process. If you have the time/energy to do it then by all means.

They should also be verified in independent samples (by RT-PCR etc) that they are real and present.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by genomax68k

Do I must use the DEGs that have Zebrafish Ensemble protein hits as query and use some tools same as KOBAS ?

Yes, you should try something like that :)

ADD REPLYlink written 2.8 years ago by Carlo Yague4.5k
1

Hi,

For sites like David, the Ensemble Gene ID but not Protein IDs is acceptable, So I have used BioDBnet to

convert my Ensebmle Protein IDs to Ensenml Gene IDs but it seems that there is a limitation for the number on entry for this converting process.

How I can convert my ~100,000 IDs in BioDBnet ?

Thank you in advance

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Farbod3.3k
1

I don't know BioDBnet but you can do that easily in Ensembl biomart. Select dataset danio rerio genes, then go to filter -> Input external references ID list -> Ensembl protein ID(s), then to results.

But the number of genes might be a limitation here too so you can try to install biomart-R and do the querry locally.

ADD REPLYlink written 2.8 years ago by Carlo Yague4.5k

Dear Carlo, Hi.

That works!

Now I have about 22178 unique Ensemble Gene Ids that are gained from converting Ensembl Pr IDs using Ensebml biomart and I have used them in Gorilla as "background list".

Then I will find the Ensemble Gene IDs of my DEGs transcripts and use them as "target list" to check for any pathway enrichment.

ADD REPLYlink written 2.8 years ago by Farbod3.3k
1

You can get this file from ensembl and parse out the gene ID's that correspond to your ENSDA#.

ADD REPLYlink written 2.8 years ago by genomax68k
1

Dear genomax2, Hi

Now I have my background list (about 20,000 unique Ensembl protein/gene IDs that has resulted from blastx-ing my all de dovo transcripts against Ensembl zebrafish protein),

but I do not know how I must use my "target list" for pathway enrichment purposes:

Imagine I have about 60 Ensemble unique proteins that are upregulated in female samples and

about 40 Ensemble unique proteins that are upregulated in male samples (these are my DEG

transcriptomes that I have discovere using edgeR/DESeq2 from my de novo assembly and then I

have blastx-ed them against zebrafish Ensembl protein database).

.

Now which one I must use as target list in GOrilla or David ?

1- both males and females DEGs (100 IDs) ? (if yes, how I can conclude about the male and female separately?)

2- one time for males (40 IDs) and one time for females (60 IDs) ? (if No. 2 is correct) How I could merge the results of two pathway or GO enrichment ?

Thank you for all your supports

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Farbod3.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2011 users visited in the last hour