Hello everyone,
I am trying to get my data into a workable data frame so that I can ultimately run DESeq2 at the gene level. I started with my sequencing samples and ran Trinity for de novo assembly, which gives a FASTA file. I then ran salmon against that FASTA file to get transcript-level abundances, which outputs one quant.sf file per sample. Next, I am trying to link transcript names to gene names using the Xenopus tropicalis GTF file (which ultimately goes into the tx2gene file). Then I run tximport to get gene-level estimates, and this is where the problem starts.
At this point, my tx2gene file has two columns (GENEID and TXNAME), and the contents of both look as you would expect for X. tropicalis (ENSXETG... and ENSXETT...). At the end of the import, I get the following error:
None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.
------
Example IDs (file): [TRINITY_DN88461_c0_g4_i1, TRINITY_DN88461_c0_g5_i1, TRINITY_DN88461_c0_g6_i1, ...]
------
Example IDs (tx2gene): [ENSXETG00000000002, ENSXETG00000000003, ENSXETG00000000004, ...]
------
This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'
I now realize that my de novo assembly uses Trinity's own transcript names (TRINITY_DN88461...), not Ensembl IDs.
So my question is: do I need to somehow annotate the de novo assembled transcriptome and use that instead of the Xenopus tropicalis GTF? If so, how do I go about obtaining a GTF file from a de novo transcriptome?
You do not have to run trinity if you are not interested in identifying new transcripts beyond what is available in the known transcriptome. In that case, using the transcriptome sequence from Ensembl for salmon analysis (with Ensembl IDs such as ENSXETG00000000003) will keep things simple. You can then use tximport on that output.
Note: Ensembl's version of the X. tropicalis transcriptome is available for download from Ensembl.
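If you go the Ensembl route, one way to get a tx2gene table without a GTF is to read it off the FASTA headers. This is only a rough sketch: it assumes the headers follow the usual Ensembl cDNA layout (transcript ID first, then a `gene:` field somewhere on the line), and the function name and the example header are made up for illustration. Note that it puts the transcript ID in the first column, which is the order tximport expects.

```python
import re

# Assumption: headers look like ">TRANSCRIPT_ID ... gene:GENE_ID ..."
GENE_FIELD = re.compile(r"\bgene:(\S+)")


def tx2gene_from_fasta_headers(lines):
    """Return (transcript_id, gene_id) pairs, transcript ID first."""
    pairs = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # empty header, nothing to parse
        match = GENE_FIELD.search(line)
        if match:
            pairs.append((fields[0], match.group(1)))
    return pairs


# Made-up header, for illustration only.
example = [
    ">ENSXETT00000000001.1 cdna gene:ENSXETG00000000002.1 gene_biotype:protein_coding",
    "ATGCATGC",
]

for transcript_id, gene_id in tx2gene_from_fasta_headers(example):
    print(f"{transcript_id}\t{gene_id}")
```

The IDs are kept exactly as they appear in the header, including any version suffix, so they should match what salmon writes to quant.sf. If they do not, that is the case the 'ignoreTxVersion' hint in the error message is about.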
Ah, apologies, I forgot to mention: these samples are actually from a species of poison dart frog, Dendrobates auratus. I only used the Xenopus GTF because it would be closer than human. Does that change your answer at all, or should I just use the Xenopus transcriptome?
If you only have the de novo transcriptome that you assembled using trinity, then you could simply use it for salmon analysis with the default IDs. I don't know how many transcripts you have, but you may want to make them non-redundant using CD-HIT before you do the salmon analysis.
You could try the analysis against the Xenopus transcriptome (hopefully the species are evolutionarily close enough) to get a rough idea about your data, but ultimately you will need to go back, repeat the analysis against your own transcriptome, and annotate the transcripts to find out what they are. Here is one post to get you started: Annotating sequences after de-novo Trinity assembly and RSEM analysis...there must be an easier way!
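For the gene-level step with the default Trinity IDs, you do not need a GTF at all: the Trinity "gene" is the transcript name without the trailing isoform suffix (`_i1`, `_i2`, ...). Below is a minimal sketch of building a tx2gene table that way, using the IDs from the error message above. The function names are just placeholders, and Trinity normally also writes its own gene-to-transcript map next to the assembly, which may be the simpler thing to use if you have it.

```python
import re

# Trinity transcript IDs end in an isoform suffix such as "_i1".
TRINITY_ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def trinity_gene_id(transcript_id):
    """Strip the trailing isoform suffix to get the Trinity gene ID."""
    return TRINITY_ISOFORM_SUFFIX.sub("", transcript_id)


def build_tx2gene(transcript_ids):
    """Return (transcript_id, gene_id) pairs, transcript ID first."""
    return [(tx, trinity_gene_id(tx)) for tx in transcript_ids]


# IDs taken from the tximport error message in the question.
ids = [
    "TRINITY_DN88461_c0_g4_i1",
    "TRINITY_DN88461_c0_g5_i1",
    "TRINITY_DN88461_c0_g6_i1",
]

for tx, gene in build_tx2gene(ids):
    print(f"{tx}\t{gene}")
```

Written out as a two-column, tab-separated file with the transcript column first, this should line up with the names in quant.sf, since salmon was run against the Trinity FASTA.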