I performed a de novo assembly with Trinity using mouse heart RNA-seq reads. Then I mapped the assembled transcriptome back to the reference genome with BLAT. I also used kallisto to quantify transcript abundance in each sample. Now I want to know which Trinity IDs correspond to transcripts that are already in the annotation, what their names are, and which ones are not annotated. How can I do that?
Now, for a correct answer.
You are working with mouse, which is a really well-annotated organism, so a Trinity assembly will be imprecise by comparison. You would be much better off using Cufflinks with the reference genome and annotation; if you are looking for novel isoforms or things like that, it will find them for you. Are you trying to achieve something in particular?
Cufflinks will also give you the correspondence between its gene names and the official ones.
Disclaimer: I forgot that Trinity outputs a FASTA file and not a GTF or BED. Bad answer, but it might be useful to someone.
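Since the question already has BLAT alignments of the Trinity contigs, one possible way to get intervals despite the FASTA-only output is to turn the PSL file into BED-like records. The following is only a minimal sketch, assuming standard 21-column, tab-separated PSL output from a nucleotide-vs-nucleotide BLAT run; the function names `psl_to_bed6` and `best_hit_per_contig` are made up for illustration, and keeping the alignment with the most matching bases is just one simple choice of "best" hit.

```python
def psl_to_bed6(psl_lines):
    """Convert BLAT PSL rows into (chrom, start, end, name, matches, strand).

    PSL target coordinates are 0-based and half-open, like BED.
    The fifth field is the raw PSL match count, not a 0-1000 BED score.
    """
    records = []
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the PSL header, blank lines, and anything that is not
        # a 21-column alignment row starting with an integer.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        matches = int(fields[0])   # column 1: matching bases
        strand = fields[8]         # column 9: strand
        contig = fields[9]         # column 10: query (Trinity contig) name
        chrom = fields[13]         # column 14: target (chromosome) name
        start = int(fields[15])    # column 16: target start
        end = int(fields[16])      # column 17: target end
        records.append((chrom, start, end, contig, matches, strand))
    return records


def best_hit_per_contig(records):
    """Keep, for each contig, the alignment with the most matching bases."""
    best = {}
    for rec in records:
        contig, matches = rec[3], rec[4]
        if contig not in best or matches > best[contig][4]:
            best[contig] = rec
    return best
```

Writing the surviving records out tab-separated gives a file that bedtools can work with. Contigs with several near-equal hits (paralogs, repeats) deserve a manual look rather than a blind best-hit pick.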
I had the same issue with another tool (https://github.com/shenkers/isoscm). Your best bet is to use bedtools/bedops. My scripts are not really portable (I am working on it; who knows, it could become a small methods article), but:
- I use bedtools merge to merge the transcripts of each de novo gene into one single maximum-length transcript with no introns (minimum start position, maximum end position)
- I do the same with the official annotation
- I use bedtools intersect to get a (hopefully) one-to-one correspondence
- You need to use the -s (strand-specific) flag.
- I check whether I have a true one-to-one correspondence: are there unassigned transcripts, and, more importantly, do I have several genes overlapping the same transcript? If you are unlucky and have very similar sequences close by, you may get fused transcripts where your alignment software misplaces one half of a read pair. The assembly software then outputs a single, really long gene with lots of introns instead of separate genes. The alignment software should have an option for maximum intron size that you can fiddle with (conversely, if it is set too short, you split a gene with a large intron into two genes).
- You have different transcripts for each gene due to alternative splicing, polyadenylation, and alternative TSSs. Merging the transcripts resolves this issue for me.
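The steps above can be sketched in plain Python to show the logic. This is not my actual scripts and not a replacement for bedtools: it assumes you already have transcript intervals as (gene_id, chrom, start, end, strand) tuples for both the de novo assembly and the official annotation, and all function names and category labels are made up for illustration. The all-against-all loop is fine for a toy example; for a whole genome, bedtools intersect with -s does the same job far more efficiently.

```python
from collections import defaultdict


def collapse_to_gene_spans(transcripts):
    """Collapse all transcripts of a gene into one intron-less span
    (minimum start, maximum end). Assumes every transcript of a gene
    sits on the same chromosome and strand."""
    spans = {}
    for gene_id, chrom, start, end, strand in transcripts:
        if gene_id in spans:
            _, old_start, old_end, _ = spans[gene_id]
            spans[gene_id] = (chrom, min(old_start, start), max(old_end, end), strand)
        else:
            spans[gene_id] = (chrom, start, end, strand)
    return spans


def stranded_overlaps(denovo_spans, reference_spans):
    """Map each de novo gene to the reference genes it overlaps on the
    same chromosome and the same strand (half-open coordinates)."""
    hits = defaultdict(list)
    for d_id, (d_chrom, d_start, d_end, d_strand) in denovo_spans.items():
        for r_id, (r_chrom, r_start, r_end, r_strand) in reference_spans.items():
            same_place = d_chrom == r_chrom and d_strand == r_strand
            if same_place and d_start < r_end and r_start < d_end:
                hits[d_id].append(r_id)
    return hits


def classify(denovo_spans, hits):
    """Label each de novo gene by how cleanly it maps to the annotation."""
    # Count how many de novo genes touch each reference gene.
    ref_usage = defaultdict(int)
    for ref_ids in hits.values():
        for r_id in ref_ids:
            ref_usage[r_id] += 1

    labels = {}
    for d_id in denovo_spans:
        ref_ids = hits.get(d_id, [])
        if not ref_ids:
            labels[d_id] = "unassigned"        # candidate novel gene
        elif len(ref_ids) > 1:
            labels[d_id] = "possible_fusion"   # spans several annotated genes
        elif ref_usage[ref_ids[0]] > 1:
            labels[d_id] = "possible_split"    # annotated gene hit by several de novo genes
        else:
            labels[d_id] = "one_to_one"
    return labels
```

Typical use would be to call collapse_to_gene_spans on both interval sets, pass the two results to stranded_overlaps, and then feed the hits to classify. Anything that does not come out as one_to_one is what I would go and look at in IGV.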
I would first take a look at what Cufflinks does, since it is pretty good for de novo assembly with a reference. Its 3' UTRs are often screwy, though. In all cases I often use IGV to look at my reads, and I have good depth to start with after pooling several biological replicates.