I am currently working on an RNAseq data set from 2 conditions in 1 species. For this species the genome is good (I have full chromosomes) and well annotated. I used the new Tuxedo pipeline to analyse my results. This worked very well.
I have now retrieved some RNAseq data from a very closely related species with the same 2 conditions. I would like to perform the same analysis on this species and find, for example, genes that are up-regulated in the first species but down-regulated in the other. But in this species, the genome is quite bad compared to the first one (400 000 scaffolds).
I thought about different ways of doing it:
1. Run the new Tuxedo pipeline again using the "bad" genome, then figure out which gene corresponds to which gene in the first species.
2. Directly use the genome of the first species for the second species (OK, it is different, but at least it is good).
3. Use the transcripts I assembled in the first species to quantify their expression with the reads of the second species (i.e. not using the new Tuxedo pipeline).
I don't know which solution would be better. Any ideas are welcome, thanks!
EDIT:
Overall alignment rate for the "good" genome RNAseq vs the "good" genome: 90%
Overall alignment rate for the "bad" genome RNAseq vs the "good" genome: 47%
Overall alignment rate for the "bad" genome RNAseq vs the "bad" genome: 92%
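For anyone who wants to collect these rates from several runs, here is a minimal sketch that pulls the percentages out of an aligner summary. It assumes the summary follows the HISAT2/Bowtie2-style text layout for single-end reads; the example text, its numbers, and the function name `parse_alignment_summary` are made up for illustration and are not output from the runs above.

```python
import re

# Example text in the style of a HISAT2/Bowtie2 single-end alignment summary.
# The numbers are invented for illustration only.
EXAMPLE_SUMMARY = """\
1000 reads; of these:
  1000 (100.00%) were unpaired; of these:
    530 (53.00%) aligned 0 times
    400 (40.00%) aligned exactly 1 time
    70 (7.00%) aligned >1 times
47.00% overall alignment rate
"""


def _first_percent(pattern, text):
    """Return the first captured percentage as a float, or None if absent."""
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None


def parse_alignment_summary(text):
    """Extract overall, uniquely mapped and multi-mapped percentages.

    Assumption: single-end layout as in EXAMPLE_SUMMARY. Paired-end
    summaries contain several similar lines and need extra handling.
    """
    return {
        "overall": _first_percent(r"([\d.]+)% overall alignment rate", text),
        "unique": _first_percent(r"\(([\d.]+)%\) aligned exactly 1 time", text),
        "multi": _first_percent(r"\(([\d.]+)%\) aligned >1 times", text),
    }


rates = parse_alignment_summary(EXAMPLE_SUMMARY)
print(rates)
```

Looking at the uniquely mapped fraction separately from the overall rate is useful here, because multi-mapping reads inflate the overall figure without helping per-gene quantification.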
What about performing de novo transcriptome assembly for the species with the fragmented genome? rnaSPAdes works fine for me.
I could add this idea to the previous list, but would it be better and why?
Yes, exactly as Carlo asked: option 2 depends on how closely related the genomes are, which you could probably judge by comparing the % of aligned reads (excluding multi-hits). Option 1 would also work after a genome refinement (filtering redundant scaffolds, low-coverage scaffolds, etc.).
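As a rough illustration of the refinement step, the sketch below drops scaffolds with low mean read depth and scaffolds whose sequence exactly duplicates one already kept. Everything here is an assumption for the example: the function name `filter_scaffolds`, the `min_depth` threshold of 5, and the toy data. Exact duplication is also a much narrower notion of "redundant" than a real refinement tool would use (it ignores reverse complements and near-identical scaffolds).

```python
def filter_scaffolds(scaffolds, mean_depth, min_depth=5.0):
    """Keep scaffolds that are well covered and not exact duplicates.

    scaffolds:  dict mapping scaffold name -> sequence string
    mean_depth: dict mapping scaffold name -> mean read depth
    min_depth:  arbitrary illustrative threshold, not a recommendation
    """
    kept = {}
    seen_sequences = set()
    for name, seq in scaffolds.items():
        # Scaffolds with no depth value are treated as uncovered.
        if mean_depth.get(name, 0.0) < min_depth:
            continue
        key = seq.upper()
        # Skip a scaffold whose sequence was already kept under another name.
        if key in seen_sequences:
            continue
        seen_sequences.add(key)
        kept[name] = seq
    return kept


# Toy data, invented for the example.
toy_scaffolds = {"s1": "ACGT", "s2": "acgt", "s3": "GGCC", "s4": "TTTT"}
toy_depth = {"s1": 10.0, "s2": 12.0, "s3": 1.0, "s4": 8.0}

print(list(filter_scaffolds(toy_scaffolds, toy_depth)))
```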
I am aligning on the "good" genome to see if the percentage of uniquely mapped reads is ok.
How closely related are the two species? (% genome identity, etc.)
You can always try option 3, because it is the easiest, and see how it goes. If most of your reads are unmapped, then I'm afraid that you will have to use option 1 (or de novo transcriptomic assembly).
I edited my post: 47% of the reads from species 2 align to the genome of my first species.
OK, it's not that bad, but not so good either. Now it is up to you to decide:
Is it OK to miss about 50% of the information with option 3, knowing that the interpretation of the results will be simpler (you can easily compare differentially expressed genes from species 1 and 2)? Or do you have the time and resources to do a more complex analysis involving (A) either the Tuxedo pipeline or de novo transcriptome assembly and (B) finding the homologs between species 1 and 2? The second option will obviously take more time but will probably be more accurate.
You can also do both analyses and see whether they converge (hopefully) on the same conclusion.
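Once step (B) has produced a table of homolog pairs, the cross-species comparison itself is simple. Here is a minimal sketch, assuming one-to-one homolog pairs and per-gene differential expression results given as (log2 fold change, adjusted p-value). The gene names, values, the 0.05 cutoff and the function name `opposite_direction_genes` are all hypothetical.

```python
def opposite_direction_genes(de_sp1, de_sp2, homologs, alpha=0.05):
    """Find homolog pairs regulated in opposite directions.

    de_sp1, de_sp2: dict mapping gene id -> (log2_fold_change, adjusted_p)
    homologs:       list of (species1_gene, species2_gene) one-to-one pairs
    alpha:          significance cutoff applied in both species
    """
    hits = []
    for gene1, gene2 in homologs:
        # Skip pairs where either gene was not tested in its species.
        if gene1 not in de_sp1 or gene2 not in de_sp2:
            continue
        lfc1, padj1 = de_sp1[gene1]
        lfc2, padj2 = de_sp2[gene2]
        significant = padj1 < alpha and padj2 < alpha
        # A negative product means the fold changes have opposite signs.
        if significant and lfc1 * lfc2 < 0:
            hits.append((gene1, gene2, lfc1, lfc2))
    return hits


# Toy results, invented for the example.
de_species1 = {"geneA": (2.1, 0.001), "geneB": (-1.5, 0.01), "geneC": (1.2, 0.20)}
de_species2 = {"scafX_g1": (-1.8, 0.002), "scafY_g7": (-0.9, 0.03), "scafZ_g3": (-2.0, 0.001)}
homolog_pairs = [("geneA", "scafX_g1"), ("geneB", "scafY_g7"), ("geneC", "scafZ_g3")]

for hit in opposite_direction_genes(de_species1, de_species2, homolog_pairs):
    print(hit)
```

Genes without a confident one-to-one homolog are simply left out in this sketch, so the result is only as complete as the homolog table.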