4.2 years ago
Joel Wallenius
Hello!
I googled but found only issues of RNA read quantification, which is fair enough but not the help I would like. I'm just curious what percentage of my DNA reads are within my organism's transcriptome.
I was going to do it with BWA but then I read that BWA is for short reads vs a large genome, and the transcriptome is obviously not as large as the corresponding genome...
Suggestions?
Big thanks in advance!
Joel
If you are aligning to a transcriptome, why not use salmon or kallisto instead? That said, when possible you should always align to the genome and then account for reads falling in the expressed part using a counting program like featureCounts.
There is no reference genome I'm afraid... the dinoflagellate genome is enormous, I find only bits and pieces of it at NCBI. I have the transcriptome only.
The transcriptome is CDS, sadly. I suppose this introduces a risk of false positives as the organism is a eukaryote, with exons all over the place. I don't need exact numbers though, so maybe that's fine. Regardless I don't see what options I have, really. There is no other reference... (I might be able to get my hands on the reads that built the transcriptome though).
Have you looked to see if NCBI has any EST datasets you could potentially use as a stand in?
How would that help? I'm unfamiliar with ESTs but based on what Wikipedia says they're just fragments of cDNA, i.e. they map to transcripts, so they have the same problem the cDNA sequences in my transcriptome do. :( Am I missing something?
ESTs would be better than using single exons/CDSs to count, but that is about it. Ideally you should do an RNAseq project of your own and then assemble your own transcriptome to get more definitive answers. I am curious as to why you just did DNA sequencing, or is this actually an RNA sequencing project (RNA --> DNA --> sequenced)?
I joined this project late so I can't explain the reasons why we have the data we have. We have RADseq DNA reads from all over the genome, and now we want to know approximately what percentage of those reads are within coding DNA. I can't think of a better analysis than mapping to CDS or ESTs, despite the flaws.
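For what it's worth, once the reads have been aligned to the CDS set with whatever aligner you end up using, the "approximate percentage" can be read straight off the SAM flags. Below is a minimal Python sketch, not anything from this thread: the function name `percent_mapped` and the toy records are made up for illustration, and it assumes plain-text SAM input (e.g. what `samtools view -h` prints). It counts each primary record once, so with paired-end data every mate counts as its own read. `samtools flagstat` reports a comparable figure.

```python
def percent_mapped(sam_lines):
    """Return the percentage of reads (primary records) that are mapped.

    sam_lines: any iterable of SAM-formatted text lines.
    Header lines (starting with '@') are ignored, and secondary (0x100)
    and supplementary (0x800) records are skipped so that each read is
    counted only once.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # blank line or SAM header
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2 of a SAM record is the bitwise FLAG
        if flag & 0x100 or flag & 0x800:
            continue  # secondary / supplementary alignment of a read already seen
        total += 1
        if not flag & 0x4:  # 0x4 set means "read unmapped"
            mapped += 1
    return 100.0 * mapped / total if total else 0.0


if __name__ == "__main__":
    # Toy records (hypothetical): two mapped reads, two unmapped reads,
    # plus one supplementary record that must not be counted.
    demo = [
        "@SQ\tSN:cds1\tLN:1000",
        "r1\t0\tcds1\t10\t60\t50M\t*\t0\t0\t*\t*",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
        "r3\t16\tcds1\t200\t60\t50M\t*\t0\t0\t*\t*",
        "r3\t2064\tcds1\t700\t60\t20M\t*\t0\t0\t*\t*",
        "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    ]
    print(f"{percent_mapped(demo):.1f}% of reads mapped to the CDS reference")
```

For a real file you would pass an open file handle instead of the list. Bear in mind that the number only says how many reads found a hit in the CDS reference; reads that straddle an exon boundary may align partially or not at all, so treat it as a rough estimate.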
That is a really unusual application of RADseq data, especially for an organism that has no genome/transcriptome available. "Do the best you can" is the only thing to say here.
I'll do that, then. Thanks :-]