I have several mRNA-Seq datasets from mixed bacterial communities, i.e. metatranscriptome data. I am only interested in a single species in these communities and would like to find out which transcripts in the community belong to this species.
My approach so far was to do a transcriptome assembly. I then tried using BLAST to align the assembled transcripts against a FASTA file of reference transcripts (obtained from the annotated genome).
I am a little skeptical of the results. If I add up the lengths of all the assembled transcripts that produce a hit (with a strict e-value cutoff), the total is much longer than all the transcripts of the organism combined. Is there a possible reason for this (other than the assembly/alignment not working at all)?
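To make that comparison concrete, here is a minimal sketch of the calculation. It assumes the BLAST results are in the 12-column tabular format (`-outfmt 6`, with the assembled transcripts as queries) and that the transcript lengths are already available in a dictionary. The function name `summarize_hits`, the `query_lengths` argument and the default cutoff are made up for illustration. It reports both the summed full length of every transcript with a hit and the number of query bases actually covered by alignments, since the two can differ a lot when only part of a transcript aligns.

```python
def summarize_hits(blast_lines, query_lengths, max_evalue=1e-20):
    """Return (summed full length of hit queries, summed aligned query bases).

    blast_lines: iterable of BLAST tabular lines (assumed -outfmt 6 columns:
        qseqid sseqid pident length mismatch gapopen qstart qend
        sstart send evalue bitscore).
    query_lengths: hypothetical dict mapping qseqid -> transcript length.
    """
    intervals = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if float(fields[10]) > max_evalue:
            continue  # hit does not pass the e-value cutoff
        qstart, qend = sorted((int(fields[6]), int(fields[7])))
        intervals.setdefault(fields[0], []).append((qstart, qend))

    # Naive total: every transcript with at least one hit counts in full.
    full_total = sum(query_lengths[q] for q in intervals)

    # Aligned total: merge overlapping query intervals, then count bases once.
    aligned_total = 0
    for spans in intervals.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:
                cur_end = max(cur_end, end)
            else:
                aligned_total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        aligned_total += cur_end - cur_start + 1
    return full_total, aligned_total
```

If the first number is far larger than the second, most of the summed length comes from transcript regions that did not align to the reference at all.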
Is this the right approach for what I am trying to find out, or did I make a mistake along the way? Should I align the reference transcripts to the assembled transcripts, or vice versa?
Thanks in advance for any help!
Did you try using Kraken to find them?
Thank you for the answer. Maybe that is the best approach. I am building the Kraken database right now and will report back on how well that works out.
Would you recommend classifying the reads with Kraken and then assembling transcripts from the ones that interest me? Or should I assemble transcripts from all reads and then try to classify those transcripts?
I would recommend (at least this is my approach) first classifying the reads with Kraken and then assembling them. After classification, the reads for whichever species you want can be extracted by filtering on the taxonomy ID in the Kraken output file.
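The filtering step could look roughly like the sketch below. It assumes the standard tab-separated Kraken output (classification status `C`/`U`, read ID, taxonomy ID, ...) with plain numeric taxonomy IDs, i.e. not run with an option that adds names to that column, and it assumes the read IDs in the FASTQ headers match the IDs in the Kraken output exactly. The function names are made up for illustration, and the taxonomy IDs in the usage example are arbitrary. Note that it only matches the exact taxonomy ID, so reads assigned to a strain below the species or to a higher rank would be missed.

```python
def read_ids_for_taxid(kraken_lines, target_taxid):
    """Collect IDs of classified reads assigned exactly to target_taxid."""
    wanted = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "C" and taxid == str(target_taxid):
            wanted.add(read_id)
    return wanted


def filter_fastq(fastq_lines, wanted_ids):
    """Keep 4-line FASTQ records whose read ID is in wanted_ids.

    Assumes strict 4-line records (no wrapped sequences).
    """
    kept = []
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header = record[0][1:].split()  # drop '@', ignore description
            if header and header[0] in wanted_ids:
                kept.extend(record)
            record = []
    return kept


# Usage with tiny in-memory examples (taxonomy IDs chosen arbitrarily).
kraken_output = [
    "C\tread1\t562\t100\t562:66",
    "U\tread2\t0\t100\t0:66",
    "C\tread3\t1280\t100\t1280:66",
]
fastq = [
    "@read1 extra", "ACGT", "+", "IIII",
    "@read2", "TTTT", "+", "IIII",
]
ids = read_ids_for_taxid(kraken_output, 562)
species_reads = filter_fastq(fastq, ids)
```

With real data both functions can be fed open file handles instead of lists, and the filtered reads can then be passed to the assembler.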