I am trying to get read counts for DESeq2 analysis from metagenomic data. I have assembled contigs for all organisms using Trinity, and I would like to map the reads from each sample to these contigs to get the counts. Normally for RNA-seq we would use a GFF file to assign reads to annotated loci, but for metagenomic data I can't use one specific genome, so I want to use the Trinity-assembled contigs as the mapping reference. However, before proceeding with the read mapping, I would like to annotate each Trinity contig. I wonder if I can do a BLAST search against nr. What would be the easiest way to do this? Thanks for your help!
To get counts for each, you don't strictly need to identify them up-front. You could identify the DE ones first and only ID those :-)
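A minimal sketch of the "identify only the DE ones" idea, assuming you already have a list of differentially expressed contig IDs exported from DESeq2 and the Trinity assembly as FASTA text. The function name `subset_fasta` and the short contig IDs below are made up for illustration; only the subset it returns would need to go through the annotation search.

```python
def subset_fasta(fasta_text, wanted_ids):
    """Return FASTA text containing only the records whose ID is in wanted_ids."""
    kept = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # The record ID is the header text up to the first whitespace.
            parts = line[1:].split()
            keep = bool(parts) and parts[0] in wanted_ids
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Hypothetical example: three contigs, of which only c2 came out as DE.
assembly = ">c1 len=4\nACGT\n>c2 len=8\nGGGG\nCCCC\n>c3 len=4\nTTTT\n"
de_contigs = {"c2"}
print(subset_fasta(assembly, de_contigs), end="")
```

This keeps the search small, since the number of DE contigs is usually a fraction of the full assembly.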
You could follow these directions from Trinity for identification.
Edit: Since this is a metagenomic dataset these directions are not useful.
That is right, I was planning to do it the way you have suggested, but identifying the DE ones later would be a somewhat elaborate process. I thought identifying them at the beginning would reduce the work later.
So rather than identification per se you are looking to reduce redundancy so you don't have the same sequence represented multiple times?
Did you use TriMetAss (http://microbiology.se/software/trimetass/) instead of Trinity? That appears to be for metagenomic data.
No, these are not overlapping sequences, so I wanted to map them to the assembled reference. I haven't used TriMetAss, but will give it a try. Thanks!
Additionally, I just wanted to get the loci identified (as which gene, CDS, etc.) for each cluster of reads after mapping.
Since this is bacterial data you would expect the entire sequence to be coding. It may not be full length or start at the ATG, depending on how well the assembly worked.
As suggested, it should be OK to search using DIAMOND against nr (or a RefSeq bacterial database) to identify the contigs. It works well, but you would need ~80-100G of RAM for this search. You could also try magicblast from NCBI.
Thanks! I have used DIAMOND before, so yes, it makes sense.
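Once a DIAMOND search has finished, a common next step is to keep one hit per contig. Here is a rough sketch, assuming the results are in DIAMOND's default 12-column tabular layout (BLAST format 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name `best_hits` and the IDs in the example rows are invented for illustration, and "highest bit score wins" is just one reasonable rule, not the only one.

```python
def best_hits(tabular_text):
    """Map each query ID to its highest-bitscore hit as (subject, evalue, bitscore)."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        # Columns 11 and 12 of the assumed layout are e-value and bit score.
        evalue, bitscore = float(fields[10]), float(fields[11])
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Made-up rows: contig c1 has two hits, c2 has one.
rows = "\n".join([
    "c1\tP1\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200.0",
    "c1\tP2\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-60\t250.0",
    "c2\tP3\t80.0\t50\t10\t0\t1\t150\t1\t50\t1e-10\t60.5",
])
for contig, (subject, evalue, bitscore) in sorted(best_hits(rows).items()):
    print(contig, subject, evalue, bitscore)
```

If you request a custom set of output columns from DIAMOND, the field indices above would need to change accordingly.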
Out of sheer curiosity: What was your rationale for using Trinity? My apologies in case this question is merely based on my inexperience with Trinity: Why would you blast contigs against nr? Or do you get proteins? Is Trinity able to define gene boundaries in prokaryotic RNAseq data? Also, I think your GFF approach should work: you can handle contigs in a metagenome just like any other genome.
For contig annotation, Kraken is an excellent tool (though it lacks a good taxonomic binning algorithm, afaik), and as a faster blastp alternative, I recommend DIAMOND.
I just wanted to annotate the contigs, and I also don't think BLAST would be the best solution, which is why I was asking this question here. Since it is metatranscriptome data, I am not sure if I would be able to use GFF file(s). I am using the Trinity-assembled contigs as a reference genome to get read counts from the metatranscriptome data I have.
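For the counting step without a GFF, one possible route is to count reads per contig and merge the samples into a single table, since DESeq2 takes a matrix of raw integer counts with features as rows and samples as columns. The sketch below assumes (this is not something stated in the thread) that each sample's mapped reads were summarised with `samtools idxstats`, whose output is tab-separated: reference name, length, mapped reads, unmapped reads. The function names and the sample/contig IDs are hypothetical.

```python
def parse_idxstats(text):
    """Parse idxstats-style text into {contig: mapped_read_count}."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split("\t")[:4]
        if name != "*":  # "*" is the summary row for reads with no reference
            counts[name] = int(mapped)
    return counts


def count_matrix(per_sample):
    """Build a tab-separated contig x sample table; missing contigs count as 0."""
    samples = sorted(per_sample)
    contigs = sorted({c for counts in per_sample.values() for c in counts})
    lines = ["contig\t" + "\t".join(samples)]
    for contig in contigs:
        row = [str(per_sample[s].get(contig, 0)) for s in samples]
        lines.append(contig + "\t" + "\t".join(row))
    return "\n".join(lines) + "\n"


# Made-up idxstats output for two samples.
s1 = parse_idxstats("c1\t500\t10\t0\nc2\t300\t4\t1\n*\t0\t0\t7\n")
s2 = parse_idxstats("c1\t500\t3\t0\nc3\t200\t8\t0\n")
print(count_matrix({"s1": s1, "s2": s2}), end="")
```

The resulting table can be read into R and passed to DESeq2 as the count matrix; contig annotations can then be joined on afterwards by contig ID.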
Hi, I was just wondering if you ended up finding a way to annotate the contigs from Trinity?