Hello,
I have assembled 8 transcriptomes: 4 from a gonad and 4 from an ovary, with 2 groups within each tissue. I want to build a single reference transcriptome containing the contigs from all 8 samples, annotate it, and use it for differential gene expression (DGE) analysis.
I concatenated the 8 files with cat, which gave a 2 GB file containing 1.8 million contigs. I then ran cd-hit-est with the following options:
"cd-hit-est -i reference.fa -o reference_90.fa -c 0.9 -T 0 -M 0"
The resulting file was still large, at around 1.3 million contigs and 1.6 GB. How can I merge the remaining contigs further? When I annotated this 1.3M-contig FASTA, almost every gene appeared multiple times, so there must be a more effective way to merge similar transcripts. Can someone help me with this?
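For anyone wanting to reproduce the before/after numbers, here is a minimal sketch that counts contigs and total bases in a FASTA file. The function name fasta_stats is made up for illustration and is not part of cd-hit. It assumes a plain, uncompressed FASTA in which every record starts with a ">" header line.

```shell
# Hypothetical helper (not part of cd-hit): print "<contigs><TAB><total bases>"
# for a plain, uncompressed FASTA file.
fasta_stats() {
    awk '
        /^>/ { n++; next }                  # header line: count one more contig
        {
            gsub(/[ \t\r]/, "")             # drop stray whitespace and CR characters
            bases += length($0)             # add the length of this sequence line
        }
        END { printf "%d\t%d\n", n, bases } # contig count, then total bases
    ' "$1"
}
```

Running fasta_stats reference.fa and then fasta_stats reference_90.fa shows how much the clustering step actually removed, in both contig count and total sequence.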
1.6 GB for a transcriptome? 1.3 million contigs? Either this is not a good assembly, or you are working with some crazy organism. What kind of organism are you working on?
It's a multi-k-mer approach, so technically it is multiple assemblies combined. That is why I want to reduce the redundancy. The assembly is good,