Question: How to merge similar transcripts together?
I have 8 transcriptomes that I assembled: 4 are from a gonad and 4 are from an ovary, with 2 groups within each tissue. I want to create a reference transcriptome containing the contigs from all 8 samples, annotate it, and use it for differential gene expression (DGE) analysis.

I concatenated the 8 files with cat, which gave me a 2 GB file with 1.8 million contigs. I then ran cd-hit-est with the following options:

"cd-hit-est -i reference.fa -o reference_90.fa -c 0.9 -T 0 -M 0"

The resulting file was still large: around 1.3M contigs and 1.6 GB. How can I merge the remaining contigs further? When I annotated this 1.3M-contig FASTA, almost every gene showed up multiple times, so there has to be a more effective way to merge similar transcripts. Can someone help me with this?

modified 3.5 years ago by Brian Bushnell (16k) • written 3.5 years ago by satshil.r (50)


1.6 GB for a transcriptome? 1.3 million contigs? Either this is not a good assembly, or you are working with some crazy organism. What kind of organism are you working on?

written 3.5 years ago by h.mon (25k)

It's a multi-kmer approach, so it's technically multiple assemblies combined together. That's why I want to reduce the redundancy. The assembly itself is good.

written 3.5 years ago by satshil.r (50)
The BBMap package has a program called Dedupe that collapses exact or contained duplicate sequences. Usage: dedupe.sh in=contigs.fa out=nodupes.fa

If you want to avoid absorbing different-length transcripts, you can use the flag "minlengthpercent=90" or similar; if you want to allow up to 3 substitutions difference, you can use the flag "s=3"; etc.
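
Putting those flags together, a run might look roughly like the sketch below. The file names are placeholders (reference.fa is assumed to be the concatenated assembly from the question, and reference_dedupe.fa is a made-up output name), the count_contigs helper is only there for illustration, and combining s=3 with minlengthpercent=90 is just one possible setting, not a tuned recommendation. The script skips the Dedupe step if dedupe.sh or the input file is missing.

```shell
#!/usr/bin/env bash
# Sketch: collapse duplicate/contained contigs with Dedupe (BBMap package)
# and compare contig counts before and after.

# Illustrative helper: count FASTA records by counting header lines.
count_contigs() {
    grep -c '^>' "$1"
}

in_fa="reference.fa"          # assumed: concatenated assemblies from the question
out_fa="reference_dedupe.fa"  # hypothetical output name

if command -v dedupe.sh >/dev/null 2>&1 && [ -s "$in_fa" ]; then
    # s=3: allow up to 3 substitutions between duplicates
    # minlengthpercent=90: only absorb contigs at least 90% as long as the larger one
    dedupe.sh in="$in_fa" out="$out_fa" s=3 minlengthpercent=90
    echo "contigs before: $(count_contigs "$in_fa")"
    echo "contigs after:  $(count_contigs "$out_fa")"
else
    echo "dedupe.sh or $in_fa not found; skipping Dedupe step" >&2
fi
```

Comparing the two counts gives a quick sense of how much redundancy was removed before you re-run the annotation.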

written 3.5 years ago by Brian Bushnell (16k)
