How to merge similar transcripts together?
5.9 years ago
satshil.r ▴ 50

Hello,

I have 8 transcriptomes which I assembled. 4 are from a gonad and 4 are from an ovary. There are also 2 groups within each tissue. I want to create a reference transcriptome containing the contigs from all 8 of my samples, annotate it, and use it for differential gene expression (DGE) analysis.

I concatenated the 8 files with cat, which gave me a 2 GB file containing 1.8 million contigs. I then ran cd-hit-est with the following options:

cd-hit-est -i reference.fa -o reference_90.fa -c 0.9 -T 0 -M 0

The resulting file was still large: around 1.3M contigs and 1.6 GB. How can I merge the remaining contigs further? When I annotated this 1.3M-contig FASTA, almost every gene appeared multiple times, so there has to be a more effective way to merge similar transcripts. Can someone help me out with this problem?

cd-hit fasta Assembly RNA-Seq • 2.0k views

1.6Gb for a transcriptome? 1.3 million contigs? This either is not a good assembly, or you are working with some crazy organism. What kind of organism are you working on?


It's a multi-kmer approach, so it's technically multiple assemblies combined together. That's why I want to reduce the redundancy. The assembly is good.

5.9 years ago

The BBMap package has a program called Dedupe that collapses exact or contained duplicate sequences. Usage:

dedupe.sh in=contigs.fa out=nodupes.fa

If you want to avoid absorbing transcripts of different lengths, you can use the flag minlengthpercent=90 or similar; if you want to allow up to 3 substitutions, you can use the flag s=3; etc.
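As a rough sketch of how those flags might be combined, the script below runs Dedupe with a 90% minimum length ratio and up to 3 substitutions, then reports contig counts before and after. The file names and the count_contigs helper are placeholders for illustration, not part of BBMap, and the script assumes dedupe.sh is on your PATH (it skips the run otherwise). Check the help text of your installed dedupe.sh for the exact flag behaviour in your version.

```shell
#!/bin/sh
# count_contigs is a hypothetical helper (not part of BBMap):
# it counts FASTA records by counting header lines that start with '>'.
count_contigs() {
    grep -c '^>' "$1"
}

# Placeholder file names; substitute your own.
in_fa="reference.fa"
out_fa="nodupes.fa"

# Only run if Dedupe is installed and the input file exists.
if command -v dedupe.sh >/dev/null 2>&1 && [ -f "$in_fa" ]; then
    # minlengthpercent=90: do not absorb much shorter transcripts
    # s=3: tolerate up to 3 substitutions between duplicates
    dedupe.sh in="$in_fa" out="$out_fa" minlengthpercent=90 s=3
    echo "contigs before: $(count_contigs "$in_fa")"
    echo "contigs after:  $(count_contigs "$out_fa")"
fi
```

Comparing the two counts gives a quick sense of how much redundancy was removed before you re-run the annotation.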
