Question: how to use CD_HIT to remove the redundant sequence from trinity output file
0
gravatar for wu.zhiqiang.1020
4 months ago by
United States
wu.zhiqiang.102020 wrote:

Dear all, I am trying to use CD-hit to remove the duplicates from the file that is the output from trinity (RNA seq assembly).

I used the following parameters:

cd-hit-est -i in.fasta -o out_cdhit90.fasta -c 0.90 -n 9 -d 0 -M 0 -T 0

But the output file still contains lots of small or fragmented sequence plus the best one. How can I remove those small or fragmented duplicates by changing the parameters?

thanks

ZQ

rna-seq assembly • 294 views
ADD COMMENTlink modified 4 months ago by st.ph.n990 • written 4 months ago by wu.zhiqiang.102020

Are you trying to get the longest isoform per gene cluster?

ADD REPLYlink written 4 months ago by st.ph.n990

yes, I just want to have one longest isoform. How can I realize that? ZQ

ADD REPLYlink written 4 months ago by wu.zhiqiang.102020
0
gravatar for Sej Modha
4 months ago by
Sej Modha1.5k
Glasgow, UK
Sej Modha1.5k wrote:

You can change -c and -t parameters on cd-hit that control the identity threshold and redundancy tolerance.

More info: http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide

ADD COMMENTlink modified 4 months ago • written 4 months ago by Sej Modha1.5k

you mean -c and -t right?

ADD REPLYlink written 4 months ago by wu.zhiqiang.102020
0
gravatar for st.ph.n
4 months ago by
st.ph.n990
Philadelphia, PA
st.ph.n990 wrote:

Trinity FASTA entry example:

>TRINITY_DN1000|c115_g5_i1 len=247 path=[31015:0-148 23018:149-246]
  

AATCTTTTTTGGTATTGGCAGTACTGTGCTCTGGGTAGTGATTAGGGCAAAAGAAGACAC ACAATAAAGAACCAGGTGTTAGACGTCAGCAAGTCAAGGCCTTGGTTCTCAGCAGACAGA AGACAGCCCTTCTCAATCCTCATCCCTTCCCTGAACAGACATGTCTTCTGCAAGCTTCTC CAAGTCAGTTGTTCACAGGAACATCATCAGAATAAATTTGAAATTATGATTAGTATCTGA TAAAGCA

Description of Trinity headers: 'TRINITY_DN1000|c115_g5' corresponds to

'gene id: TRINITY_DN1000|c115_g5' encoding 'isoform id: TRINITY_DN1000|c115_g5_i1'.

I prefer single line FASTA, borrowed from answer on this post:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < Trinity.fasta > Trinity_single.fasta

Here's a python script to get the longest isoform per gene:

#!/usr/bin/env python

import sys
from collections import defaultdict

transcripts = defaultdict(list)
with open(sys.argv[1], 'r') as f:
     for line in f:
        if line.startswith(">"):
            transcripts['_'.join(line.strip().split(' ')[0].split('|')[1].split('_')[:2])].append(next(f).strip())  # Groups isoforms by gene id

with open(sys.argv[1].split(".fasta")[1], 'w') as out:
    for t in transcripts:
        print >> out, '>' + t, max(transcripts[t], key=len)

Save as get_longest_iso.py, run as python get_longest_iso.py Trinity_single.fasta

ADD COMMENTlink modified 4 months ago • written 4 months ago by st.ph.n990

it would be helpful to keep the longest isoforms. but I still could not remove the duplicates from other contrig. right? thanks

ADD REPLYlink written 4 months ago by wu.zhiqiang.102020

Trinity should have done the gene clustering for you. This script will take the longest isoform that exists within a gene cluster based on gene ids.

ADD REPLYlink written 4 months ago by st.ph.n990

thanks。 I need to read Trinity again.

ADD REPLYlink written 4 months ago by wu.zhiqiang.102020

Hi st.ph.n I tried your script, it shown error as

Traceback (most recent call last):

File "get_longest_iso.py", line 8, in <module>

if line.startswith(">"):

NameError: name 'line' is not defined

could you help me to fix this? I do not know the python. thanks

My trinity out start as:

TRINITY_DN62966_c0_g1_i1 len=352 path=[571:0-44 572:45-65 573:66-351] [-1, 571, 572, 573, -2]

ADD REPLYlink modified 4 months ago • written 4 months ago by wu.zhiqiang.102020

Sorry forgot a for statement, wrote it a little too quickly.

ADD REPLYlink written 4 months ago by st.ph.n990
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 467 users visited in the last hour