Question: how to use CD_HIT to remove the redundant sequence from trinity output file
0
gravatar for wu.zhiqiang.1020
9 weeks ago by
United States
wu.zhiqiang.102020 wrote:

Dear all, I am trying to use CD-hit to remove the duplicates from the file that is the output from trinity (RNA seq assembly).

I used the following parameters:

cd-hit-est -i in.fasta -o out_cdhit90.fasta -c 0.90 -n 9 -d 0 -M 0 -T 0

But the output file still contains lots of small or fragmented sequence plus the best one. How can I remove those small or fragmented duplicates by changing the parameters?

thanks

ZQ

rna-seq assembly • 218 views
ADD COMMENTlink modified 9 weeks ago by st.ph.n730 • written 9 weeks ago by wu.zhiqiang.102020

Are you trying to get the longest isoform per gene cluster?

ADD REPLYlink written 9 weeks ago by st.ph.n730

yes, I just want to have one longest isoform. How can I realize that? ZQ

ADD REPLYlink written 9 weeks ago by wu.zhiqiang.102020
0
gravatar for Sej Modha
9 weeks ago by
Sej Modha1.2k
Glasgow, UK
Sej Modha1.2k wrote:

You can change -c and -t parameters on cd-hit that control the identity threshold and redundancy tolerance.

More info: http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide

ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by Sej Modha1.2k

you mean -c and -t right?

ADD REPLYlink written 9 weeks ago by wu.zhiqiang.102020
0
gravatar for st.ph.n
9 weeks ago by
st.ph.n730
Philadelphia, PA
st.ph.n730 wrote:

Trinity FASTA entry example:

>TRINITY_DN1000|c115_g5_i1 len=247 path=[31015:0-148 23018:149-246]
  

AATCTTTTTTGGTATTGGCAGTACTGTGCTCTGGGTAGTGATTAGGGCAAAAGAAGACAC ACAATAAAGAACCAGGTGTTAGACGTCAGCAAGTCAAGGCCTTGGTTCTCAGCAGACAGA AGACAGCCCTTCTCAATCCTCATCCCTTCCCTGAACAGACATGTCTTCTGCAAGCTTCTC CAAGTCAGTTGTTCACAGGAACATCATCAGAATAAATTTGAAATTATGATTAGTATCTGA TAAAGCA

Description of Trinity headers: 'TRINITY_DN1000|c115_g5' corresponds to

'gene id: TRINITY_DN1000|c115_g5' encoding 'isoform id: TRINITY_DN1000|c115_g5_i1'.

I prefer single line FASTA, borrowed from answer on this post:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < Trinity.fasta > Trinity_single.fasta

Here's a python script to get the longest isoform per gene:

#!/usr/bin/env python

import sys
from collections import defaultdict

transcripts = defaultdict(list)
with open(sys.argv[1], 'r') as f:
     for line in f:
        if line.startswith(">"):
            transcripts['_'.join(line.strip().split(' ')[0].split('|')[1].split('_')[:2])].append(next(f).strip())  # Groups isoforms by gene id

with open(sys.argv[1].split(".fasta")[1], 'w') as out:
    for t in transcripts:
        print >> out, '>' + t, max(transcripts[t], key=len)

Save as get_longest_iso.py, run as python get_longest_iso.py Trinity_single.fasta

ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by st.ph.n730

it would be helpful to keep the longest isoforms. but I still could not remove the duplicates from other contrig. right? thanks

ADD REPLYlink written 9 weeks ago by wu.zhiqiang.102020

Trinity should have done the gene clustering for you. This script will take the longest isoform that exists within a gene cluster based on gene ids.

ADD REPLYlink written 9 weeks ago by st.ph.n730

thanks。 I need to read Trinity again.

ADD REPLYlink written 9 weeks ago by wu.zhiqiang.102020

Hi st.ph.n I tried your script, it shown error as

Traceback (most recent call last):

File "get_longest_iso.py", line 8, in <module>

if line.startswith(">"):

NameError: name 'line' is not defined

could you help me to fix this? I do not know the python. thanks

My trinity out start as:

TRINITY_DN62966_c0_g1_i1 len=352 path=[571:0-44 572:45-65 573:66-351] [-1, 571, 572, 573, -2]

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by wu.zhiqiang.102020

Sorry forgot a for statement, wrote it a little too quickly.

ADD REPLYlink written 9 weeks ago by st.ph.n730
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 481 users visited in the last hour