Using transdecoder to filter a fasta file with preserving whole transcripts
1
0
Entering edit mode
2.1 years ago
poecile.pal ▴ 50

Good evening,

I have .fasta file transcripts_longestiso.fasta (output of rnaSPAdes) that looks like this

>NODE_1_length_24937_cov_13085.239743_g0_i0
CCTTTTTATTTCTCATCAAATGAATGGCATCTTCTTCTGGAA...
>NODE_7_length_14915_cov_350.128521_g1_i0
AAATAGAAGGGTCT...

etc

I need to discard transcripts with ORF length less than 100 amino acids (300 nucleotides). To do this, I launched

TransDecoder.LongOrfs -t transcripts_longestiso.fasta

I got longest_orfs.pep containing the predicted amino acid sequences, if I understand correctly. Like this:

>NODE_1_length_24937_cov_13085.239743_g0_i0.p1 type:complete len:247 gc:universal NODE_1_length_24937_cov_13085.239743_g0_i0:9145-9885(+)
MRPLEAP...
>NODE_1_length_24937_cov_13085.239743_g0_i0.p2 type:complete len:247 gc:universal NODE_1_length_24937_cov_13085.239743_g0_i0:15793-15053(-)
MRPLE...

etc

And longest_orfs.gff3

NODE_1_length_24937_cov_13085.239743_g0_i0      transdecoder    gene    1       24937   .       +       .
       ID=GENE.NODE_1_length_24937_cov_13085.239743_g0_i0~~NODE_1_length_24937_cov_13085.239743_g0_i0.p1;Name=ORF%20type%3Acomplete%20len%3A247%20%28%2B%29
NODE_1_length_24937_cov_13085.239743_g0_i0      transdecoder    mRNA    1       24937   .       +       .
       ID=NODE_1_length_24937_cov_13085.239743_g0_i0.p1;Parent=GENE.NODE_1_length_24937_cov_13085.239743_g0_i0~~NODE_1_length_24937_cov_13085.239743_g0_i0.p1;Name=ORF%20type%3Acomplete%20len%3A247%20%28%2B%29

And other files that are transdecoder output.

But I would like to get the same fasta file at the output, only filtered. With whole transcripts and in a nucleotide record. But those transcripts that do not meet my condition should not be included in this file.

For example, if in NODE_1_length_24937_cov_13085.239743_g0_i0 transcript transdecoder found an ORF with appropriate length, and in NODE_7_length_14915_cov_350.128521_g1_i0 - did not find it, then the output fasta file should be

>NODE_1_length_24937_cov_13085.239743_g0_i0
CCTTTTTATTTCTCATCAAATGAATGGCATCTTCTTCTGGAA...

Could you please tell me how to do this?

Thank you in advance!

Best regards, Poecile

transcriptome assembly rna-seq fasta transdecoder • 909 views
ADD COMMENT
0
Entering edit mode

It's tl;dr and confusing.

ADD REPLY
2
Entering edit mode
2.1 years ago
poecile.pal ▴ 50

I solved the problem on my own, and I hope this will help someone too)

1) Extract the first column with seqid from pep file (without > at the beginning and without .pN at the end)

$ sed -n '/^>/p' longest_orfs.pep > longest_orfs_ids.pep
$ cut -f1 -d ' ' < longest_orfs_ids.pep > longest_orfs_seqids_frompep.txt
$ cut -f1 -d 'p' < longest_orfs_seqids_frompep.txt > longest_orfs_seqids_withoutpN_frompep.txt
$ sed 's/^.//;s/.$//' longest_orfs_seqids_withoutpN_frompep.txt > longest_orfs_seqids_withoutpN_withoutdotsandticks_frompep.txt

2) It turns out a lot of duplicates. Remove them.

$ sort longest_orfs_seqids_withoutpN_withoutdotsandticks_frompep.txt | uniq > longest_orfs_seqids_withoutpN_withoutdotsandticks_frompep_uniq.txt

3) Extract from fasta transcripts listed in longest_orfs_seqid_uniq.txt

$ seqtk subseq transcripts_longestiso.fasta ./transcripts_longestiso.fasta.transdecoder_dir/longest_orfs_seqids_withoutpN_withoutdotsandticks_frompep_uniq.txt > transcripts_longestiso_longorfs.fasta
ADD COMMENT

Login before adding your answer.

Traffic: 2359 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6