How many of the SSR-containing transcripts have an ORF?
4.0 years ago
Farbod ★ 3.3k

Dear Biostars, Hi

I have searched my transcripts (the longest isoform of each gene, from RNA-seq data) with MISA to report any potential SSRs. In total, 93022 sequences contain an SSR.

Q: How can I find out how many of these sequences/transcripts contain an ORF?

Thanks

NOTE:

I have also used TransDecoder to predict ORFs for my whole transcript set, but I cannot check all 93022 IDs against the TransDecoder results manually.

ORF MISA SSR RNA-Seq • 1.1k views
0
Entering edit mode

Can I collect the IDs of the SSR-containing transcripts in a text file and check for their counterparts in the Trinity.fasta.transdecoder.pep file using Linux command-line tools such as grep -F -f?
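A minimal sketch of that idea, under some assumptions that are not confirmed in this thread: the SSR transcript IDs sit one per line in a hypothetical file ssr_ids.txt, and each header in the .pep file starts with the transcript ID followed by a .pN suffix (for example >TRINITY_DN10003_c0_g1_i1.p1). The function name count_ssr_with_orf is made up for illustration. Whole-line matching (-x) is used so that an ID ending in _i1 does not also match _i10.

```shell
#!/usr/bin/env bash
# Count SSR-containing transcripts that have at least one predicted ORF.
# Assumptions (not from the thread):
#   $1 = text file with one SSR transcript ID per line (hypothetical ssr_ids.txt)
#   $2 = TransDecoder .pep file whose headers look like ">TRANSCRIPT_ID.pN ..."
count_ssr_with_orf() {
    local ssr_ids="$1" pep="$2"
    # Keep header lines, reduce each to its first token without ">" and ".pN",
    # de-duplicate, then count the IDs that appear verbatim in the SSR list.
    grep '^>' "$pep" \
        | sed -E 's/^>([^[:space:]]+).*/\1/; s/\.p[0-9]+$//' \
        | sort -u \
        | grep -c -F -x -f "$ssr_ids"
}

# Example call (file names are placeholders):
#   count_ssr_with_orf ssr_ids.txt Trinity.fasta.transdecoder.pep
```

Because the ORF IDs are de-duplicated before matching, a transcript with several predicted ORFs (.p1, .p2, ...) is counted once, so the result cannot exceed the number of IDs in the list.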

4.0 years ago
h.mon 33k

If I recall correctly (and I am mostly certain I do), Trinotate Transdecoder outputs a Trinity.fasta.transdecoder.bed. You could use this BED file to extract a FASTA of the ORFs and then predict SSRs with MISA on that file.

0
Entering edit mode

Hi @h.mon and thanks,

By Trinotate, you mean Transdecoder?

TransDecoder produces a .bed file too, as you mentioned.

Do you mean I should use that as my main transcript file in MISA?

NOTE:

The head of the BED file is:

track name='Trinity.fasta.transdecoder.gff3'

TRINITY_DN10003_c0_g1_i1 0 395 ID=TRINITY_DN10003_c0_g1_i1.p1;TRINITY_DN10003_c0_g1~~TRINITY_DN10003_c0_g1_i1.p1;ORF_type:5prime_partial_len:125_(+),score=5.10 0 + 2 377 0 1 395 0

TRINITY_DN100126_c0_g1_i1 0 624 ID=TRINITY_DN100126_c0_g1_i1.p1;TRINITY_DN100126_c0_g1~~TRINITY_DN100126_c0_g1_i1.p1;ORF_type:complete_len:120_(+),score=39.02 0 +

1
Entering edit mode

TransDecoder produces a .bed file too, as you mentioned. Do you mean I should use that as my main transcript file in MISA?

No, you should use something like bedtools getfasta and use the resulting fasta as input to MISA.

0
Entering edit mode

Thanks. It seems that bedtools getfasta has many switches and options.

Is the intent to combine the TransDecoder .bed file with the original Trinity.fasta?

1
Entering edit mode

Untested:

bedtools getfasta -fo orfs.fas -fi Trinity.fasta -bed Trinity.fasta.transdecoder.bed


You possibly want -split.

0
Entering edit mode

I created orfs.fas following your guidance.

I guess I can use it as the main MISA input file now, so why would I need the -split option?

0
Entering edit mode

It seems that I cannot use this approach, because I used all isoforms for the TransDecoder ORF prediction but only the longest isoform of each gene for SSR mining.

So my .bed file has many more entries than the FASTA file that was used for SSR mining. As a result, the number of SSR-containing transcripts that have an ORF could come out higher than the total number of SSR-containing sequences!

Maybe I could use a Linux script to collect the 93022 SSR transcript IDs, pull their ORF results from the TransDecoder .pep file, and then count them?

1
Entering edit mode

You can use the FASTA with the longest isoforms, provided the names of the sequences have not changed.

bedtools getfasta -fo orfs.fas -fi Trinity.longest.fasta -bed Trinity.fasta.transdecoder.bed


You will get a lot of warnings about names found in the bed file and absent in the fasta, as you removed sequences from the fasta.
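If only the count is needed, the BED file can also be compared against the SSR ID list directly, without extracting sequences. A rough sketch, assuming that column 1 of the BED file is the transcript ID (as in the head posted earlier in this thread) and that the SSR transcript IDs are stored one per line in a hypothetical ssr_ids.txt; the function name count_ssr_in_bed is likewise made up for illustration.

```shell
#!/usr/bin/env bash
# Count SSR-containing transcripts that have at least one ORF row in the BED file.
# Assumptions (not from the thread):
#   $1 = text file with one SSR transcript ID per line (hypothetical ssr_ids.txt)
#   $2 = TransDecoder BED file with the transcript ID in column 1
count_ssr_in_bed() {
    local ssr_ids="$1" bed="$2"
    awk '
        NR == FNR { if ($1 != "") ssr[$1] = 1; next }  # first file: load SSR IDs
        $1 == "track" { next }                         # skip the BED track header
        ($1 in ssr) && !seen[$1]++ { n++ }             # count each transcript once
        END { print n + 0 }
    ' "$ssr_ids" "$bed"
}

# Example call (file names are placeholders):
#   count_ssr_in_bed ssr_ids.txt Trinity.fasta.transdecoder.bed
```

Restricting the count to IDs in the SSR list sidesteps the isoform mismatch: ORF rows for isoforms that were never searched for SSRs are ignored, and a transcript with several ORFs is counted once.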