Question

Clarification on conceptual question regarding ORF-calling

0

13 months ago

Daniel ▴ 30

Hello,

I have a conceptual question that I think I may have the answer to, but I would appreciate feedback. There are many ORF-calling tools, and many of them take a FASTA file as input (orfipy, for example). My question is: how do these tools use FASTA files as input when these files often contain sequencing reads that have not yet been aligned?

Since aligners such as STAR do not output FASTA files, why do these tools take FASTA files rather than a BAM file? I assume they need the aligned sequence (not just a read) so that they can call ORFs across the entire length of a gene.

Thus, if I want to use these tools, should I figure out how to take my aligned output (probably the BAM file) and convert it into a FASTA file where each entry is no longer a read but a transcript? I believe the FASTA file has to be a multi-FASTA file, but when I google this format, it is not clear whether it is meant for storing aligned sequences or just sequences from multiple FASTA files.

Thank you!

ORFIPY ORF • 773 views

score 0 · Answer 1 · 2023-04-14

0

13 months ago

Mensur Dlakic ★ 27k

Most ORF tools work on assemblies, where individual sequencing reads are joined by their overlaps into large contigs. Short reads are not long enough to predict ORFs from. As for transcripts, they may be missing start or stop codons, and in eukaryotes they lack introns.
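To make the length argument concrete, here is a minimal sketch of ORF scanning. It is not how orfipy or any other tool is implemented. It assumes the simplest definition of an ORF (ATG to the first in-frame stop codon, forward strand only), and `find_orfs`, `min_len`, and the toy sequences are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=90):
    """Return (start, end, frame) tuples for ATG..stop ORFs on the forward strand.

    Hypothetical helper: coordinates are 0-based, end is exclusive,
    and the stop codon is included in the reported span.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if end - start >= min_len:
                    orfs.append((start, end, frame))
                start = None  # look for the next ORF in this frame
    return orfs

# Toy "contig": a complete ORF (ATG + 40 codons + stop) with short flanks.
contig = "CC" + "ATG" + "GCT" * 40 + "TAA" + "GG"
# Toy "read": a 60 nt window from inside that ORF, with no start and no stop.
read = contig[20:80]

contig_orfs = find_orfs(contig)
read_orfs = find_orfs(read)
print(contig_orfs, read_orfs)
```

The contig yields one complete ORF, while the read taken from the middle of the same gene yields none, because neither the start nor the stop codon falls inside it.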

0

Thanks for your response. So when we are aligning sequencing reads, are we creating an assembly? If yes, I assume then there's a way to turn alignment output back into fasta format?

ADD REPLY • link 13 months ago by Daniel ▴ 30

1

Sequencing reads are typically aligned to either genome or transcriptome assemblies. Assemblies are already FASTA files, so there is no need for conversion. Instead, we predict ORFs directly from those assembly files, and the alignment step is unnecessary for that purpose.
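On the multi-FASTA question: a multi-FASTA file is simply a FASTA file with more than one `>` header, one per sequence (contig or transcript), and it has nothing to do with alignment. A rough sketch of reading one in plain Python follows; `read_fasta` and the contig names are hypothetical, and in practice a library parser would normally be used instead.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle.

    Hypothetical helper: the name is the first word after '>' and
    sequence lines are concatenated until the next header.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name = (line[1:].split() or [""])[0]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

# Made-up two-record multi-FASTA standing in for an assembly file.
example = io.StringIO(
    ">contig_1 made-up assembled contig\n"
    "ATGGCTGCTGCT\n"
    "GCTGCTTAA\n"
    ">contig_2\n"
    "ATGAAATAG\n"
)
records = dict(read_fasta(example))
for name, seq in records.items():
    print(name, len(seq))
```

Each record here is a whole assembled sequence rather than a read, which is the kind of input an ORF caller expects.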

"So when we are aligning sequencing reads, are we creating an assembly?"

Genomic assemblies are representations of genomic DNA sequences. The assemblies can be complete or not. Aligning sequencing reads to assemblies is done for different reasons, but most of them have nothing to do with ORF prediction.

ADD REPLY • link 13 months ago by Mensur Dlakic ★ 27k

0

I see. My rationale for wanting to use my aligned reads was that I wanted to predict ORFs on the real reads from my data (in case there are mutations), not on the assemblies they were mapped to. But given your answer, it seems the difference would be insignificant. Thank you!

ADD REPLY • link 13 months ago by Daniel ▴ 30