I haven't found an ORF finder that uses the offsets in the headers. Someone far better at coding than me helped me with this solution:
library(Biostrings)
# Load sequences
fasta_file <- "C:/Users/data/XXspecies.fasta"
seqs <- readDNAStringSet(fasta_file)
# Function to extract transcript offset
extract_offset <- function(header) {
match <- regexpr("transcript offset:\\d+", header)
if (match == -1) return(0) # If not found, don't trim
offset_str <- regmatches(header, match)
as.integer(sub("transcript offset:", "", offset_str))
}
# Extract offsets
offsets <- sapply(names(seqs), extract_offset)
# Trim off the first 'offset' nucleotides
trimmed_seqs <- DNAStringSet(mapply(function(seq, offset) {
subseq(seq, start = offset + 1)
}, seqs, offsets, SIMPLIFY = FALSE))
names(trimmed_seqs) <- names(seqs)
# Function to trim a sequence to the first ATG and in-frame stop codon
trim_to_coding_region <- function(seq) {
seq_char <- as.character(seq)
atg_pos <- regexpr("ATG", seq_char)[1]
if (atg_pos == -1) return(DNAString("")) # No start codon
# Scan in-frame triplets starting from ATG
for (i in seq(from = atg_pos, to = nchar(seq_char) - 2, by = 3)) {
codon <- substr(seq_char, i, i+2)
if (codon %in% c("TAA", "TAG", "TGA")) {
return(DNAString(substr(seq_char, atg_pos, i+2))) # Inclusive of stop codon
}
}
return(DNAString("")) # No in-frame stop codon found
}
# Apply to all sequences
trimmed_seqs_2 <- DNAStringSet(lapply(trimmed_seqs, trim_to_coding_region))
names(trimmed_seqs_2) <- names(trimmed_seqs_2)
# Write to new FASTA
writeXStringSet(trimmed_seqs_2, "trimmed_sequences.fasta")
However, this code doesn't work on sequences that don't contain a stop codon. I'd be interested in any other solutions out there.
give us a full example please (fasta...)
I'd use transdecoder for this if possible , or maybe the newer alternative TD2 - https://github.com/Markusjsommer/TD2/wiki