Trim CDS to ORF
3
0
5 weeks ago
jaredbernard ▴ 30

Does anyone know how to extract CDS from transcripts? I have transcripts structured like this:

Xspecies1234-mRNA transcript offset:272 AED:0.06 eAED:0.06 QI:272|1|1|1|0.8|0.83|6|66|318

Note the transcript offset field. Excluding the 272 bases before the offset start codon is not all I need to do to isolate the coding sequence, because the stop codon falls somewhere before the end of the transcript; in other words, the CDS runs from base offset + 1 through the first in-frame stop codon. I can fix them manually, but going through thousands of transcripts is extremely time-consuming. Any ideas on how to automate this for a multi-FASTA? I'm looking for a way to cut off not only the part before the offset but also the part after the first in-frame stop codon.

codon CDS offset transcripts • 1.2k views
ADD COMMENT
0

Give us a full example please (FASTA...)

ADD REPLY
0

I'd use TransDecoder for this if possible, or maybe the newer alternative TD2 - https://github.com/Markusjsommer/TD2/wiki

ADD REPLY
1
5 weeks ago
jaredbernard ▴ 30

I haven't found an ORF finder that uses the offsets in the headers. Someone far better at coding than me helped me with this solution:

    library(Biostrings)

    # Load sequences
    fasta_file <- "C:/Users/data/XXspecies.fasta"
    seqs <- readDNAStringSet(fasta_file)

    # Function to extract transcript offset
    extract_offset <- function(header) {
      match <- regexpr("transcript offset:\\d+", header)
      if (match == -1) return(0)  # If not found, don't trim
      offset_str <- regmatches(header, match)
      as.integer(sub("transcript offset:", "", offset_str))
    }

    # Extract offsets
    offsets <- sapply(names(seqs), extract_offset)

    # Trim off the first 'offset' nucleotides
    trimmed_seqs <- DNAStringSet(mapply(function(seq, offset) {
      subseq(seq, start = offset + 1)
    }, seqs, offsets, SIMPLIFY = FALSE))

    names(trimmed_seqs) <- names(seqs)

    # Function to trim a sequence to the first ATG and in-frame stop codon
    trim_to_coding_region <- function(seq) {
      seq_char <- as.character(seq)
      atg_pos <- regexpr("ATG", seq_char)[1]
      if (atg_pos == -1) return(DNAString(""))  # No start codon

      # Scan in-frame triplets starting from ATG
      for (i in seq(from = atg_pos, to = nchar(seq_char) - 2, by = 3)) {
        codon <- substr(seq_char, i, i+2)
        if (codon %in% c("TAA", "TAG", "TGA")) {
          return(DNAString(substr(seq_char, atg_pos, i+2)))  # Inclusive of stop codon
        }
      }
      return(DNAString(""))  # No in-frame stop codon found
    }

    # Apply to all sequences
    trimmed_seqs_2 <- DNAStringSet(lapply(trimmed_seqs, trim_to_coding_region))
    names(trimmed_seqs_2) <- names(trimmed_seqs)  # carry over the original headers

    # Write to new FASTA
    writeXStringSet(trimmed_seqs_2, "trimmed_sequences.fasta")

However, this code doesn't work on sequences that don't contain a stop codon. I'd be interested in any other solutions out there.
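For those no-stop sequences, one possible tweak (a sketch along the same lines, untested) is to fall back to the open frame, trimmed to the last complete codon, instead of returning an empty sequence:

    # Variant of trim_to_coding_region (sketch, untested; Biostrings is
    # loaded by the script above): when no in-frame stop is found, keep the
    # open frame from the first ATG to the last complete codon instead of
    # dropping the sequence entirely
    trim_to_coding_region2 <- function(seq) {
      seq_char <- as.character(seq)
      atg_pos <- regexpr("ATG", seq_char)[1]
      if (atg_pos == -1) return(DNAString(""))  # no start codon at all
      for (i in seq(from = atg_pos, to = nchar(seq_char) - 2, by = 3)) {
        codon <- substr(seq_char, i, i + 2)
        if (codon %in% c("TAA", "TAG", "TGA")) {
          return(DNAString(substr(seq_char, atg_pos, i + 2)))  # keep the stop
        }
      }
      # No stop codon: keep whole codons only, so the reading frame stays intact
      last_full <- atg_pos + 3 * ((nchar(seq_char) - atg_pos + 1) %/% 3) - 1
      DNAString(substr(seq_char, atg_pos, last_full))
    }

Swapping this in at the lapply() step above would keep those open-ended sequences instead of dropping them.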

ADD COMMENT
0

Is this a ChatGPT solution?

ADD REPLY
0

I tried ChatGPT, but it gave strange answers. See the attempt below. I think it was over-trimming or something, which left me with tiny sequences. I then tried to get ChatGPT to give me a check to see if it was over-trimming, and it gave me completely empty sequences. So AI couldn't ultimately help me much.

ADD REPLY
2
5 weeks ago
lievensterck

Why not use an ORF finder tool?

e.g. getorf, longestORF, TransDecoder, FrameD, ... many options. Unless you have prior knowledge of your transcripts/ORFs, you likely can't do a better job than any of those tools.

ADD COMMENT
0

Thanks, @lievensterck. I'll look into those to see if they can accommodate the designated offset.

ADD REPLY
0

What is that 'offset', actually? Where do you get it from?

I see you mentioned below that sometimes in the offset there might be another potential start codon, but not the correct one. How do you know that?

ADD REPLY
0

The transcripts I'm working with are already annotated, so the header includes where the proper coding sequence begins. It says "transcript offset" and then gives the number of bases into the read where the CDS begins. It may coincide with the longest ORF, but in any case, there are sometimes earlier start codons that are not in the proper ORF, or not the longest isoform, etc. So perhaps fetching the longest ORF would generally work -- I'm not sure -- but it seems a shame not to use the offset info provided in the headers.
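To illustrate with a toy, made-up 14-nt sequence carrying "transcript offset:5": there is a spurious ATG at position 2, inside the offset region, while the annotated CDS starts at offset + 1 = position 6.

    library(Biostrings)
    tx <- DNAString("CATGCATGAAATGA")  # made-up example sequence
    offset <- 5
    # Trimming to offset + 1 skips the spurious upstream ATG at position 2
    subseq(tx, start = offset + 1)  # ATGAAATGA = ATG AAA TGA (start, Lys, stop)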

ADD REPLY
0

OK, thanks.

Did MAKER (or BRAKER, or ... whatever annotation tool you used) not provide the corresponding CDS file for the annotations, then? Do you need to extract them yourself from the transcripts?

ADD REPLY
0

I needed to extract them from the transcripts. The genome was annotated by other researchers and I need to use them in my study. I was able to gather the genes with gene finder tools (OrthoFinder, genblastG, etc.) but then they seemed to be in the wrong ORF or something. I was going to give up on them but noticed the "transcript offset" feature in the headers. Luckily, the script above seems to have done the trick. :^)

ADD REPLY
0
5 weeks ago
iansudbery

One could write a very quick and easy script with Biopython/Biostrings (or another Bio* module for other languages) to do this.

Chop off the first 272 nucleotides (the offset), then move through three bases at a time looking for a stop codon.
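For example, a rough Biostrings sketch (untested; assumes the offset has already been parsed out of the header, and `trim_cds` is just a made-up name):

    library(Biostrings)
    # Sketch: drop `offset` leading bases, then walk codon by codon
    # until the first in-frame stop (TAA/TAG/TGA)
    trim_cds <- function(seq, offset) {
      s <- as.character(subseq(seq, start = offset + 1))
      if (nchar(s) < 3) return(DNAString(""))
      for (i in seq(from = 1, to = nchar(s) - 2, by = 3)) {
        if (substr(s, i, i + 2) %in% c("TAA", "TAG", "TGA"))
          return(DNAString(substr(s, 1, i + 2)))  # keep the stop codon
      }
      DNAString(substr(s, 1, 3 * (nchar(s) %/% 3)))  # no stop: open frame, whole codons
    }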

ADD COMMENT
0

Thanks, @iansudbery. I'm looking at Biostrings now, actually, using R.

getorf has an offset tag, but it appears to need all transcripts to have the same offset. In my situation, every transcript has a different offset, and sometimes the part within the offset contains a start codon, which would be the wrong one. I'm toying with code right now, but the problem is that my code as it stands chops off too much, as I'm not very skilled at coding. I think it's trimming the trimmed transcripts or something, so nothing is left in the end. Any thoughts?

    library(Biostrings)

    # Load the FASTA file
    fasta_file <- "/data/Xspecies.fasta"
    seqs <- readDNAStringSet(fasta_file)

    # Function to extract "transcript offset" from FASTA headers
    extract_offset <- function(header) {
      match <- regexpr("transcript offset:\\d+", header)
      if (match == -1) return(NA)
      offset_str <- regmatches(header, match)
      as.integer(sub("transcript offset:", "", offset_str))
    }

    # Function to trim from known offset to the first in-frame stop codon
    trim_from_offset <- function(seq, offset) {
      seq_char <- as.character(seq)
      if (is.na(offset) || offset > nchar(seq_char) - 3) return(DNAString(""))

      # Loop through in-frame codons starting at the offset
      for (i in seq(from = offset, to = nchar(seq_char) - 2, by = 3)) {
        codon <- substr(seq_char, i, i + 2)
        if (codon %in% c("TAA", "TAG", "TGA")) {
          return(DNAString(substr(seq_char, offset, i + 2)))  # inclusive of stop codon
        }
      }
      return(DNAString(""))  # no valid stop codon found
    }

    # Extract offsets from headers
    offsets <- sapply(names(seqs), extract_offset)

    # Apply trimming
    trimmed_seqs <- DNAStringSet(mapply(trim_from_offset, seqs, offsets, SIMPLIFY = FALSE))
    names(trimmed_seqs) <- names(seqs)

    # Optional: remove empty sequences where no valid ORF was found
    trimmed_seqs <- trimmed_seqs[width(trimmed_seqs) > 0]

    # Save to a new FASTA file
    writeXStringSet(trimmed_seqs, "trimmed_sequences_by_offset.fasta")
ADD REPLY
0

Your code looks fine to me from a logical perspective, but without examples to try it on, I can't say why it's not working. I wonder if it's a problem with base numbering: R is one-based. Is it possible that your offsets are 0-based? Can you check that the first codon after the offset is a Met?
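For instance, a quick check (sketch, reusing `seqs` and `offsets` from your script above):

    # Extract the codon immediately after each offset and count how many are ATG
    ok <- !is.na(offsets) & offsets + 3 <= width(seqs)
    first_codons <- as.character(subseq(seqs[ok], start = offsets[ok] + 1, width = 3))
    table(first_codons == "ATG")  # mostly TRUE -> the CDS starts at offset + 1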

ADD REPLY
0

Yes, the first codon after the offset is a methionine. The solution above worked by breaking the process into two phases.

ADD REPLY
