Question: Sub Sequence extract from single Fasta sequence
0
gravatar for gnomicData
18 months ago by
gnomicData0
gnomicData0 wrote:

Hi everyone,

I am trying to isolate a set of sequence ranging from 6 to 15 nucleotide in size from a single DNA sequence about 130 nt long. The goal is to isolate all sequences with size ranges starting from 6 nucleotide upto 15 nucleotide long. These need to be all seqeunces as well and not unique sequences only. For example, if my single large 130 nt DNA sequence is AATTGCGCTAGCGACTAGG........AGAGATAAGAGAGATTCTCTCA then starting from the left end, I will need AATTGC as well as ATTGCG and so on. Basically all possible 6 nucleotide sequences from the whole single 130 nt DNA sequence file. This pattern needs to be repeated for other sequence sizes as well (up to 15 nucleotides) Technically, it would be best to get the outputs in individual files (where each files will contain all possible sequences categorized by the size like 8, 9 , 10 ....up to 15). However, its also okay if the entire output is generated into a single file.

I need this code to work on R ( I am working with RStudio).

Currently I have tried splitseq but I really cannot loop with it. There i am providing a single value every time for the “frame” option and again a single value for the “word” option.

Something like:

splitseq(CDS, frame = 0, word = 6)   ## word=7, 8 ,9 etc for other nucleotide sizes extraction
[1] "TATCGT" "GCAGCA" "TCGGAC" "TAGCAT" "CGATCG" "ATTTAC" "ATCATC"

where CDS is the variable assigned with the sequence (~ 130nt)

so to get then next frame

splitseq(CDS, frame = 1 , word = 6)
[1] "ATCGTG" "CAGCAT" "CGGACT" "AGCATC" "GATCGA" "TTTACA" "TCATCT"

etc....

But this is not really automating the procedure and it will take me forever to complete the 6 nucleotide work, not to mention the rest of the nucleotide sizes upto 15 (word= 7, 8, 9,.... upto 15).

I have also come across a previous post here that talked about using a combination of DNAstrings, Views from the Biostrings package. However the problem is I need to loop through all sequences from 6 to 15 nucleotide size for all positions in this ~ 130 nt fasta sequence.

Any scripts/ suggestions will be most appreciated !

Thanks very much !!

sequence • 348 views
ADD COMMENTlink modified 18 months ago • written 18 months ago by gnomicData0
1

If R software is not a strict requirement then use jellyfish to get these counts (http://www.genome.umd.edu/jellyfish.html ).

ADD REPLYlink written 18 months ago by GenoMax93k
1

Also seqkit sliding:

for i in {6..15}; do seqkit sliding --step 1 --window ${i} test.fa > word${i}.fa; done
ADD REPLYlink modified 18 months ago • written 18 months ago by AK1.9k

Unfortunately I needed this in R only. But always good to know about tools (jellyfish) and their potentials. Thanks @Genomax !

ADD REPLYlink written 18 months ago by gnomicData0
1
gravatar for AK
18 months ago by
AK1.9k
AK1.9k wrote:

Perhaps like this?

CDS <- "AATTGCGCTAGCGACTAGG"

for (w in 6:15) {
  sink(paste0("word", w, ".fa"))
  for (start in 1:(nchar(CDS) - w + 1)) {
    end <- start + w - 1
    sub_CDS <- substr(CDS, start, end)
    cat(paste0(">", w, "_", start, ":", end), "\n")
    cat(sub_CDS, "\n")
  }
  sink()
}
ADD COMMENTlink modified 18 months ago • written 18 months ago by AK1.9k

Absolutely fantastic ! Just what I needed Thank you very much !! @SMK

ADD REPLYlink written 18 months ago by gnomicData0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1121 users visited in the last hour