Find all fragments of length n in a larger sequence
3.1 years ago
myrmex • 0

I am working on designing dsRNA and I want to check each smaller section of the sequence for possible off-target effects.

I have a 500 bp sequence. I want to write a script that extracts all possible 20 bp fragments within the longer sequence. I am interested in automating this rather than doing it manually because I may repeat the process several times.

Then I want to BLAST each one of those 20 bp sequences against the honey bee genome (with the GOI masked) to make sure each fragment doesn't have perfect alignment anywhere other than the GOI.

Any help is much appreciated!

alignment • 645 views

Dear myrmex, you may be interested in having a look at our SEDA software (https://www.sing-group.org/seda/download.html). It contains several functions for filtering, transforming, and manipulating FASTA files, including operations for running batch BLAST queries (https://www.sing-group.org/seda/manual/index.html). Regards.

3.1 years ago

Here's a way of obtaining the 20-mers in R using the Biostrings package (Bioconductor):

> library(Biostrings)
> DNA_ALPHABET
[1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" "."
> seq <- paste(sample(DNA_ALPHABET[1:4], size = 500, replace = TRUE), collapse = "")
> seq <- DNAString(seq)
> seq
500-letter "DNAString" instance
seq: ATTCAAGTAGTAGTTACGGGAATGCCCACAGGGGCCAAGCGCAGTAGAAGGTACCTCCACCGTGCATTGACGGATGGGAGCCTGTGATGCCCGCAATGGTGAGTAAACTCCTGAAG...CTGCAGGTTCCAAACCAGACGCGTTTCCGGTGCAGTAGACGATATACCGATTACGGTCCAAGCTAGCAAGGGGTAGTCGCGAGGTCACCAGCCATCCGAAGGACGCGCCCAGAAA
> views <- Views(seq, start = 1:481, end = 20:500)
> views
Views on a 500-letter DNAString subject
subject: ATTCAAGTAGTAGTTACGGGAATGCCCACAGGGGCCAAGCGCAGTAGAAGGTACCTCCACCGTGCATTGACGGATGGGAGCCTGTGATGCCCGCAATGGTGAGTAAACTCCTGA...GCAGGTTCCAAACCAGACGCGTTTCCGGTGCAGTAGACGATATACCGATTACGGTCCAAGCTAGCAAGGGGTAGTCGCGAGGTCACCAGCCATCCGAAGGACGCGCCCAGAAA
views:
      start end width
  [1]     1  20    20 [ATTCAAGTAGTAGTTACGGG]
  [2]     2  21    20 [TTCAAGTAGTAGTTACGGGA]
  [3]     3  22    20 [TCAAGTAGTAGTTACGGGAA]
  [4]     4  23    20 [CAAGTAGTAGTTACGGGAAT]
  [5]     5  24    20 [AAGTAGTAGTTACGGGAATG]
  ...   ... ...   ... ...
[477]   477 496    20 [CCATCCGAAGGACGCGCCCA]
[478]   478 497    20 [CATCCGAAGGACGCGCCCAG]
[479]   479 498    20 [ATCCGAAGGACGCGCCCAGA]
[480]   480 499    20 [TCCGAAGGACGCGCCCAGAA]
[481]   481 500    20 [CCGAAGGACGCGCCCAGAAA]
> twenty.mers <- DNAStringSet(views)
> twenty.mers
A DNAStringSet instance of length 481
      width seq
  [1]    20 ATTCAAGTAGTAGTTACGGG
  [2]    20 TTCAAGTAGTAGTTACGGGA
  [3]    20 TCAAGTAGTAGTTACGGGAA
  [4]    20 CAAGTAGTAGTTACGGGAAT
  [5]    20 AAGTAGTAGTTACGGGAATG
  ...   ... ...
[477]    20 CCATCCGAAGGACGCGCCCA
[478]    20 CATCCGAAGGACGCGCCCAG
[479]    20 ATCCGAAGGACGCGCCCAGA
[480]    20 TCCGAAGGACGCGCCCAGAA
[481]    20 CCGAAGGACGCGCCCAGAAA
> twenty.mers[1]
A DNAStringSet instance of length 1
    width seq
[1]    20 ATTCAAGTAGTAGTTACGGG
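
Since you mention repeating this several times, the steps above could be wrapped in a small function and the fragments written to a FASTA file to use as a BLAST query. This is only a sketch: the function name extract_kmers, the fragment naming scheme, and the output file name are made up for illustration, and the random sequence stands in for your real 500 bp target.

```r
library(Biostrings)

# Hypothetical helper (not part of Biostrings): return every k-length
# window of a DNA sequence as a named DNAStringSet.
extract_kmers <- function(dna, k = 20) {
  dna <- DNAString(as.character(dna))
  n <- length(dna)
  if (k > n) stop("k is longer than the sequence")
  starts <- seq_len(n - k + 1)
  ends <- starts + as.integer(k) - 1L
  kmers <- DNAStringSet(Views(dna, start = starts, width = k))
  # Illustrative naming scheme: kmer_<start>_<end>
  names(kmers) <- paste0("kmer_", starts, "_", ends)
  kmers
}

# Random placeholder sequence; replace with the real target sequence.
set.seed(1)
target <- paste(sample(c("A", "C", "G", "T"), 500, replace = TRUE), collapse = "")
kmers <- extract_kmers(target, k = 20)

# Write all fragments to a FASTA file (file name is an arbitrary choice).
out_file <- "twenty_mers.fasta"
writeXStringSet(kmers, filepath = out_file)
```

The names record where each fragment came from, so any BLAST hit can be traced back to its position in the original sequence.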


For the BLAST step, you could try the matchPattern function (Biostrings package), as described in the first answer at the following link.
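
To give a rough idea of what that looks like, here is a minimal sketch. Note that matchPattern does exact (or bounded-mismatch) string matching rather than a BLAST search, which is enough if you only care about perfect alignments. The "genome" below is a toy sequence built for illustration; a real run would use the masked honey bee genome, and loading it is not shown here.

```r
library(Biostrings)

# One 20-mer to test (taken from the example output above).
fragment <- DNAString("ATTCAAGTAGTAGTTACGGG")

# Toy subject (an assumption for illustration): one copy of the fragment
# on the forward strand and one copy on the reverse strand.
genome <- DNAString(paste0(
  "TTTT",
  as.character(fragment),
  "ACACACAC",
  as.character(reverseComplement(fragment)),
  "TTTT"
))

# Exact matches on the forward strand.
fwd_hits <- matchPattern(fragment, genome, max.mismatch = 0)

# Exact matches on the reverse strand, found by searching for the
# reverse complement of the fragment.
rev_hits <- matchPattern(reverseComplement(fragment), genome, max.mismatch = 0)

start(fwd_hits)
start(rev_hits)

# Total number of perfect matches on both strands.
n_hits <- length(fwd_hits) + length(rev_hits)
```

A fragment with any perfect match in the masked genome would then be flagged as a possible off-target. Raising max.mismatch lets you look for near-perfect matches as well.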