5.6 years ago
khcole
Hello,
I am trying to figure out the best way to extract sequences from a FASTA file that begin with a common 5' region of 43 nucleotides. Preferably, I would like to allow for some "fuzziness" in this region to account for mutations or read overlaps. The idea is that any sequence that begins with this region, regardless of its length, would be extracted into a new file.
Example of the FASTA file:
>1-69050-454.08
GTACGGGGAAGGACGTCAATAGTC
>2-65989-433.95
AATCTGTAGTACGACTCACTATAGC
>3-62181-408.91
GATCTGTAATACGACTCACTATAGG
>4-49959-328.53
GGGGAAGGACGTCAATAGTCACAC
And what I would like to get from the code is such:
>2-65989-433.95
AATCTGTAGTACGACTCACTATAGC
>3-62181-408.91
GATCTGTAATACGACTCACTATAGG
The FASTA file is quite large, and I have tried a couple of grep and awk methods to retrieve the sequences. Any help you could provide is much appreciated.
Are you familiar with any scripting language (Perl, Python), or only with Linux command-line tools?
Also: how fuzzy do you want the match to be?
I have some knowledge of Python and Perl, but most of my experience is purely in bash.
I have been working a little with Biopython but have yet to find a way to do the extraction with it.
If you really want to be fuzzy, you'll probably be best off building a model and screening the sequences with it. I'm thinking of a PWM or an HMM. If the fuzziness is limited (e.g. only a single or a few specific positions), you could indeed construct a simple regex and screen the sequences with awk or grep.
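For the limited-fuzziness case, the regex idea can be sketched in Python roughly as follows. This is only a minimal sketch. The actual 43-nt region is not given in the question, so the pattern below is a hypothetical one derived from the two wanted example records, with [AG] at the two positions where they differ. The names read_fasta and filter_by_prefix are made up for illustration, and only the standard library is used (no Biopython).

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical 5' pattern, inferred from the two wanted example records.
# Character classes such as [AG] mark the positions allowed to vary.
# Replace this with the real 43-nt region.
PREFIX = re.compile(r"[AG]ATCTGTA[AG]TACGACTCACTATAG")


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_prefix(records, pattern=PREFIX):
    """Keep only records whose sequence starts with the pattern."""
    for header, seq in records:
        # re.match anchors at the start of the string, so no '^' is needed.
        if pattern.match(seq.upper()):
            yield header, seq


if __name__ == "__main__":
    import sys

    # Usage: python filter_prefix.py input.fasta > matches.fasta
    with open(sys.argv[1]) as fasta:
        for header, seq in filter_by_prefix(read_fasta(fasta)):
            print(f">{header}\n{seq}")
```

Because records are streamed one at a time, memory use stays small even for a large file. If the variable positions are not known in advance, a regex like this stops being practical, and counting mismatches against the reference region or using a PWM/HMM is probably the better route.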