Question

Extracting sequences from FASTA beginning with common 5' end

5.6 years ago

khcole • 0

Hello,

I am trying to figure out the best way to extract sequences from a FASTA file that begin with a common 5' region of 43 nucleotides. Preferably, I would like to allow for some "fuzziness" in this region to accommodate mutations or read overlaps. The idea is that any sequence beginning with this region, regardless of its length, would be extracted into a new file.

Example of the FASTA file:

>1-69050-454.08
GTACGGGGAAGGACGTCAATAGTC

>2-65989-433.95
AATCTGTAGTACGACTCACTATAGC

>3-62181-408.91
GATCTGTAATACGACTCACTATAGG

>4-49959-328.53
GGGGAAGGACGTCAATAGTCACAC

What I would like to get from the code is:

>2-65989-433.95
AATCTGTAGTACGACTCACTATAGC

>3-62181-408.91
GATCTGTAATACGACTCACTATAGG

The FASTA file is quite large, and I have tried a couple of grep and awk methods to retrieve the sequences. Any help you could provide is much appreciated.
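One way to read the requirement above is as a mismatch-tolerant prefix test: keep a record if its first bases differ from the common region at no more than a chosen number of positions. The following is a minimal Python sketch of that idea, not a definitive solution. The names `read_fasta`, `prefix_mismatches` and `filter_by_prefix` are hypothetical helpers, and `PREFIX` is a stand-in taken from the first 24 nt of record 3, because the real 43-nt region is not given in the post. It counts substitutions only (no insertions or deletions) and skips sequences shorter than the prefix; both are assumptions.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA handle, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def prefix_mismatches(seq, prefix):
    """Count substitutions between prefix and the start of seq.

    Returns None when seq is shorter than prefix (assumption: such reads are skipped).
    """
    if len(seq) < len(prefix):
        return None
    return sum(a != b for a, b in zip(seq.upper(), prefix.upper()))


def filter_by_prefix(records, prefix, max_mismatches):
    """Yield only the records whose 5' end matches prefix within max_mismatches."""
    for header, seq in records:
        n = prefix_mismatches(seq, prefix)
        if n is not None and n <= max_mismatches:
            yield header, seq


# Hypothetical stand-in for the real 43-nt region (first 24 nt of record 3).
PREFIX = "GATCTGTAATACGACTCACTATAG"

EXAMPLE = """>1-69050-454.08
GTACGGGGAAGGACGTCAATAGTC

>2-65989-433.95
AATCTGTAGTACGACTCACTATAGC

>3-62181-408.91
GATCTGTAATACGACTCACTATAGG

>4-49959-328.53
GGGGAAGGACGTCAATAGTCACAC
"""

kept = list(filter_by_prefix(read_fasta(io.StringIO(EXAMPLE)), PREFIX, max_mismatches=2))
for header, seq in kept:
    print(f">{header}\n{seq}")
```

For a large file, the same generators can be fed from `open("input.fasta")` and each kept record written straight to an output handle instead of being collected in a list, so only one record is held in memory at a time. The file name here is a placeholder.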

sequence next-gen • 779 views

Are you familiar with any scripting language (Perl, Python), or Linux command-line tools only?

Also: how fuzzy do you want to be?

Reply by lieven.sterck, 5.6 years ago

I have some knowledge of Python and Perl, but most of my experience is purely in bash.

I have been working a little with Biopython but have yet to find a way to do the extraction with it.

Reply by khcole, 5.6 years ago

If you really want to be fuzzy, you'll probably be best off building a model and screening the sequences with it. I'm thinking of a PWM or an HMM. If the fuzziness is limited (e.g. only a single or a few specific positions), you could indeed construct a simple regex and screen the sequences with awk or grep.
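To illustrate the limited-fuzziness case, here is a small Python sketch of the regex idea. The pattern is hypothetical: it is derived only from the two wanted example sequences in the question, allowing any base at position 1 and A or G at position 9. The real 43-nt region and its variable positions would need to be substituted. `PATTERN` and `starts_with_pattern` are illustrative names, not part of any library.

```python
import re

# Hypothetical pattern: any base at position 1, A or G at position 9,
# all other positions fixed. Replace with the real region and its variable sites.
PATTERN = re.compile(r"[ACGT]ATCTGTA[AG]TACGACTCACTATAG", re.IGNORECASE)


def starts_with_pattern(seq):
    """Return True if seq begins with the pattern (re.match anchors at the start)."""
    return PATTERN.match(seq) is not None


# The four example sequences from the question.
examples = {
    "1-69050-454.08": "GTACGGGGAAGGACGTCAATAGTC",
    "2-65989-433.95": "AATCTGTAGTACGACTCACTATAGC",
    "3-62181-408.91": "GATCTGTAATACGACTCACTATAGG",
    "4-49959-328.53": "GGGGAAGGACGTCAATAGTCACAC",
}

matching = [header for header, seq in examples.items() if starts_with_pattern(seq)]
for header in matching:
    print(f">{header}\n{examples[header]}")
```

A character class per variable position stays readable for a handful of sites; once many positions can vary, or insertions and deletions matter, the model-based route (PWM or HMM) is likely the better fit.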

Reply by lieven.sterck, 5.6 years ago