Extracting sequences from FASTA beginning with common 5' end
5.6 years ago
khcole • 0

Hello,

I am trying to figure out the best way to extract sequences from a FASTA file that begin with a common 5' region of 43 nucleotides. Preferably, I would like to allow for some "fuzziness" in this region to account for mutations or read overlaps. The idea is that any sequence beginning with this region, regardless of its length, would be extracted into a new file.

Example of the FASTA file:

>1-69050-454.08
GTACGGGGAAGGACGTCAATAGTC

>2-65989-433.95
AATCTGTAGTACGACTCACTATAGC

>3-62181-408.91
GATCTGTAATACGACTCACTATAGG

>4-49959-328.53
GGGGAAGGACGTCAATAGTCACAC

And what I would like to get from the code is such:

>2-65989-433.95
AATCTGTAGTACGACTCACTATAGC

>3-62181-408.91
GATCTGTAATACGACTCACTATAGG

The FASTA file is quite large, and I have tried a couple of grep and awk methods to retrieve the sequences. Any help you could provide is much appreciated.

sequence next-gen • 779 views

Are you familiar with any scripting language? Perl? Python? Or Linux command-line tools only?

Also: how fuzzy do you want to be?


I have some knowledge of Python and Perl, but most of my experience is purely bash.

Been working a little with Biopython but have yet to find a way to extract utilizing that platform.


If you really want to be fuzzy, you'll probably be best off building a model and screening the sequences with it. I'm thinking of a PWM or an HMM. If the fuzziness is limited (e.g. only a single position or a few specific positions), you could indeed construct a simple regex and screen the sequences with awk or grep.
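For the limited-fuzziness case, a middle ground between a regex and a full PWM/HMM is to count mismatches against the expected 5' region. Below is a minimal Python sketch of that idea, under some assumptions: the function names and the `max_mismatches` threshold are made up for illustration, only substitutions are tolerated (no insertions or deletions), and reads shorter than the reference region are compared over their own length. The real 43-nt region was not posted, so any prefix you pass in is your own.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def starts_with_fuzzy(seq: str, prefix: str, max_mismatches: int) -> bool:
    """Return True if seq begins with prefix, allowing a few substitutions.

    Assumption: a read shorter than the prefix is compared only over
    the read's own length.
    """
    n = min(len(seq), len(prefix))
    if n == 0:
        return False
    mismatches = sum(a != b for a, b in zip(seq[:n].upper(), prefix[:n].upper()))
    return mismatches <= max_mismatches


def filter_fasta(lines: Iterable[str], prefix: str,
                 max_mismatches: int = 2) -> List[Tuple[str, str]]:
    """Keep only the records whose sequence starts with the (fuzzy) prefix."""
    return [(h, s) for h, s in read_fasta(lines)
            if starts_with_fuzzy(s, prefix, max_mismatches)]
```

Since a file object iterates line by line, something like `filter_fasta(open("reads.fasta"), my_prefix)` should work on a large file without loading all of it at once (the file name and `my_prefix` are placeholders). If indels matter as well, an edit-distance comparison or the PWM/HMM route is probably the better fit.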
