Question: Extract 6mer and headers from fasta and compile into list
gravatar for omansn
5.5 years ago by
United States
omansn10 wrote:

I'm hoping you all can help me with a problem I have. 

I have a fasta file of ~1300 genes, and I'm looking for all instances of 7 different 6nt sequences. I'm hoping to use my list of 7seven 6mers to query the fasta and extract the matches sequence (+/- about 7nt around it) as well as the match's respective fasta header.

This seems like something that would be really easy to do for someone who is good at programming. Unfortunately, i'm not. I've tried a couple different things— First I tried using grep, but the inflexibility of grep made it difficult to compile the data, especially keeping it attached to the FASTA header. 

Next, I tried the following perl script (remember, I can't program) and I get errors each time. 

use strict;
use Bio::Seq;
use Bio::DB::Fasta;

my $fastaFile = shift;
my $queryFile = shift;

my $db = Bio::DB::Fasta->new( $fastaFile );
open (IN, $queryFile);
while (<IN>){
    $seq = $_;
    my $sequence = $db->seq($seq);
    if  (!defined( $sequence )) {
            die "Sequence $seq not found. \n"
    print ">$seq\n", "$sequence\n";

This gives me the output of :

    Global symbol "$seq" requires explicit package name at line X.

for every line that contains $seq. 

Now, i've tried adjusting my PATH to contain all bioperl modules (I imagine, $seq is contained in one) to no avail.

Can someone help me out here? I'm open to new ideas of how to generate my list, adjustments to the perl script, or reappropriation of other tools. 

Thank you!

Stressed and under pressure from the boss :)

sequence gene • 1.8k views
ADD COMMENTlink modified 5.5 years ago by zhuwei.cug30 • written 5.5 years ago by omansn10
gravatar for Dan D
5.5 years ago by
Dan D7.1k
Dan D7.1k wrote:

I don't use Perl anymore, but here's something I wrote in Python. It's a 10-minute effort, but seems to work fine. Put your target sequences in a text file, one target sequence per line. Call the program like this:

python fasta_target_extractor target_seqs.txt my_genome.fasta

For each match it will print the header of the matching FASTA contig, and the target sequence match plus up to 7 bases from each side of the matched target sequence.

from sys import argv

#run the application like this:
#python target_seqs.txt fasta_file.fasta

#The target_seqs list holds the 6-mers you're querying on
target_seqs = []
with open(argv[1]) as tsf:
    for read in tsf:

current_header = ''
with open(argv[2]) as faf:
    for read in faf:
        if read.startswith('>'):
            current_header = read.strip()
        for ts in target_seqs:
            ts_index = read.find(ts)
            if ts_index != -1:
                r = read.strip()
                start = 0 if ts_index - 7 < 0 else ts_index - 7
                end = len(r) - 1 if ts_index + len(ts) + 7 >= len(r) else ts_index + len(ts) + 7
                print current_header
                print r[start:end]
ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by Dan D7.1k

You're so nice! I'll run it tomorrow and let you know how it goes! Thanks Dan.
In general, would you recommend that I focus on Python instead of Perl for bioinformatics? While Python seems to be taking over, a lot of the tools out there are in Perl.


ADD REPLYlink written 5.5 years ago by omansn10

Uh oh, you opened that door. I think it's largely a matter of personal preference. The Python vs. Perl debate leaves lots of bodies in its wake. Here are some of the best biostars threads about it:

When To Choose Python Over Perl (And Vice Versa)?

I want to re-open the old debate: python or perl ?

ADD REPLYlink written 5.5 years ago by Dan D7.1k

Ahah! it worked! I am so grateful for your help, Dan. You have no idea how relieved I am! 

ADD REPLYlink written 5.5 years ago by omansn10
gravatar for Brian Bushnell
5.5 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

I don't have an exact solution, but you can extract the sequences with BBDuk like this: in=data.fasta out=filtered.fasta k=6 mm=f rcomp=f literal=AAAAAA,AAATTT,CCCGGG

It's also possible to change the matching parts of the sequences to lowercase, which may be helpful, with the flag "kmask=lc".  You can mask an additional 7 bases around each matching kmer with the flag "trimpad=7".

ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by Brian Bushnell17k
gravatar for zhuwei.cug
5.5 years ago by
zhuwei.cug30 wrote:

since you have enable the 'strict' syntax mode via 'use strict', you have to declare every variable 

such as:

$seq="ATGC" is wrong, it should be  my $seq="ATGC"

so in your script, change

$seq = $_;  
my $seq = $_;

will eliminate the errors

ADD COMMENTlink written 5.5 years ago by zhuwei.cug30
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1064 users visited in the last hour