Question: specific k-mer count in multi fasta file
gravatar for jr5731173
4 months ago by
jr57311730 wrote:

Hello, I'm trying to count a specific kmer in multifasta files contained in a directory with approximately 65 thousand multifasta files, I have tried with Bioconductor but not it's possible to save all data in memory, and I would like to do this task with a perl or python script. I would like to modify the next script to count the kmer occurrences in the fasta file.

#!/usr/bin/perl -w 
# Searching for motifs 
# Ask the user for the filename of the file containing 
# the protein sequence data, and collect it from the 
print "Please type the filename of the DNA sequence 
data: "; 
$dnafilename = <STDIN>; 
# Remove the newline from the DNA filename 
chomp $dnafilename; 
# open the file, or exit 
unless ( open(DNAFILE, $dnafilename) ) { 
    print "Cannot open file \"$dnafilename\"\n\n"; 
# Read the dna sequence data from the file, and store 
# into the array variable @protein 
@dna = <DNAFILE>; 
# Close the file - we've read all the data into @dna 
close DNAFILE; 
# Put the DNA sequence data into a single string, as 
it's easier 
# to search for a motif in a string than in an array of 
# lines (what if the motif occurs over a line break?) 
$dna = join( '', @dna); 
# Remove whitespace 
$dna =~ s/\s//g; 
# In a loop, ask the user for a motif, search for the motif, 
# and report if it was found. 
# Exit if no motif is entered. 
do { 
    print "Enter a motif to search for: "; 
    $motif = <STDIN>; 
    # Remove the newline at the end of $motif 
    chomp $motif; 
    # Look for the motif 
    if ( $dna =~ /$motif/ ) { 
        print "I found it!\n\n"; 
    } else { 
        print "I couldn\'t find it.\n\n"; 
# exit on an empty user input 
} until ( $motif =~ /^\s*$/ ); 
# exit the program 
sequence • 280 views
ADD COMMENTlink modified 4 months ago by Alex Reynolds26k • written 4 months ago by jr57311730

Isn't it better to use Jellyfish, KMC, KmerCountExact, or another ready-made solution?

ADD REPLYlink modified 4 months ago • written 4 months ago by h.mon21k
gravatar for Alex Reynolds
4 months ago by
Alex Reynolds26k
Seattle, WA USA
Alex Reynolds26k wrote:

You could use kmer-counter and do something like:

$ ./kmer-counter --fasta --k=8 sequences.fa | grep -wF 'ACCCTGGC' > answer.txt

Adjust for your k-length and mer of interest.

If you want to use Python, there's a test script that demonstrates how to render mer-counts from kmer-counter into a Python dictionary, over which you can iterate and access counts directly for your kmer of interest.

ADD COMMENTlink modified 4 months ago • written 4 months ago by Alex Reynolds26k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 547 users visited in the last hour