specific k-mer count in multi fasta file
1
0
Entering edit mode
3.0 years ago
jr5731173 • 0

Hello, I'm trying to count a specific kmer in multifasta files contained in a directory with approximately 65 thousand multifasta files, I have tried with Bioconductor but not it's possible to save all data in memory, and I would like to do this task with a perl or python script. I would like to modify the next script to count the kmer occurrences in the fasta file.

#!/usr/bin/perl -w 
# Searching for motifs 
# Ask the user for the filename of the file containing 
# the protein sequence data, and collect it from the 
keyboard 
print "Please type the filename of the DNA sequence 
data: "; 
$dnafilename = <STDIN>; 
# Remove the newline from the DNA filename 
chomp $dnafilename; 
# open the file, or exit 
unless ( open(DNAFILE, $dnafilename) ) { 
    print "Cannot open file \"$dnafilename\"\n\n"; 
    exit; 
} 
# Read the dna sequence data from the file, and store 
it 
# into the array variable @protein 
@dna = <DNAFILE>; 
# Close the file - we've read all the data into @dna 
now. 
close DNAFILE; 
# Put the DNA sequence data into a single string, as 
it's easier 
# to search for a motif in a string than in an array of 
# lines (what if the motif occurs over a line break?) 
$dna = join( '', @dna); 
# Remove whitespace 
$dna =~ s/\s//g; 
# In a loop, ask the user for a motif, search for the motif, 
# and report if it was found. 
# Exit if no motif is entered. 
do { 
    print "Enter a motif to search for: "; 
    $motif = <STDIN>; 
    # Remove the newline at the end of $motif 
    chomp $motif; 
    # Look for the motif 
    if ( $dna =~ /$motif/ ) { 
        print "I found it!\n\n"; 
    } else { 
        print "I couldn\'t find it.\n\n"; 
    } 
# exit on an empty user input 
} until ( $motif =~ /^\s*$/ ); 
# exit the program 
exit;
sequence • 1.6k views
ADD COMMENT
0
Entering edit mode

Isn't it better to use Jellyfish, KMC, KmerCountExact, or another ready-made solution?

ADD REPLY
2
Entering edit mode
3.0 years ago

You could use kmer-counter and do something like:

$ ./kmer-counter --fasta --k=8 sequences.fa | grep -wF 'ACCCTGGC' > answer.txt

Adjust for your k-length and mer of interest.

If you want to use Python, there's a test script that demonstrates how to render mer-counts from kmer-counter into a Python dictionary, over which you can iterate and access counts directly for your kmer of interest.

ADD COMMENT
0
Entering edit mode

Do you have a conda installer? Thanks

ADD REPLY
1
Entering edit mode

No, but installation is easy:

$ git clone https://github.com/alexpreynolds/kmer-counter.git
$ cd kmer-counter
$ make
$ cp kmer-counter /usr/bin/local
$ kmer-counter --help
...

Hope this is still useful. If you need something via conda, there is jellyfish: https://anaconda.org/conda-forge/jellyfish

ADD REPLY
0
Entering edit mode

Thanks bro... :)

ADD REPLY

Login before adding your answer.

Traffic: 2047 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6