Question: Software to identify overrepresented k-mers in sequencing data
3.0 years ago, abascalfederico (Spain) wrote:

Hi all,

I need to identify overrepresented k-mers in sequencing data. Ideally, I need k-mers of lengths between 7 and 20 (I am searching for remnants of sequencing adaptors).

Does anyone know of a program that can do this?

Thanks! Federico

Tags: sequencing, k-mer
3.0 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

FastQC, using the -k/--kmers option of its Kmer Content module: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/11%20Kmer%20Content.html

   -k --kmers       Specifies the length of Kmer to look for in the Kmer content
                    module. Specified Kmer length must be between 2 and 10. Default
                    length is 7 if not specified.

Thanks Pierre! That may be helpful, but I would like to be able to search for longer k-mers (up to 20 bp).

Reply written 3.0 years ago by abascalfederico
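
For k-mers longer than FastQC allows, one option is to count them directly. The following is only a minimal sketch, not a replacement for a dedicated k-mer counter: the function names (count_kmers, top_kmers) and the toy reads are made up for illustration, and it assumes the reads fit in memory as plain strings. It stores only the k-mers actually observed, so the map grows with the data rather than with 4^k.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer of length k across an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip windows containing ambiguous bases
                counts[kmer] += 1
    return counts


def top_kmers(reads, k, n=5):
    """Return the n most frequent k-mers together with their raw counts."""
    return count_kmers(reads, k).most_common(n)


# Toy reads, made up for illustration: the first three share a 13-base suffix.
reads = [
    "ACGTTGCAAGATCGGAAGAGC",
    "TTGACCAGATCGGAAGAGC",
    "GGCATCAAGATCGGAAGAGC",
    "ACGTACGTTAGCCATG",
]

for kmer, count in top_kmers(reads, k=12, n=3):
    print(kmer, count)
```

Raw counts are only a first pass: deciding that a k-mer is truly overrepresented would also require comparing its count against what the base composition of the reads would predict, which this sketch does not attempt.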

fastqc is a Perl wrapper script; change the following lines:

if ($kmer_size) {
    unless ($kmer_size =~ /^\d+$/) {
        die "Kmer size '$kmer_size' was not a number";
    }
    #### Raise the upper limit (10) on the next line to the k-mer length you need, and update the error message to match
    if ($kmer_size < 2 or $kmer_size > 10) {
        die "Kmer size must be in the range 2-10";
    }

    push @java_args,"-Dfastqc.kmer_size=$kmer_size";
}

Use at your own risk.

Reply written 3.0 years ago by Pierre Lindenbaum (modified 3.0 years ago)

Minimum at 2 obviously makes sense. Any idea why they hard-coded the maximum at 10?

Reply written 3.0 years ago by igor

Because 10 is not 'too much' for memory: there are potentially 4^10 = 1,048,576 unique keys in the map. With k=20 there would be 4^20 = 1,099,511,627,776 ==> OUT OF MEMORY.

Reply written 3.0 years ago by Pierre Lindenbaum
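
The arithmetic in the reply above can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names are made up, and the 8 bytes per counter used for the memory estimate is an illustrative figure for a dense table with one slot per possible k-mer, not a description of how FastQC stores its counts.

```python
def kmer_key_space(k, alphabet_size=4):
    """Upper bound on the number of distinct k-mers (4 letters: A, C, G, T)."""
    return alphabet_size ** k


def dense_table_bytes(k, bytes_per_counter=8):
    """Memory for one counter per possible k-mer (8 bytes each is an assumption)."""
    return kmer_key_space(k) * bytes_per_counter


for k in (7, 10, 20):
    print(f"k={k}: {kmer_key_space(k):,} possible k-mers, "
          f"{dense_table_bytes(k):,} bytes if every key gets a counter")

# k=7: 16,384 possible k-mers, 131,072 bytes if every key gets a counter
# k=10: 1,048,576 possible k-mers, 8,388,608 bytes if every key gets a counter
# k=20: 1,099,511,627,776 possible k-mers, 8,796,093,022,208 bytes if every key gets a counter
```

These figures are worst-case bounds on the key space; the number of distinct k-mers actually seen in a dataset cannot exceed the number of k-mer positions in the reads.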