Here, try this, this is a full solution. You'll need to have biopython (Bio) and pandas installed via pip or conda for this to work. collections should be available already.
The code:
import pandas as pd
from Bio import SeqIO
from collections import defaultdict
def get_ukmers(seq, k = 30):
    kmers = [seq[x:x+k] for x in range(0, len(seq)-k+1, k)]
    kout = ";".join(str(kmer) for kmer in kmers)
    return(kout)
def extr_nlkmers(inpfa, outfile, k = 30):
    tmpdict = defaultdict(list)
    with open(inpfa) as fas:
        print("Extracting non-overlapping k-mers from file:", str(inpfa))
        for record in SeqIO.parse(fas, "fasta"):
            print("Opening sequence:", record.name)
            tmpdict['header'].append(record.name)
            tmpdict['seq'].append(''.join(str(nuc) for nuc in record.seq))
            tmpdict['nl_kmers'].append(get_ukmers(record.seq, k))
            print("Done handling sequence:", record.name)
        print("All done.")
        fas.close
    df = pd.DataFrame.from_dict(tmpdict)
    print("Data for all sequences written to file:", outfile)
    df.to_csv(outfile)
    return(df)
You can execute it like so:
#Set a path for an input and output file.
#The input file is a FASTA file.
#The output file is a .CSV file
myin = "test.fasta"
myout = "unique_kmers.csv"
#Extracting non-overlapping k-mers and
#writing them to file.
#The data.frame is also retained in the
#environment
df = extr_nlkmers(myin, myout, k = 30)
df
#Output
# Extracting non-overlapping k-mers from file: test.fasta
# Opening sequence: seq0
# Done handling sequence: seq0
# Opening sequence: seq1
# Done handling sequence: seq1
# Opening sequence: seq2
# Done handling sequence: seq2
# Opening sequence: seq3
# Done handling sequence: seq3
# Opening sequence: seq4
# Done handling sequence: seq4
# Opening sequence: seq5
# Done handling sequence: seq5
# Opening sequence: seq6
# Done handling sequence: seq6
# Opening sequence: seq7
# Done handling sequence: seq7
# Opening sequence: seq8
# Done handling sequence: seq8
# Opening sequence: seq9
# Done handling sequence: seq9
# Opening sequence: seq10
# Done handling sequence: seq10
# Opening sequence: seq11
# Done handling sequence: seq11
# Opening sequence: seq12
# Done handling sequence: seq12
# Opening sequence: seq13
# Done handling sequence: seq13
# Opening sequence: seq14
# Done handling sequence: seq14
# Opening sequence: seq15
# Done handling sequence: seq15
# Opening sequence: seq16
# Done handling sequence: seq16
# Opening sequence: seq17
# Done handling sequence: seq17
# Opening sequence: seq18
# Done handling sequence: seq18
# Opening sequence: seq19
# Done handling sequence: seq19
# Opening sequence: seq20
# Done handling sequence: seq20
# Opening sequence: seq21
# Done handling sequence: seq21
# Opening sequence: seq22
# Done handling sequence: seq22
# Opening sequence: seq23
# Done handling sequence: seq23
# Opening sequence: seq24
# Done handling sequence: seq24
# Opening sequence: seq25
# Done handling sequence: seq25
# Opening sequence: seq26
# Done handling sequence: seq26
# Opening sequence: seq27
# Done handling sequence: seq27
# Opening sequence: seq28
# Done handling sequence: seq28
# Opening sequence: seq29
# Done handling sequence: seq29
# Opening sequence: seq30
# Done handling sequence: seq30
# Opening sequence: seq31
# Done handling sequence: seq31
# Opening sequence: seq32
# Done handling sequence: seq32
# Opening sequence: seq33
# Done handling sequence: seq33
# Opening sequence: seq34
# Done handling sequence: seq34
# Opening sequence: seq35
# Done handling sequence: seq35
# All done.
# Data for all sequences written to file: unique_kmers.csv
And you'll also get a CSV table that looks like this:
# header    seq     nl_kmers
# 0     seq0    AAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTC...   AAGGTTTATACCTTCCCAGGTAACAAACCA;ACCAACTTTCGATCT...
# 1     seq1    TGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTAT...   TGTGTGGCTGTCACTCGGCTGCATGCTTAG;TGCACTCACGCAGTA...
# 2     seq2    GAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG...   GAGTAACTCGTCTATCTTCTGCAGGCTGCT;TACGGTTTCGTCCGT...
# 3     seq3    CGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCA...   CGGGTGTGACCGAAAGGTAAGATGGAGAGC;CTTGTCCCTGGTTTC...
# 4     seq4    TTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTG...   TTTACAGGTTCGCGACGTGCTCGTACGTGG;CTTTGGAGACTCCGT...
# 5     seq5    AAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCC...   AAGATGGCACTTGTGGCTTAGTAGAAGTTG;AAAAAGGCGTTTTGC...
# 6     seq6    TCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGG...   TCGGATGCTCGAACTGCACCTCATGGTCAT;GTTATGGTTGAGCTG...
# 7     seq7    TGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCA...   TGGTGAGACACTTGGTGTCCTTGTCCCTCA;TGTGGGCGAAATACC...
# 8     seq8    GTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTC...   GTAATAAAGGAGCTGGTGGCCATAGTTACG;GCGCCGATCTAAAGT...
# 9     seq9    TATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTG...   TATGAAGATTTTCAAGAAAACTGGAACACT;AAACATAGCAGTGGT...
# 10    seq10   GGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGC...   GGCATACACTCGCTATGTCGATAACAACTT;CTGTGGCCCTGATGG...
# 11    seq11   GTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTAT...   GTGCTGGTAAAGCTTCATGCACTTTGTCCG;AACAACTGGACTTTA...
# 12    seq12   CATGAGCATGAAATTGCTTGGTACACGGAACGTTCTGAAAAGAGCT...   CATGAGCATGAAATTGCTTGGTACACGGAA;CGTTCTGAAAAGAGC...
# 13    seq13   AAAGAAATTTGACATCTTCAATGGGGAATGTCCAAATTTTGTATTT...   AAAGAAATTTGACATCTTCAATGGGGAATG;TCCAAATTTTGTATT...
# 14    seq14   GGGTTGAAAAGAAAAAGCTTGATGGCTTTATGGGTAGAATTCGATC...   GGGTTGAAAAGAAAAAGCTTGATGGCTTTA;TGGGTAGAATTCGAT...
# 15    seq15   CAAATGTGCCTTTCAACTCTCATGAAGTGTGATCATTGTGGTGAAA...   CAAATGTGCCTTTCAACTCTCATGAAGTGT;GATCATTGTGGTGAA...
# 16    seq16   TTGCGAATTTTGTGGCACTGAGAATTTGACTAAAGAAGGTGCCACT...   TTGCGAATTTTGTGGCACTGAGAATTTGAC;TAAAGAAGGTGCCAC...
# 17    seq17   AAATTTATTGTCCAGCATGTCACAATTCAGAAGTAGGACCTGAGCA...   AAATTTATTGTCCAGCATGTCACAATTCAG;AAGTAGGACCTGAGC...
# 18    seq18   AAAACCATTCTTCGTAAGGGTGGTCGCACTATTGCCTTTGGAGGCT...   AAAACCATTCTTCGTAAGGGTGGTCGCACT;ATTGCCTTTGGAGGC...
# 19    seq19   TGCCTATTGGGTTCCACGTGCTAGCGCTAACATAGGTTGTAACCAT...   TGCCTATTGGGTTCCACGTGCTAGCGCTAA;CATAGGTTGTAACCA...
# 20    seq20   ATGACAACCTTCTTGAAATACTCCAAAAAGAGAAAGTCAACATCAA...   ATGACAACCTTCTTGAAATACTCCAAAAAG;AGAAAGTCAACATCA...
# 21    seq21   GCCATTATTTTGGCATCTTTTTCTGCTTCCACAAGTGCTTTTGTGG...   GCCATTATTTTGGCATCTTTTTCTGCTTCC;ACAAGTGCTTTTGTG...
# 22    seq22   ACAAATTGTTGAATCCTGTGGTAATTTTAAAGTTACAAAAGGAAAA...   ACAAATTGTTGAATCCTGTGGTAATTTTAA;AGTTACAAAAGGAAA...
# 23    seq23   AATCAATACTGAGTCCTCTTTATGCATTTGCATCAGAGGCTGCTCG...   AATCAATACTGAGTCCTCTTTATGCATTTG;CATCAGAGGCTGCTC...
# 24    seq24   ACTGCTCAAAATTCTGTGCGTGTTTTACAGAAGGCCGCTATAACAA...   ACTGCTCAAAATTCTGTGCGTGTTTTACAG;AAGGCCGCTATAACA...
# 25    seq25   CATTGATGCTATGATGTTCACATCTGATTTGGCTACTAACAATCTA...   CATTGATGCTATGATGTTCACATCTGATTT;GGCTACTAACAATCT...
# 26    seq26   AGTTGACTTCGCAGTGGCTAACTAACATCTTTGGCACTGTTTATGA...   AGTTGACTTCGCAGTGGCTAACTAACATCT;TTGGCACTGTTTATG...
# 27    seq27   AAGTTTAAGGAAGGTGTAGAGTTTCTTAGAGACGGTTGGGAAATTG...   AAGTTTAAGGAAGGTGTAGAGTTTCTTAGA;GACGGTTGGGAAATT...
# 28    seq28   CGGTGGACAAATTGTCACCTGTGCAAAGGAAATTAAGGAGAGTGTT...   CGGTGGACAAATTGTCACCTGTGCAAAGGA;AATTAAGGAGAGTGT...
# 29    seq29   CTTTGTGTGCTGACTCTATCATTATTGGTGGAGCTAAACTTAAAGC...   CTTTGTGTGCTGACTCTATCATTATTGGTG;GAGCTAAACTTAAAG...
# 30    seq30   AAGGGATTGTACAGAAAGTGTGTTAAATCCAGAGAAGAAACTGGCC...   AAGGGATTGTACAGAAAGTGTGTTAAATCC;AGAGAAGAAACTGGC...
# 31    seq31   CTTCTTAGAGGGAGAAACACTTCCCACAGAAGTGTTAACAGAGGAA...   CTTCTTAGAGGGAGAAACACTTCCCACAGA;AGTGTTAACAGAGGA...
# 32    seq32   AACAACCTACTAGTGAAGCTGTTGAAGCTCCATTGGTTGGTACACC...   AACAACCTACTAGTGAAGCTGTTGAAGCTC;CATTGGTTGGTACAC...
# 33    seq33   AAAGACACAGAAAAGTACTGTGCCCTTGCACCTAATATGATGGTAA...   AAAGACACAGAAAAGTACTGTGCCCTTGCA;CCTAATATGATGGTA...
# 34    seq34   AACAAAGGTTACTTTTGGTGATGACACTGTGATAGAAGTGCAAGGT...   AACAAAGGTTACTTTTGGTGATGACACTGT;GATAGAAGTGCAAGG...
# 35    seq35   AAAGGATTGATAAAGTACTTAATGAGAAGTGCTCTGCCTATACAGT...   AAAGGATTGATAAAGTACTTAATGAGAAGT;GCTCTGCCTATACAG...
And here, if what you're worried about are sets of k-mers like AAGGTTTATACCTTCCCAGGTAACAAACCA;ACCAACTTTCGATCTCTTGTAGATCTGTTC (from the very first sequence), these aren't overlapping k-mers. It so happens that the parent sequence has a repeat like so:
"AAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATC"
You're confusing overlaps in the k-mer sequence with overlapping k-mers (overlapping k-mers are generated via a sliding window approach).
                    
                
                 
Andrzej Zielezinski it is still giving overlapping. I want 30-mer non-overlapping.
Can you share your output please? That solution above should absolutely not give you overlapping k-mers.
I am getting overlapping k-mers: AAGGTTTATACCTTCCCAGGTAACAAACCA AGGTTTATACCTTCCCAGGTAACAAACCAA But I dont need AAGGTTTATACCTTCCCAGGTAACAAACCA and after this more 30-mers
As Dunois pointed, you won't get overlapping kmers if you use the suggestion above. Here's an example:
please take this sequence: AAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATC TGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACAC GAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTTGTC CGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGT TTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTA AAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGT TCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAG TGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACG GTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCT TATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGG GGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCAC GTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAA CATGAGCATGAAATTGCTTGGTACACGGAACGTTCTGAAAAGAGCTATGAATTGCAGACACCTTTTGAAATTAAATTGGC AAAGAAATTTGACATCTTCAATGGGGAATGTCCAAATTTTGTATTTCCCTTAAATTCCATAATCAAGACTATTCAACCAA GGGTTGAAAAGAAAAAGCTTGATGGCTTTATGGGTAGAATTCGATCTGTCTATCCAGTTGCGTCACCAAATGAATGCAAC CAAATGTGCCTTTCAACTCTCATGAAGTGTGATCATTGTGGTGAAACTTCATGGCAGACGGGCGATTTTGTTAAAGCCAC TTGCGAATTTTGTGGCACTGAGAATTTGACTAAAGAAGGTGCCACTACTTGTGGTTACTTACCCCAAAATGCTGTTGTTA AAATTTATTGTCCAGCATGTCACAATTCAGAAGTAGGACCTGAGCATAGTCTTGCCGAATACCATAATGAATCTGGCTTG AAAACCATTCTTCGTAAGGGTGGTCGCACTATTGCCTTTGGAGGCTGTGTGTTCTCTTATGTTGGTTGCCATAACAAGTG TGCCTATTGGGTTCCACGTGCTAGCGCTAACATAGGTTGTAACCATACAGGTGTTGTTGGAGAAGGTTCCGAAGGTCTTA ATGACAACCTTCTTGAAATACTCCAAAAAGAGAAAGTCAACATCAATATTGTTGGTGACTTTAAACTTAATGAAGAGATC GCCATTATTTTGGCATCTTTTTCTGCTTCCACAAGTGCTTTTGTGGAAACTGTGAAAGGTTTGGATTATAAAGCATTCAA ACAAATTGTTGAATCCTGTGGTAATTTTAAAGTTACAAAAGGAAAAGCTAAAAAAGGTGCCTGGAATATTGGTGAACAGA AATCAATACTGAGTCCTCTTTATGCATTTGCATCAGAGGCTGCTCGTGTTGTACGATCAATTTTCTCCCGCACTCTTGAA ACTGCTCAAAATTCTGTGCGTGTTTTACAGAAGGCCGCTATAACAATACTAGATGGAATTTCACAGTATTCACTGAGACT CATTGATGCTATGATGTTCACATCTGATTTGGCTACTAACAATCTAGTTGTAATGGCCTACATTACAGGTGGTGTTGTTC AGTTGACTTCGCAGTGGCTAACTAACATCTTTGGCACTGTTTATGAAAAACTCAAACCCGTCCTTGATTGGCTTGAAGAG AAGTTTAAGGAAGGTGTAGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAATTGT CGGTGGACAAATTGTCACCTGTGCAAAGGAAATTAAGGAGAGTGTTCAGACATTCTTTAAGCTTGTAAATAAATTTTTGG CTTTGTGTGCTGACTCTATCATTATTGGTGGAGCTAAACTTAAAGCCTTGAATTTAGGTGAAACATTTGTCACGCACTCA AAGGGATTGTACAGAAAGTGTGTTAAATCCAGAGAAGAAACTGGCCTACTCATGCCTCTAAAAGCCCCAAAAGAAATTAT CTTCTTAGAGGGAGAAACACTTCCCACAGAAGTGTTAACAGAGGAAGTTGTCTTGAAAACTGGTGATTTACAACCATTAG AACAACCTACTAGTGAAGCTGTTGAAGCTCCATTGGTTGGTACACCAGTTTGTATTAACGGGCTTATGTTGCTCGAAATC AAAGACACAGAAAAGTACTGTGCCCTTGCACCTAATATGATGGTAACAAACAATACCTTCACACTCAAAGGCGGTGCACC AACAAAGGTTACTTTTGGTGATGACACTGTGATAGAAGTGCAAGGTTACAAGAGTGTGAATATCACTTTTGAACTTGATG AAAGGATTGATAAAGTACTTAATGAGAAGTGCTCTGCCTATACAGTTGAACTCGGTACAGAAGTAAATGAGTTCGCCTGT
smrutimayipanda :
Please use ADD REPLY when responding to existing posts. ADD ANSWERS is meant for new answers to the original question.
Do not post follow-up/additional material as new answers.
above is the sequence in text. Please try with this
Why does your sequence have whitespaces in it? Regardless, running this through
Would produce this:
Where exactly are the overlapping
k-mersin there?Dunois you are running only one command on these sequences. Can you please store all these sequences and then run this command? these is a big fasta file, so I have to run on that
Traceback (most recent call last): File "test.py", line 30, in <module> kmer = [seq[x:x+k] for x in range(0, len(seq)-k+1, k)] TypeError: object of type 'float' has no len()
I am getting this error when I used -k+1 in len(seq). Thats why I am saying run the full script
Why are you using
pandasto read a simple text file? This is likely where your issues are coming in from - not the kmer calculation.then how can i read the fasta file? can you please tell me?
You should use
biopython, but you can use any standard line-by-line python file readon method as long as you handle the headers etc properly.