I want to find every occurrence of a specific 9-mer (GATCGATGC) in the human genome and export the matches to a BED file with the chromosome, start, and end position of each hit. Many tools, such as Jellyfish and DSK, only count k-mer occurrences and cannot export k-mer positions. Does anybody know how to do this? Any suggestions would be greatly appreciated.
Do you mean you just want to search for the string "GATCGATGC" across the genome FASTA and get the coordinates?
This is probably the best thing to do, because if a read starts with "ATCGATGC" (no G at the beginning), it is probably still relevant to you. It is therefore probably best to find the genomic regions matching GATCGATGC and then count the reads that fall anywhere over those regions, rather than running the much more expensive search for GATCGATGC in the reads themselves (with mismatches, etc.).
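For what it's worth, the "count reads over the regions" step could be sketched in plain Python roughly as below. This is only an illustration, not a replacement for a dedicated interval tool: the function name `count_overlaps` is made up, and it assumes both the k-mer regions and the read alignments are already available as `(chrom, start, end)` tuples in 0-based, half-open BED-style coordinates.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def count_overlaps(regions, reads):
    """Return, for each region, how many reads overlap it by at least 1 bp.

    Both inputs are iterables of (chrom, start, end) tuples using
    0-based, half-open coordinates (the BED convention).
    """
    starts = defaultdict(list)
    ends = defaultdict(list)
    for chrom, start, end in reads:
        starts[chrom].append(start)
        ends[chrom].append(end)
    for chrom in starts:
        starts[chrom].sort()
        ends[chrom].sort()

    counts = []
    for chrom, start, end in regions:
        # Reads that begin before the region ends ...
        began_before_end = bisect_left(starts.get(chrom, []), end)
        # ... minus reads that already finished at or before the region start.
        finished_before_start = bisect_right(ends.get(chrom, []), start)
        counts.append(began_before_end - finished_before_start)
    return counts
```

The two sorted lists make each lookup logarithmic in the number of reads per chromosome, which matters when there are many regions; a read that merely touches a region boundary (for example, a read ending exactly where the region starts) is not counted.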
Yes, that's what I want to do.
Perhaps you can simply use (and edit) one of the AWK commands that I posted in a previous answer: A: Correct statistical test to determine the significance of nucleotides present
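If AWK is not convenient, a plain string search is another option. The Python below is a minimal sketch rather than a tested pipeline: the function names (`read_fasta`, `find_kmer`, `write_bed`) are made up for illustration, it assumes the genome is an uncompressed FASTA whose chromosomes each fit in memory, and it assumes you also want hits on the reverse strand (reported with `-` in the strand column). Coordinates are 0-based and half-open, as BED expects.

```python
# Translation table for complementing DNA (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def read_fasta(lines):
    """Yield (name, sequence) for each record in an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_kmer(lines, kmer, both_strands=True):
    """Yield BED6 tuples (chrom, start, end, name, score, strand) for each hit."""
    kmer = kmer.upper()
    targets = [(kmer, "+")]
    rc = reverse_complement(kmer)
    if both_strands and rc != kmer:
        targets.append((rc, "-"))
    for chrom, seq in read_fasta(lines):
        seq = seq.upper()  # soft-masked (lower-case) bases are searched too
        hits = []
        for pattern, strand in targets:
            pos = seq.find(pattern)
            while pos != -1:
                hits.append((pos, strand))
                # Advance by one base so overlapping matches are not missed.
                pos = seq.find(pattern, pos + 1)
        for start, strand in sorted(hits):
            yield chrom, start, start + len(kmer), kmer, 0, strand


def write_bed(fasta_path, kmer, bed_path):
    """Write every hit of kmer in the FASTA file to a tab-separated BED file."""
    with open(fasta_path) as fasta, open(bed_path, "w") as bed:
        for record in find_kmer(fasta, kmer):
            bed.write("\t".join(map(str, record)) + "\n")
```

Something like `write_bed("genome.fa", "GATCGATGC", "hits.bed")` would then produce the BED file, where `genome.fa` and `hits.bed` are placeholder file names. Pass `both_strands=False` to `find_kmer` if only forward-strand matches are wanted.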