Question: Finding specific k-mer in human genome
3.4 years ago by
jye10
jye10 wrote:

I want to find every occurrence of a specific 9-mer (GATCGATGC) in the human genome and export the hits to a BED file with the chromosome, start, and end position of each one. Many tools, such as Jellyfish and DSK, only count k-mer occurrences and cannot report where each k-mer is located. Does anybody know how to do this? Any suggestions would be greatly appreciated.

list all coordinates bed k-mer • 2.2k views
modified 3.4 years ago by Alex Reynolds28k • written 3.4 years ago by jye10

Do you mean you just want to search for the string "GATCGATGC" across the genome FASTA and get the coordinates?

written 3.4 years ago by geek_y9.8k

This is probably the best thing to do, because if a read starts with "ATCGATGC" (no G at the beginning), it is probably still relevant to you. It is therefore probably best to find the genomic regions matching GATCGATGC and then count the reads that fall anywhere over those regions, rather than run the much more expensive search for GATCGATGC in the reads themselves (with mismatches, etc.).

written 3.4 years ago by John12k

Yes. That's what I want to do

written 3.4 years ago by jye10
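The plain string search discussed in these comments can be sketched in a few lines of Python. This is only a minimal illustration, not a tool anyone in the thread used: the function names (`revcomp`, `find_all`, `read_fasta`, `kmer_to_bed`) are made up for this example, the reverse-strand search is an added assumption (the question does not say whether both strands matter), and each sequence is held fully in memory, which works for a human chromosome but is not tuned for speed.

```python
import io

def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def find_all(text, pattern):
    """Yield the 0-based start of every occurrence, overlapping ones included."""
    pos = text.find(pattern)
    while pos != -1:
        yield pos
        pos = text.find(pattern, pos + 1)

def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def kmer_to_bed(handle, kmer):
    """Return BED6-style rows (0-based start, half-open end) for each hit."""
    kmer = kmer.upper()
    rc = revcomp(kmer)
    rows = []
    for chrom, seq in read_fasta(handle):
        seq = seq.upper()  # treat soft-masked (lowercase) bases as matches
        hits = [(start, "+") for start in find_all(seq, kmer)]
        if rc != kmer:  # a palindromic k-mer would otherwise be reported twice
            hits += [(start, "-") for start in find_all(seq, rc)]
        for start, strand in sorted(hits):
            rows.append((chrom, start, start + len(kmer), kmer, 0, strand))
    return rows

if __name__ == "__main__":
    # Toy input; for a real run, pass open("genome.fa") instead (hypothetical path).
    toy = io.StringIO(">chr1\nAAGATCGATGCTT\n>chr2\nGCATCGATCAA\n")
    for row in kmer_to_bed(toy, "GATCGATGC"):
        print("\t".join(map(str, row)))
```

BED coordinates are 0-based with an exclusive end, which is why each row uses `start` and `start + len(kmer)` directly.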

Perhaps you can simply use (and edit) one of the AWK commands that I posted in a previous answer: A: Correct statistical test to determine the significance of nucleotides present

modified 11 months ago • written 11 months ago by Kevin Blighe46k
3.4 years ago by
Asaf6.1k
Israel
Asaf6.1k wrote:

EMBOSS has a tool called fuzznuc. You can run it in Galaxy and then convert the output to the desired format. fuzznuc has several output formats, such as table or GFF; one of them should work for you.

written 3.4 years ago by Asaf6.1k
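For the conversion step mentioned in this answer, a minimal Python sketch of turning GFF records into BED rows might look like the following. It assumes the standard tab-separated GFF layout (sequence name in column 1, 1-based inclusive start and end in columns 4 and 5, score in column 6, strand in column 7); the exact contents of fuzznuc's GFF report are not shown in the thread, so the sample line in the usage part is invented, and `gff_to_bed` and the `"kmer"` name field are hypothetical.

```python
def gff_to_bed(lines, name="kmer"):
    """Convert GFF lines to BED6-style rows.

    GFF starts are 1-based and inclusive; BED starts are 0-based, so the
    start is shifted down by one while the end is kept unchanged.
    """
    bed = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a feature line
        chrom, start, end = fields[0], int(fields[3]), int(fields[4])
        score, strand = fields[5], fields[6]
        bed.append((chrom, start - 1, end, name, score, strand))
    return bed

if __name__ == "__main__":
    # Invented example record, only to show the coordinate shift.
    sample = ["##gff-version 3", "chr1\tfuzznuc\tnucleotide_motif\t3\t11\t9\t+\t.\tID=chr1.1"]
    for row in gff_to_bed(sample):
        print("\t".join(map(str, row)))
```

The only non-obvious part is the off-by-one shift on the start coordinate; forgetting it is a common source of errors when mixing GFF and BED files.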

Good to know. I have not come across this before.

written 3.4 years ago by geek_y9.8k

That's a great tool. Solved my problem! Thank you!

written 3.4 years ago by jye10
3.4 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

UCSC BLAT is not ideal for short sequences, but a command-line version of BLAT could be used locally with a small tile size and options -minMatch and -minIdentity to export a PSL file, and from there, a conversion script like psl2bed can be used to get a BED file for downstream set operations.

written 3.4 years ago by Alex Reynolds28k
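To illustrate what the PSL-to-BED step involves, here is a rough Python sketch. It is not the psl2bed script named in the answer and does not reproduce its output layout; it only relies on the standard 21-column PSL format, where the strand is column 9, the query name column 10, and the target name, start, and end are columns 14, 16, and 17. The function name `psl_to_bed` and the choice to put the match count in the BED score field are assumptions made for this example.

```python
def psl_to_bed(lines):
    """Convert PSL alignment lines to BED6-style rows on the target sequence.

    PSL target coordinates are already 0-based and half-open, like BED,
    so they are copied without adjustment.
    """
    bed = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 21 or not fields[0].isdigit():
            continue  # skip the PSL header block and malformed lines
        matches, strand, q_name = int(fields[0]), fields[8], fields[9]
        t_name, t_start, t_end = fields[13], int(fields[15]), int(fields[16])
        bed.append((t_name, t_start, t_end, q_name, matches, strand))
    return bed

if __name__ == "__main__":
    # Invented alignment of a 9-base query to chr1:2-11, for illustration only.
    record = "\t".join(
        ["9", "0", "0", "0", "0", "0", "0", "0", "+", "kmer", "9", "0", "9",
         "chr1", "1000", "2", "11", "1", "9,", "0,", "2,"]
    )
    for row in psl_to_bed(["psLayout version 3", record]):
        print("\t".join(map(str, row)))
```

Filtering on the match count (for example, keeping only rows where it equals the k-mer length) would restrict the output to exact hits.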

Thank you. Good to know.

written 3.4 years ago by jye10