Question

Binning Illumina reads by subread?

0

7.0 years ago

stacy734 ▴ 40

Hi everyone,

I have a very large file of genomic sequences and need to bin them by the presence or absence of a specific 20-mer (everything with the 20-mer in one file, everything without it in another). I tried using grep, but the sequences are multi-line.

Any suggestions will be appreciated.

Stacy

next-gen illumina binning fasta • 1.4k views

ADD COMMENT • link 7.0 years ago by stacy734 ▴ 40

2

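Your data is in FASTQ, I assume? If so, grep -A2 -B1 prints the header line before each matching sequence line and the two lines after it. Note that GNU grep also prints a "--" separator between non-adjacent groups of context lines, which has to be removed to keep the output valid FASTQ.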
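The grep approach only yields the matching records and can also hit a header or quality line. A minimal sketch that tests only the sequence line and writes both bins, assuming strict four-line FASTQ records (no wrapped lines) and forward-strand matching only; bin_fastq is a hypothetical helper name, not part of any tool:

```shell
# Sketch: split a FASTQ file by whether the sequence line contains a k-mer.
# Arguments: kmer, input FASTQ, output for matches, output for non-matches.
# Assumes strict 4-line records; bin_fastq is a hypothetical helper name.
bin_fastq() {
    # Pre-create both outputs so an empty bin still produces a file.
    : > "$3"
    : > "$4"
    awk -v kmer="$1" -v m="$3" -v u="$4" '
        NR % 4 == 1 { head = $0 }
        NR % 4 == 2 { seq  = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            # index() is a literal substring search, applied to the sequence only
            out = (index(toupper(seq), toupper(kmer)) > 0) ? m : u
            printf "%s\n%s\n%s\n%s\n", head, seq, plus, $0 > out
        }
    ' "$2"
}

# Example call with a placeholder 20-mer:
# bin_fastq ACGTACGTACGTACGTACGT reads.fq matched.fq unmatched.fq
```

This does not look for the reverse complement of the 20-mer, so reads carrying it on the opposite strand end up in the unmatched file.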

ADD REPLY • link 7.0 years ago by WouterDeCoster 47k

2

Try bbduk.sh from the BBMap suite. If the sequences are in FASTA format, it should still work.

bbduk.sh -Xmx1g in=reads.fq out=unmatched.fq outm=matched.fq literal=your_20_mer_sequence k=20

If the reads are paired-end:

bbduk.sh -Xmx1g in1=r1.fq.gz in2=r2.fq.gz out1=unmatched1.fq.gz out2=unmatched2.fq.gz outm1=matched1.fq.gz outm2=matched2.fq.gz literal=your_20_mer_sequence k=20

ADD REPLY • link 7.0 years ago by GenoMax 141k

0

Thanks very much!

Science marches on...

Stacy

ADD REPLY • link 7.0 years ago by stacy734 ▴ 40

score 2 · Answer 1 · 2017-04-30

BBDuk will do what you want (and possibly more):

bbduk.sh k=20 in=genomic.fasta out=without_kmer.fasta outm=with_kmer.fasta literal=ATCGATCGATCGATCGATCG

or, with the 20-mer stored in a FASTA file:

bbduk.sh k=20 in=genomic.fasta out=without_kmer.fasta outm=with_kmer.fasta ref=kmer.fasta

To allow for one mismatch:

bbduk.sh k=20 hdist=1 in=genomic.fasta out=without_kmer.fasta outm=with_kmer.fasta ref=kmer.fasta
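For a quick cross-check without BBTools, the multi-line problem from the question can also be handled by joining each record's sequence lines before searching. A rough sketch using awk, assuming exact forward-strand matching only (BBDuk by default also looks for the reverse complement); bin_fasta is a hypothetical helper name, and the output is written unwrapped, one sequence per line:

```shell
# Sketch: split a (possibly multi-line) FASTA file by presence of a k-mer.
# Arguments: kmer, input FASTA, output with the k-mer, output without it.
# bin_fasta is a hypothetical helper name; matching is exact and forward-strand only.
bin_fasta() {
    # Pre-create both outputs so an empty bin still produces a file.
    : > "$3"
    : > "$4"
    awk -v kmer="$1" -v m="$3" -v u="$4" '
        # Write the record collected so far to the appropriate bin.
        function emit(    out) {
            if (head == "") return
            out = (index(toupper(seq), toupper(kmer)) > 0) ? m : u
            printf "%s\n%s\n", head, seq > out
        }
        /^>/ { emit(); head = $0; seq = ""; next }   # new record starts
             { seq = seq $0 }                        # join wrapped sequence lines
        END  { emit() }                              # flush the last record
    ' "$2"
}

# Example call with a placeholder 20-mer:
# bin_fasta ATCGATCGATCGATCGATCG genomic.fasta with_kmer.fasta without_kmer.fasta
```

Because the sequence lines are concatenated first, a 20-mer that spans a line break in the input is still found.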