Question

Removing duplicate sequences while extracting the reads from fastq.gz

0

5.1 years ago

abdul.karim • 0

I can extract the reads from fastq.gz file as follows.

 gunzip -c in.fastq.gz | awk '(NR%4==2)' > out.seq

Is there any way to extract only the unique reads and discard any duplicate reads?

genome sequencing sequence • 2.5k views

ADD COMMENT • link updated 5.1 years ago by swbarnes2 14k • written 5.1 years ago by abdul.karim • 0

3

How do you define a duplicate read? Same sequence? Same identifier? Same sequence and quality? All of those? How did you end up with duplicate reads?

ADD REPLY • link 5.1 years ago by WouterDeCoster 47k

1

I'm not sure I understand your question, but you can use Picard's MarkDuplicates (check the manual) to remove duplicate reads.

ADD REPLY • link 5.1 years ago by brunobsouzaa ▴ 830

score 1 · Answer 1 · 2019-09-19

1

5.1 years ago

gb ★ 2.2k

1. Quality trim out.seq
2. Check the length distribution
3. Trim all reads to the same length
4. Use vsearch --derep_fulllength
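
The length check and the fixed-length trim can be sketched with awk alone. This is only an illustration: the file names and the target length of 4 are made up for a toy input, and dropping reads shorter than the target is an assumption rather than something stated above.

```shell
# Toy sequence file (one read per line) so the sketch runs on its own.
printf 'ACGT\nACGTAC\nTTTT\nGG\n' > toy.seq

# Length distribution: one "length count" pair per line, sorted by length.
awk '{ n[length($0)]++ } END { for (len in n) print len, n[len] }' toy.seq |
  sort -n > length_counts.txt

# Trim every read to a fixed length (hypothetical value 4) and
# drop reads that are shorter than that length.
awk -v L=4 'length($0) >= L { print substr($0, 1, L) }' toy.seq > trimmed.seq
```

After trimming, reads that differ only in their tails collapse to the same string, which is what makes full-length dereplication meaningful.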

ADD COMMENT • link 5.1 years ago by gb ★ 2.2k

score 1 · Answer 2 · 2019-09-19

1

5.1 years ago

swbarnes2 14k

Are you sure you want to do this at the fastq level? (I don't understand why you want to do this at all) Do you really want to count every sequence with a one-off error as a unique sequence?

The typical approach is to align your reads to their reference, then use Picard's MarkDuplicates.

But if you really want to get unique sequences in the raw fastq:

zcat my.fastq.gz | awk 'NR%4==2' | awk '!x[$0]++' > unique.txt
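
If the full FASTQ records are wanted rather than bare sequences, the same first-seen idea can be applied per four-line record. A minimal sketch, assuming a well-formed four-line-per-record file; the toy input and the output name unique.fastq.gz are illustrative only.

```shell
# Toy input so the sketch runs on its own; replace with your real file.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' |
  gzip > my.fastq.gz

# Read four lines per record and print the record only the first
# time its sequence line is seen (exact string match).
gunzip -c my.fastq.gz |
  awk '{ id = $0; getline seq; getline plus; getline qual
         if (!seen[seq]++) print id "\n" seq "\n" plus "\n" qual }' |
  gzip > unique.fastq.gz
```

Here @r3 is dropped because its sequence matches @r1. Note that the record kept is simply the first one encountered, not the one with the best quality.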

ADD COMMENT • link 5.1 years ago by swbarnes2 14k

1

A reference may not always be available.

Would that awk solution scale well if one has millions of reads? This is where clumpify comes in handy.

ADD REPLY • link 5.1 years ago by GenoMax 147k

1

I haven't tested it. Its virtue is that you don't have to install any software. It might eat up a lot of memory: since it's not sorting, I guess it remembers every sequence it has seen.
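
If memory is the concern, one alternative that still needs no extra software is sort -u, which (in GNU sort at least) spills to temporary files instead of holding everything in RAM. A sketch on a toy input; the file names are illustrative, and it has not been benchmarked against the awk version.

```shell
# Toy input so the sketch runs on its own; replace with your real file.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' |
  gzip > toy.fastq.gz

# LC_ALL=C makes sort compare raw bytes, which is usually faster
# for plain nucleotide strings than locale-aware comparison.
gunzip -c toy.fastq.gz | awk 'NR%4==2' | LC_ALL=C sort -u > unique_sorted.txt
```

The trade-off is that the output comes back in sorted order rather than in the order the reads first appeared.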

ADD REPLY • link 5.1 years ago by swbarnes2 14k

score 0 · Answer 3 · 2019-09-19

0

5.1 years ago

GenoMax 147k

Use clumpify.sh from the BBMap suite. You can use the fastq data as is; I suggest you do no other manipulations first. See: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

You can choose to allow one or more errors when matching duplicates, and you can handle PCR and optical duplicates separately.

ADD COMMENT • link 5.1 years ago by GenoMax 147k