Question: removing duplicate sequences while extracting the reads from fastq.gz
abdul.karim wrote, 12 months ago:

I can extract the reads from fastq.gz file as follows.

 gunzip -c in.fastq.gz | awk '(NR%4==2)' > out.seq

Is there any way to extract only the unique reads and discard any duplicate reads?
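For reference, a FASTQ record spans four lines (header, sequence, separator, quality), which is why `NR%4==2` picks out exactly the sequence lines. A quick check on a toy two-record input, with `printf` standing in for `gunzip -c in.fastq.gz`:

```shell
# Two 4-line FASTQ records on stdin; keep only the sequence lines.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGG\n+\nIIII\n' | awk 'NR%4==2'
# prints:
# ACGT
# TTGG
```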

Tags: sequencing, sequence, genome
modified 12 months ago by swbarnes2 • written 12 months ago by abdul.karim
How do you define a duplicate read? Same sequence? Same identifier? Same sequence and quality? All of those? How did you end up with duplicate reads?

modified 12 months ago • written 12 months ago by WouterDeCoster
I'm not sure I understand your question, but you can use Picard's MarkDuplicates (see the manual) to remove duplicate reads.

written 12 months ago by brunobsouzaa
gb wrote, 12 months ago:
  1. Quality-trim out.seq.
  2. Check the length distribution.
  3. Trim all reads to the same length.
  4. Use vsearch --derep_fulllength.
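If vsearch isn't at hand, steps 3–4 can be approximated in plain shell, trimming with awk and letting `sort -u` play the role of `--derep_fulllength`. A minimal sketch (the 6 bp target length is arbitrary, for illustration only):

```shell
# Trim every read to the first 6 bases, then keep one copy of each
# resulting sequence. All three toy reads share the same 6 bp prefix.
printf 'ACGTACGT\nACGTAC\nACGTACGTTT\n' | awk '{ print substr($0, 1, 6) }' | sort -u
# prints:
# ACGTAC
```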
written 12 months ago by gb
swbarnes2 (United States) wrote, 12 months ago:

Are you sure you want to do this at the FASTQ level? (I don't understand why you want to do this at all.) Do you really want to count every sequence with a one-off error as a unique sequence?

The typical approach would be to align your reads to their reference, then use Picard's MarkDuplicates.

But if you really want to get unique sequences in the raw fastq:

zcat my.fastq.gz | awk 'NR%4==2' | awk '!x[$0]++' > unique.txt
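The `!x[$0]++` idiom prints a line only the first time it is seen (the array lookup is false before the post-increment), so the first occurrence of each sequence survives in its original order. A toy run with a duplicated read, `printf` standing in for `zcat`:

```shell
# Three records; r1 and r2 carry the same sequence.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTTT\n+\nIIII\n' \
  | awk 'NR%4==2' | awk '!x[$0]++'
# prints:
# ACGT
# TTTT
```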
written 12 months ago by swbarnes2
A reference may not always be available.

Would that awk solution scale well if one has millions of reads? This is where clumpify comes in handy.

written 12 months ago by genomax
I haven't tested it. Its virtue is that you don't have to install any software. It might eat up a lot of memory, though; since it's not sorting, I assume it keeps every sequence it has seen in memory.
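If memory does become a problem, replacing the awk hash with `sort -u` trades RAM for disk: GNU sort spills to temporary files and merges them, at the cost of losing the original read order. A sketch on toy input:

```shell
# Deduplicate via external sorting rather than an in-memory array.
# Output is sorted, not in input order.
printf 'ACGT\nACGT\nTTTT\n' | sort -u
# prints:
# ACGT
# TTTT
```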

written 12 months ago by swbarnes2
genomax (United States) wrote, 12 months ago:

Use clumpify.sh from the BBMap suite. You can use the fastq data as is; I suggest you do no other manipulations first. See: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

You can choose to allow one or more errors when calling duplicates, and it can handle PCR and optical duplicates separately.

written 12 months ago by genomax
Powered by Biostar version 2.3.0