Question: removing duplicate sequences while extracting the reads from fastq.gz
abdul.karim wrote, 4 weeks ago:

I can extract the reads from a fastq.gz file as follows.

 gunzip -c in.fastq.gz | awk '(NR%4==2)' > out.seq

Is there any way to extract only the unique reads and discard the duplicates?

Tags: sequencing, sequence, genome

How do you define a duplicate read? Same sequence? Same identifier? Same sequence and quality? All of those? How did you end up with duplicate reads?

— WouterDeCoster
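The choice of key changes the result. Here is a sketch that keeps the first full FASTQ record for each distinct sequence (the sample reads are invented; r2 duplicates r1's sequence under different qualities and is dropped):

```shell
# Deduplicate full FASTQ records keyed on the sequence line only.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nFFFF\n@r3\nTTTT\n+\nIIII\n' |
awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
     NR%4==0{if (!seen[s]++) printf "%s\n%s\n%s\n%s\n", h, s, p, $0}'
```

To treat "same sequence and quality" as the duplicate criterion instead, index the array with `seen[s SUBSEP $0]` in the last block.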

I'm not sure I understand your question, but you can use Picard's MarkDuplicates (see the manual) to remove duplicate reads!

— brunobsouzaa
gb wrote, 4 weeks ago:
  1. Quality-trim out.seq
  2. Check the length distribution
  3. Trim all reads to the same length
  4. Use vsearch --derep_fulllength
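Step 4 might look like the following sketch. Note that, per the vsearch manual, `--derep_fulllength` expects FASTA input (newer versions also offer `--fastx_uniques` for FASTQ); the file names here are placeholders:

```shell
# Dereplicate identical full-length sequences after trimming.
# --sizeout annotates each unique sequence with its copy number.
vsearch --derep_fulllength trimmed.fasta --output unique.fasta --sizeout
```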
swbarnes2 wrote, 4 weeks ago:

Are you sure you want to do this at the fastq level? (I don't understand why you want to do this at all.) Do you really want to count every sequence with a single-base error as a unique sequence?

The typical approach would be to align your reads to their reference, then use Picard MarkDuplicates.
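A typical MarkDuplicates invocation might look like this (a sketch; the file names are placeholders, and newer Picard versions also accept `--INPUT`-style arguments):

```shell
# Mark (or remove) duplicates on a coordinate-sorted BAM after alignment.
java -jar picard.jar MarkDuplicates \
    I=aligned.sorted.bam \
    O=marked.bam \
    M=dup_metrics.txt \
    REMOVE_DUPLICATES=true  # omit this to keep duplicates flagged but present
```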

But if you really want to get unique sequences in the raw fastq:

zcat my.fastq.gz | awk 'NR%4==2' | awk '!x[$0]++' > unique.txt

A reference may not always be available.

Would that awk solution scale well if one has millions of reads? This is where clumpify comes in handy.

— genomax

I haven't tested it. Its virtue is that you don't need to install any software. It might use a lot of memory, though; since it isn't sorting, I assume it keeps every sequence it has seen in memory.

— swbarnes2
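If memory does become a limit, a sort-based variant is one option: sort(1) spills to temporary files on disk, so it scales beyond RAM at the cost of I/O (a sketch, reusing the file name from the answer above):

```shell
# Unique sequences via external sort; LC_ALL=C gives a faster byte-wise sort.
zcat my.fastq.gz | awk 'NR%4==2' | LC_ALL=C sort -u > unique.txt

# Or count how often each sequence occurs, most frequent first:
zcat my.fastq.gz | awk 'NR%4==2' | LC_ALL=C sort | uniq -c | sort -rn > counts.txt
```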
genomax wrote, 4 weeks ago:

Use clumpify.sh from the BBMap suite. You can use the fastq data as is; I suggest no other manipulations. See: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

You can choose to allow one or more errors, and it can treat PCR and optical duplicates separately.

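A sketch of such an invocation (parameter names as documented in the BBMap clumpify.sh help text; the values and file names here are illustrative):

```shell
# Remove duplicates directly from gzipped FASTQ, allowing up to 1 mismatch.
clumpify.sh in=reads.fastq.gz out=deduped.fastq.gz dedupe subs=1

# Remove optical duplicates only; dupedist depends on the sequencing platform.
clumpify.sh in=reads.fastq.gz out=nooptical.fastq.gz dedupe optical dupedist=40
```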
Powered by Biostar version 2.3.0