I can extract the reads from a fastq.gz file as follows.
gunzip -c in.fastq.gz | awk '(NR%4==2)' > out.seq
Is there any way to extract only the unique reads and discard any duplicates?
vsearch --derep_fulllength
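A minimal sketch of how that could be used here, assuming you first convert the fastq to FASTA (seqtk is one option; file names are placeholders):
# dereplicate full-length sequences; --sizeout records how many copies of each unique read were seen
seqtk seq -a in.fastq.gz > in.fasta
vsearch --derep_fulllength in.fasta --output unique.fasta --sizeout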
Are you sure you want to do this at the fastq level? (I don't understand why you want to do this at all.) Do you really want to count every sequence with a single-base error as a unique sequence?
The typical approach would be to align your reads to their reference and then use Picard's MarkDuplicates.
But if you really want to get unique sequences in the raw fastq:
zcat my.fastq.gz | awk 'NR%4==2' | awk '!x[$0]++' > unique.txt
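The second awk ('!x[$0]++') prints a line only the first time it is seen. If you would rather keep the whole four-line fastq record of each first occurrence, a rough sketch (assumes well-formed four-line records):
# buffer the header/sequence/plus lines, then print the record on the quality line if the sequence is new
zcat my.fastq.gz | awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0} NR%4==0{if(!seen[s]++) print h"\n"s"\n"p"\n"$0}' > unique.fastq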
Use clumpify.sh from the BBMap suite. You can use the fastq data as is; I suggest you do no other manipulations. See: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files.
You can choose to allow one or more mismatches, and to separate PCR/optical duplicates.
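A rough example of the kind of command (parameter names as in the clumpify.sh help; adjust subs to the number of mismatches you want to tolerate):
# deduplicate directly from gzipped fastq; subs=0 keeps only exact duplicates
clumpify.sh in=in.fastq.gz out=dedup.fastq.gz dedupe=t subs=0
# add optical=t (with a suitable dupedist) to restrict removal to optical duplicates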
How do you define a duplicate read? Same sequence? Same identifier? Same sequence and quality? All of those? How did you end up with duplicate reads?
I'm not sure I understand your question, but you can use Picard's MarkDuplicates (check the manual) to remove duplicate reads!
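Note that MarkDuplicates operates on aligned, coordinate-sorted BAM files rather than raw fastq, so you would align your reads first; a minimal sketch with placeholder file names:
# mark (and here remove) duplicates in an aligned, sorted BAM
picard MarkDuplicates I=aligned.sorted.bam O=dedup.bam M=dup_metrics.txt REMOVE_DUPLICATES=true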