Question: grepping unique reads from a FASTQ file
2.7 years ago by
United States
sumithrasank75 80 wrote:

I have a FASTQ file with reads, but it contains duplicates. How can I get only the unique entries, keeping the 4-line FASTQ format?

sequencing next-gen • 1.3k views
modified 2.7 years ago by Brian Bushnell 15k • written 2.7 years ago by sumithrasank75 80

You might want to clarify what you mean by "duplicate" in this case. Do you mean that they have the same sequence or that you have a single read from the machine duplicated multiple times?

BTW, the former situation is addressed by RAM's answer, the latter by Pierre's.
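To make the two meanings of "duplicate" concrete, here is a minimal sketch using the paste/sort approach from this thread. It assumes strict 4-line FASTQ records; the file names and the three reads are made up for illustration. Sorting on column 1 (the header line) removes only repeated copies of the same read, while sorting on column 2 (the sequence) removes any read whose sequence was already seen.

```shell
# Made-up input: read1 appears twice (same record repeated),
# read2 is a different read that happens to have the same sequence.
cat > dup_demo.fq <<'EOF'
@read1
ACGT
+
IIII
@read1
ACGT
+
IIII
@read2
ACGT
+
IIII
EOF

# A literal tab character; in bash, $'\t' gives the same thing.
tab=$(printf '\t')

# Key on column 1 (header line): one record per read name.
paste - - - - < dup_demo.fq | LC_ALL=C sort -t "$tab" -k1,1 -u | tr '\t' '\n' > unique_by_id.fq

# Key on column 2 (sequence): one record per distinct sequence.
paste - - - - < dup_demo.fq | LC_ALL=C sort -t "$tab" -k2,2 -u | tr '\t' '\n' > unique_by_seq.fq

echo "records kept by ID:       $(( $(wc -l < unique_by_id.fq) / 4 ))"
echo "records kept by sequence: $(( $(wc -l < unique_by_seq.fq) / 4 ))"
```

On this input the first pipeline keeps two records (read1 and read2) and the second keeps one, which is why it matters which kind of duplicate you mean.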

modified 2.7 years ago • written 2.7 years ago by Devon Ryan 76k
2.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum 104k wrote:
gunzip -c in.fq.gz | paste - - - - | LC_ALL=C sort -t '\t' -k2,2 -u | tr "\t" "\n" | gzip > out.fq.gz

When I run this, I get an error:

sort: multi-character tab `\\t'.

Also, I have a plain .fastq file, not a compressed .fastq.gz.

written 2.7 years ago by sumithrasank75 80

Try

cat in.fq | paste - - - - | LC_ALL=C sort -t$'\t' -k2,2 -u | tr "\t" "\n" | gzip > out.fq.gz

Waiting for someone to write UUOC (useless use of cat) :)
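For anyone who wants to see what each stage does, here is a small sketch of the same pipeline on a made-up three-read file. It assumes strict 4-line records with no wrapped sequence lines. It differs from the command above in three ways: the input is read with a redirect instead of cat, the tab is built with printf so it also works in shells without $'\t', and the output is left uncompressed so it is easy to inspect. The file names in.fq and out.fq are placeholders.

```shell
# Tiny made-up FASTQ file: read1 and read2 share the sequence ACGT.
cat > in.fq <<'EOF'
@read1
ACGT
+
IIII
@read2
ACGT
+
HHHH
@read3
TTTT
+
IIII
EOF

# A literal tab character; in bash, $'\t' gives the same thing.
tab=$(printf '\t')

# 1. paste - - - - joins every 4 lines into one tab-separated row.
# 2. sort -k2,2 -u keeps a single row per distinct sequence (column 2).
# 3. tr converts the tabs back to newlines, restoring 4-line records.
paste - - - - < in.fq \
  | LC_ALL=C sort -t "$tab" -k2,2 -u \
  | tr '\t' '\n' > out.fq

cat out.fq
```

The result holds two records, one for ACGT and one for TTTT. Note that the output is ordered by sequence, not by the original read order, and that the header and quality line of the discarded duplicates are lost.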

written 2.7 years ago by Sukhdeep Singh 9.2k

Thanks, this works

written 2.7 years ago by sumithrasank75 80

Yeah, I used '\t' to show you it's a tab...

written 2.7 years ago by Pierre Lindenbaum 104k

Thanks for clarifying this

written 2.7 years ago by sumithrasank75 80
2.7 years ago by
Walnut Creek, USA
Brian Bushnell 15k wrote:

With the BBMap package:

dedupe.sh in=reads.fq out=nodupes.fq

The output will contain exactly 1 copy of every unique sequence.  It's extremely fast, but may take more memory than other solutions - the amount of memory is proportional to the number of unique reads (rather than, say, the total input size).
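The memory behaviour described above is typical of any approach that remembers each sequence it has already seen. The following is not dedupe.sh and says nothing about how BBMap works internally; it is only a rough sketch of the general idea using awk, assuming strict 4-line records. The file names are borrowed from the command above and the reads are made up. The awk array holds one entry per distinct sequence, so memory grows with the number of unique reads rather than with total input size.

```shell
# Tiny made-up FASTQ file: read1 and read2 share the sequence ACGT.
cat > reads.fq <<'EOF'
@read1
ACGT
+
IIII
@read2
ACGT
+
HHHH
@read3
TTTT
+
IIII
EOF

# paste joins each 4-line record into one tab-separated row.
# awk prints a row only the first time its sequence (column 2) is seen;
# the "seen" array stores one key per unique sequence.
# tr restores the 4-line layout.
paste - - - - < reads.fq \
  | awk -F '\t' '!seen[$2]++' \
  | tr '\t' '\n' > nodupes.fq

cat nodupes.fq
```

Unlike the sort-based pipeline, this keeps the first occurrence of each sequence and preserves the original read order, so here read1 and read3 are retained.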

2.7 years ago by
New York
Ram 13k wrote:

PRINSEQ should solve your problem. Check it out here: http://prinseq.sourceforge.net/manual.html#QCDUPLICATION

A bit of digging should get you the command line options for the feature.
