Question: grepping unique reads from a FASTQ file
2.9 years ago by
United States
sumithrasank7590 wrote:

I have a FASTQ file with reads, but it contains duplicates. Can you tell me how I can get the unique entries in the 4-row FASTQ format?

sequencing next-gen • 1.5k views
ADD COMMENTlink modified 2.9 years ago by Brian Bushnell15k • written 2.9 years ago by sumithrasank7590

You might want to clarify what you mean by "duplicate" in this case. Do you mean that they have the same sequence or that you have a single read from the machine duplicated multiple times?

BTW, the former situation is addressed by RAM's answer, the latter by Pierre's.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by Devon Ryan79k
2.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum107k wrote:
gunzip -c in.fq.gz | paste - - - - | LC_ALL=C sort -t '\t' -k2,2 -u | tr "\t" "\n" | gzip > out.fq.gz
ADD COMMENTlink written 2.9 years ago by Pierre Lindenbaum107k

When I run this, I get an error:

sort: multi-character tab `\\t'.

Also, I have a plain .fastq file, not a compressed .fastq.gz.

ADD REPLYlink written 2.9 years ago by sumithrasank7590

Try

cat in.fq | paste - - - - | LC_ALL=C sort -t$'\t' -k2,2 -u | tr "\t" "\n" | gzip > out.fq.gz

Waiting for someone to write UUOC (useless use of cat) :)

ADD REPLYlink written 2.9 years ago by Sukhdeep Singh9.3k
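
For readers following along, here is a minimal sketch of the same pipeline for a plain (uncompressed) FASTQ file, reading the input with a redirect instead of cat and writing plain output. The file names demo.fq and demo.unique.fq and the three sample reads are made up for illustration. The tab passed to sort is produced with printf, which should also work in shells that lack the $'\t' syntax.

```shell
# Tiny made-up input (hypothetical file name): r1 and r2 share a sequence.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTGA\n+\nIIII\n' > demo.fq

# 1. paste joins every 4 lines into one tab-separated record.
# 2. sort -u keeps one record per distinct sequence (field 2).
# 3. tr turns the tabs back into newlines, restoring the 4-line layout.
paste - - - - < demo.fq \
  | LC_ALL=C sort -t "$(printf '\t')" -k2,2 -u \
  | tr '\t' '\n' > demo.unique.fq
```

Note that the output comes out ordered by sequence rather than in the original read order, and that only the sequence is compared, so reads with the same bases but different names or qualities collapse into one record.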

Thanks, this works

ADD REPLYlink written 2.9 years ago by sumithrasank7590

Yeah, I used '\t' to show you it's a tab...

ADD REPLYlink written 2.9 years ago by Pierre Lindenbaum107k

Thanks for clarifying this

ADD REPLYlink written 2.9 years ago by sumithrasank7590
2.9 years ago by
Walnut Creek, USA
Brian Bushnell15k wrote:

With the BBMap package:

dedupe.sh in=reads.fq out=nodupes.fq

The output will contain exactly 1 copy of every unique sequence. It's extremely fast, but may take more memory than other solutions: the amount of memory is proportional to the number of unique reads (rather than, say, the total input size).
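
To illustrate why memory can scale with the number of unique reads, here is a rough sketch of hash-based deduplication using standard tools. This is not how dedupe.sh is implemented. It is only an assumption-laden illustration in which awk keeps one table entry per distinct sequence. The file names and sample reads are made up.

```shell
# Tiny made-up input (hypothetical file name): r1 and r3 share a sequence.
printf '@r1\nTTGA\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTGA\n+\nIIII\n' > demo_order.fq

# awk remembers every sequence (field 2) it has seen in the array "seen"
# and prints a record only the first time its sequence appears, so the
# array grows with the number of distinct sequences, not the input size.
paste - - - - < demo_order.fq \
  | awk -F '\t' '!seen[$2]++' \
  | tr '\t' '\n' > demo_order.unique.fq
```

Unlike a sort-based approach, this keeps the first occurrence of each sequence and preserves the original read order.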

ADD COMMENTlink written 2.9 years ago by Brian Bushnell15k
2.9 years ago by
Ram15k
New York
Ram15k wrote:

PRINSEQ should solve your problem. Check it out here: http://prinseq.sourceforge.net/manual.html#QCDUPLICATION

A bit of digging should get you the command line options for the feature.

ADD COMMENTlink written 2.9 years ago by Ram15k