Question: grepping unique reads from a FASTQ file
0
sumithrasank75 (United States) wrote, 2.5 years ago:

I have a FASTQ file with reads, but there are duplicates. Can you tell me how I can get the unique entries in the 4-line FASTQ format?

sequencing next-gen • 1.2k views
modified 2.5 years ago by Brian Bushnell • written 2.5 years ago by sumithrasank75

You might want to clarify what you mean by "duplicate" in this case. Do you mean that they have the same sequence or that you have a single read from the machine duplicated multiple times?

BTW, the former situation is addressed by Ram's answer, the latter by Pierre's.

modified 2.5 years ago • written 2.5 years ago by Devon Ryan
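
To see which kind of duplication a file actually has, counting both kinds can help. This is only a sketch: the function name count_fastq_duplicates is made up for illustration, and it assumes an uncompressed FASTQ with exactly four lines per record, where the read name is the first whitespace-delimited token of the header line.

```shell
# Hypothetical helper: count repeated read names and repeated sequences.
count_fastq_duplicates() {
    awk '
        # Header lines: count every occurrence of a read name after the first.
        NR % 4 == 1 { if (ids[$1]++) dup_id++ }
        # Sequence lines: count every occurrence of a sequence after the first.
        NR % 4 == 2 { if (seqs[$0]++) dup_seq++ }
        END { printf "duplicate_ids=%d duplicate_seqs=%d\n", dup_id, dup_seq }
    ' "$1"
}

# Tiny made-up input: r1 appears twice, and r2 shares its sequence.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nHHHH\n@r3\nTTGA\n+\nIIII\n' > "$demo"
count_fastq_duplicates "$demo"
rm -f "$demo"
```

If duplicate_ids is zero but duplicate_seqs is not, the duplicates are distinct reads that happen to share a sequence.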
1
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 2.5 years ago:
gunzip -c in.fq.gz | paste - - - - | LC_ALL=C sort -t '\t' -k2,2 -u | tr "\t" "\n" | gzip > out.fq.gz
written 2.5 years ago by Pierre Lindenbaum

When I run this I get an error:

sort: multi-character tab `\\t'.

Also, I have a plain .fastq file, not a compressed fastq.gz.

written 2.5 years ago by sumithrasank75
2

Try

cat in.fq | paste - - - - | LC_ALL=C sort -t$'\t' -k2,2 -u | tr "\t" "\n" | gzip > out.fq.gz

Waiting for someone to write UUOC (useless use of cat) :)

written 2.5 years ago by Sukhdeep Singh
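
For a plain .fastq in and a plain .fastq out, the same pipeline can be wrapped up without cat or gzip. A minimal sketch, assuming four lines per record and no tab characters inside headers or quality strings; the function name dedup_fastq_by_seq is hypothetical.

```shell
# Hypothetical helper: keep one record per distinct sequence.
dedup_fastq_by_seq() {
    tab=$(printf '\t')   # a real tab character, portable across POSIX shells
    # paste joins each group of 4 lines into one tab-separated row,
    # sort -u on field 2 (the sequence) keeps one row per distinct sequence,
    # tr turns the tabs back into newlines to restore the 4-line layout.
    paste - - - - < "$1" \
        | LC_ALL=C sort -t "$tab" -k2,2 -u \
        | tr '\t' '\n'
}

# Tiny made-up input: r1 and r2 share a sequence, r3 differs.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nHHHH\n@r3\nTTGA\n+\nIIII\n' > "$demo"
dedup_fastq_by_seq "$demo" > "$demo.unique"
wc -l < "$demo.unique"
rm -f "$demo" "$demo.unique"
```

Note that only one record per sequence survives, so the names and quality strings of the other copies are discarded, and the output is ordered by sequence rather than in the original read order.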

Thanks, this works

written 2.5 years ago by sumithrasank75

Yeah, I used '\t' to show you it's a tab...

written 2.5 years ago by Pierre Lindenbaum
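
For anyone hitting the same error: inside single quotes the shell passes '\t' to sort as two characters, a backslash and a t, which is why sort complains about a multi-character tab. A small sketch of the difference, using printf to produce a real tab so it does not depend on bash's $'\t' syntax (the variable names are just for illustration):

```shell
literal='\t'          # two characters: a backslash and the letter t
tab=$(printf '\t')    # one character: an actual tab

echo "literal length: ${#literal}"
echo "tab length: ${#tab}"

# The real tab works as a field separator; this sorts the rows by field 2.
printf 'x\tb\ny\ta\n' | LC_ALL=C sort -t "$tab" -k2,2
```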

Thanks for clarifying this

written 2.5 years ago by sumithrasank75
1
Brian Bushnell (Walnut Creek, USA) wrote, 2.5 years ago:

With the BBMap package:

dedupe.sh in=reads.fq out=nodupes.fq

The output will contain exactly 1 copy of every unique sequence.  It's extremely fast, but may take more memory than other solutions - the amount of memory is proportional to the number of unique reads (rather than, say, the total input size).

written 2.5 years ago by Brian Bushnell
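
Since the memory needed scales with the number of unique reads, it may help to estimate that number beforehand. A rough sketch with standard tools, not part of BBMap; the function name count_unique_seqs is hypothetical and the input is assumed to be an uncompressed four-line FASTQ.

```shell
# Hypothetical helper: report total and distinct sequence counts.
count_unique_seqs() {
    # Extract the sequence line of each record, sort so identical sequences
    # are adjacent, let uniq -c collapse them, then sum up the counts.
    awk 'NR % 4 == 2' "$1" \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '{ total += $1; distinct++ }
               END { printf "total=%d unique=%d\n", total, distinct }'
}

# Tiny made-up input: three reads, two distinct sequences.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nHHHH\n@r3\nTTGA\n+\nIIII\n' > "$demo"
count_unique_seqs "$demo"
rm -f "$demo"
```

A unique count close to the total suggests there is little to remove, and that the memory footprint will be near its maximum for that file.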
0
Ram (New York) wrote, 2.5 years ago:

PRINSEQ should solve your problem. Check it out here: http://prinseq.sourceforge.net/manual.html#QCDUPLICATION

A bit of digging should get you the command line options for the feature.

written 2.5 years ago by Ram