Grepping unique reads from a FASTQ file
8.9 years ago

I have a FASTQ file with reads, but it contains duplicates. How can I get the unique entries in the 4-line FASTQ format?

sequencing next-gen • 5.2k views

You might want to clarify what you mean by "duplicate" in this case. Do you mean that they have the same sequence or that you have a single read from the machine duplicated multiple times?

BTW, the former situation is addressed by RAM's answer, the latter by Pierre's.
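To check which kind of duplicate a file actually contains, both can be counted in one pass. This is only a rough sketch: count_dups is a made-up helper name, the demo reads are invented, and it assumes a strict 4-line FASTQ with no tab characters in the header lines.

```shell
# Hypothetical helper: count reads whose sequence was already seen, and
# reads whose whole record (header, sequence, quality) was already seen.
count_dups() {
    # paste joins each 4-line record into one tab-separated line.
    paste - - - - < "$1" | awk -F '\t' '
        {
            if (seen_seq[$2]++) same_seq++   # same sequence as an earlier read
            if (seen_rec[$0]++) same_rec++   # byte-identical earlier record
        }
        END { printf "same_sequence=%d identical_record=%d\n", same_seq, same_rec }'
}

# Invented demo data: r3 shares its sequence with r1, and r1 appears twice.
dup_demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nACGT\n+\nIIII\n@r1\nACGT\n+\nIIII\n' > "$dup_demo"
dup_report=$(count_dups "$dup_demo")
echo "$dup_report"
```

If identical_record is close to same_sequence, the file mostly holds repeated copies of the same reads; if it is much smaller, most duplicates are distinct reads that happen to share a sequence.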

8.9 years ago
gunzip -c in.fq.gz | paste - - - - | LC_ALL=C sort -t '\t' -k2,2 -u | tr "\t" "\n" | gzip > out.fq.gz

When I run this I get an error:

sort: multi-character tab `\\t'

Also, I have a plain .fastq file, not a compressed .fastq.gz.


Try

cat in.fq | paste - - - - | LC_ALL=C sort -t$'\t' -k2,2 -u | tr "\t" "\n" | gzip > out.fq.gz

Waiting for someone to write UUOC (useless use of cat) :)
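For what it's worth, here is a sketch of the same pipeline without cat, writing plain FASTQ instead of gzip. The function name dedup_by_seq and the demo reads are made up for illustration; the tab is produced with printf so it also works in shells that lack the $'\t' syntax. It assumes a strict 4-line FASTQ with no tabs in the header lines.

```shell
# Hypothetical helper: keep one record per distinct sequence.
dedup_by_seq() {
    tab=$(printf '\t')
    # paste: join each 4-line record into one tab-separated line
    # sort -u on field 2 (the sequence): keep one line per sequence
    # tr: restore the 4-line layout
    paste - - - - < "$1" \
        | LC_ALL=C sort -t "$tab" -k2,2 -u \
        | tr '\t' '\n'
}

# Invented demo data: r1 and r3 carry the same sequence.
sort_demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' > "$sort_demo"
dedup_by_seq "$sort_demo" > "$sort_demo.out"
```

Note that the output comes back sorted by sequence, so the original read order is lost.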


Thanks, this works


Yeah, I used '\t' to show you it's a tab....


Thanks for clarifying this

8.9 years ago

With the BBMap package:

dedupe.sh in=reads.fq out=nodupes.fq

The output will contain exactly 1 copy of every unique sequence. It's extremely fast, but may take more memory than other solutions - the amount of memory is proportional to the number of unique reads (rather than, say, the total input size).
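To illustrate that memory trade-off without BBMap, here is a small awk sketch that keeps every distinct sequence in an in-memory hash. This is not how dedupe.sh is implemented, just the same idea in miniature; dedup_in_memory and the demo reads are made up, and a strict 4-line FASTQ with no tabs in the headers is assumed.

```shell
# Hypothetical helper: keep the first record seen for each sequence.
# Memory grows with the number of distinct sequences, not with file size.
dedup_in_memory() {
    paste - - - - < "$1" \
        | awk -F '\t' '!seen[$2]++' \
        | tr '\t' '\n'
}

# Invented demo data: r3 repeats the sequence of r1 and is dropped.
hash_demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' > "$hash_demo"
dedup_in_memory "$hash_demo" > "$hash_demo.out"
```

Unlike a sort-based approach, this keeps the reads in their original order and needs no temporary files, at the cost of holding all unique sequences in RAM.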

8.9 years ago
Ram 43k

PRINSEQ should solve your problem. Check it out here: http://prinseq.sourceforge.net/manual.html#QCDUPLICATION

A bit of digging should get you the command line options for the feature.
