Is There A Fastq Alternative To Fastx_Collapser (Outputs Fasta)?
12.2 years ago
Yannick Wurm ★ 2.4k

fastx_collapser seems to convert my fastq files to fasta. That's not cool.

cat a
@HWI-ST132_0395:8:1:1177:1888#ATCTNC/1
ATACATATATCAGCATAAAGGTGTTCACAGGTCATCATGAGGGATCAGTTTGTAGCAATTACGGAGGTCACGAGATCGGACGAGCGGTTGCGCA
+HWI-ST132_0395:8:1:1177:1888#ATCTNC/1
d^ddddeccce\eedddac^JW\XZLL]\\TYHNVZQ__L\P_^a_^\^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-ST132_0395:8:1:1048:1897#ATNTNN/1
GTGGATTCCGGGGGAATGGGGAGCGGGACGATGTGAAAGGAGCGGGAAGGGGGCGGAAGCGCGGCACAGTCGGCAGGCAGAGTTGCTAGAACAG
+HWI-ST132_0395:8:1:1048:1897#ATNTNN/1
ccacTccbcccYbU^YM^\L^\\Z^\P]]YLUJ]VOaQ_U]^aBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

fastx_collapser -i a
>1-1
GTGGATTCCGGGGGAATGGGGAGCGGGACGATGTGAAAGGAGCGGGAAGGGGGCGGAAGCGCGGCACAGTCGGCAGGCAGAGTTGCTAGAACAG
>2-1
ATACATATATCAGCATAAAGGTGTTCACAGGTCATCATGAGGGATCAGTTTGTAGCAATTACGGAGGTCACGAGATCGGACGAGCGGTTGCGCA


Is there an alternative collapser?

What quality score would you want to see in cases with multiple identical sequences? That's probably the hard problem Assaf was trying to avoid by outputting as FASTA.

Yes, that's a pain. I just write a quick throw-away script to do it. You could try emailing the author of the FASTX-Toolkit and asking. But I'd like to see a solution with awk/sed. :-)

I'd be happy with anything :) Be it a random choice, or the quality scores of the sequence with the highest overall quality...

Why do you need the output to be fastq? I'd be wary of using random (or at least not entirely correct) quality scores in downstream processing... If you're planning on aligning next, I think most aligners will take fasta input (I know bowtie and novoalign do).

12.2 years ago
brentp 24k

which when run as:

./fastq filter --adjust 64 --unique /path/to/your.fasta > unique.fasta


will keep the records with the highest average quality.
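For readers who want to see the idea spelled out, here is a minimal Python sketch of the same strategy: collapse identical sequences and keep the record with the highest average quality. It is not the code of the tool above. The function names (`read_fastq`, `mean_quality`, `collapse_fastq`, `write_fastq`) are made up for illustration, the default offset of 64 assumes that `--adjust 64` refers to Phred+64 encoded qualities, and the parser assumes plain four-line records with no blank lines.

```python
import io


def read_fastq(handle):
    """Yield (name, seq, qual) tuples from a FASTQ handle.

    Assumes strict four-line records (no wrapped sequences, no blank lines).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, contents ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def mean_quality(qual, offset=64):
    """Average quality score of a quality string (offset 64 is an assumption)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def collapse_fastq(records, offset=64):
    """Keep one record per distinct sequence: the one with the best mean quality.

    On ties, the record seen first is kept.
    """
    best = {}  # seq -> (score, name, qual)
    for name, seq, qual in records:
        score = mean_quality(qual, offset)
        if seq not in best or score > best[seq][0]:
            best[seq] = (score, name, qual)
    return [(name, seq, qual) for seq, (_, name, qual) in best.items()]


def write_fastq(records, handle):
    """Write (name, seq, qual) tuples as four-line FASTQ records."""
    for name, seq, qual in records:
        handle.write(f"@{name}\n{seq}\n+\n{qual}\n")
```

To try it on a file, open the file, pass the handle through `read_fastq` and `collapse_fastq`, and hand the result to `write_fastq` together with `sys.stdout`. Everything is held in memory, so this only suits files that fit in RAM.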

This tool is great and real quick. I've looked at the code but couldn't find a good way to also print the name of the read with the highest average quality in the output (my C knowledge is fairly rusty). What I'm trying to do is unique paired-end reads, so I need to know where one read ends and the other starts to be able to separate them and use them for downstream analyses. Any ideas? Thanks.

Win - it's very fast too! I wish more things were written in C! Thanks

I guess it cannot collapse paired-end reads, can it?
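One way to approach paired-end reads, sketched here in Python, is to treat a pair as a duplicate only when both mates match, and to keep the pair with the best average quality over both mates. This is an illustration and not a feature of the tool above. `collapse_pairs` and `pair_quality` are hypothetical names, the offset of 64 is an assumption (Phred+64), and the input is assumed to be already parsed into `(name, seq, qual)` tuples with the mates in matching order.

```python
def pair_quality(qual1, qual2, offset=64):
    """Mean quality over both mates of a pair (offset 64 is an assumption)."""
    combined = qual1 + qual2
    if not combined:
        return 0.0
    return sum(ord(c) - offset for c in combined) / len(combined)


def collapse_pairs(pairs, offset=64):
    """Collapse read pairs that are identical in BOTH mates.

    `pairs` is an iterable of ((name1, seq1, qual1), (name2, seq2, qual2)).
    The key is the tuple (seq1, seq2), so the two mates are never merged
    into one string and stay separable in the output. For each key, the
    pair with the highest combined mean quality is kept; on ties, the
    pair seen first wins.
    """
    best = {}  # (seq1, seq2) -> (score, mate1, mate2)
    for mate1, mate2 in pairs:
        key = (mate1[1], mate2[1])
        score = pair_quality(mate1[2], mate2[2], offset)
        if key not in best or score > best[key][0]:
            best[key] = (score, mate1, mate2)
    return [(mate1, mate2) for _, mate1, mate2 in best.values()]
```

Because each kept pair carries its original read names, the two mates can be written back to separate files for downstream analyses.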

