Fastx_Collapser Has No Fastq Output
4
1
Entering edit mode
9.2 years ago
nbvasani ▴ 240

Hi Felllows,

I used below cmd to remove duplicate reads.

fastx_collapser -Q33 -v -i CombineIonXpresRNA.fastq -o Collapsed.fastq

But output file was fasta instead of fastq.Output file says fastq but in actually it is fasta file.

I would really appreciate your help!

Thanks, Naresh

bioinformatics fastx rna-seq • 8.8k views
1
Entering edit mode
9.2 years ago
Dan D 7.3k

I looked at the source code and the tool is written in such a way that it will not provide FASTQ output. I guess the "fastx" indicates that it takes FASTA/Q as input.

If you can provide specific requirements I can update the tool to provide FASTQ output. Requirements can be as simple as showing me a few sample input reads and then showing me what the sample output should look like.

EDIT: if you're only looking to remove duplicate reads, then FastqMCF with the -D and -0 options can do this for you.

0
Entering edit mode

Hi Deedee,

Fastq format will have quality value and header starting with @ sign as shown below. I want this both option including sequence in output file. But output file must represent all of the reads as shown below: Input: 25122053 sequences (representing 25122053 reads) Output: 17946833 sequences (representing 25122053 reads)

Fastq file format:

@GWG70:09324:09027 GAGGGAACCTGATCTCCGGTCGTCAGCTCCGTCCGATTCTGCTTCTGACTAGGAAGCCAATGGCTCTGGTTAAGAAGCTCAGGAA + 59;>4;4;7>8877885;6:;888;;:::57552577478797265776333/4+..45+..35442634.2/25466:=885;; @GWG70:09324:09031 ACTTCCTTCCACACCTTACCTAATCTAATTCCGAATCTGGGATTTGGATCTCAGAAAGATGAAGGTGGTTGATAAAATTCAAATCTGTGACAGGATCGAAGCCAA + 9661715493886604/341//+.656387<6<<59878829;:2;6;:::::::;3;<:::5<5::2:3691../)/-022010---36854145554*-/145 @GWG70:09324:09043 TCAGGTGATCATAGTCTAGTCCATCTGTGTTGTGTTTATGCTTGTTCTCCCGTTTCTTTAAATTCTATGTTCTTTAAATTTCTATGTGAAACTAGTGTTTCTATTTCCTTATTCACACTAC

Fasta format:

1-11358 TGGTGGTAACGGGTGACGGAGAATTAGGGTTCGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCAATCCTGACACGGGGAGGTAGTGACAATAAATAAC 2-8325 GGTGCCTTCGGGAACGCGGACACAGGTGGTGCATGGCTGTCGTCAGCTCGTGCCGTAAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACC 3-7229 CGATAGTAATTCAACCTAGTACGAGAGGAACCGTTGATTCGCACAATTGGTCATCGCGCTTGGTTGAAAAGCCAGTGGCGCGAAGCTACCGTGCGCTGGATTATGA 4-7225 AGACGACTTAAATACGCGACGGGGTATTGTAAGTGGCAGAGTGGCCTTGCTGCCACGATCCACTGAGATTCAGCTTTGTCGCTAAGATTCGA 5-6952 AAGACTGTTGCCAAGCCAAAGGGTCCATCAGGCAGCCCATGGTACGGATCTGACCGAGTCAAGTACTTGGGTCCGTTCTCAGGTGAGCCACCAAGCTACCTTACCG

It would be awesome, if can update tool for me. Thanks a lot in advance.

Naresh

0
Entering edit mode
9.2 years ago

I have never used fastx_collapser but going through the manual I found the usage should be

usage: fastx_collapser [-h] [-v] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h]         = This helpful help screen.
[-v]         = verbose: print short summary of input/output counts
[-i INFILE]  = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.


I don't think that you can give -Q as a input parameter for collapser. I may be wrong. Also, can you try lower case "-q" instead of "-Q" .

0
Entering edit mode

Hi ashutoshmits,

Withour -Q as input parameter it will not excecute the command, Q33 is required for Phred quality check. -q is giving error.

Thanks, Naresh

0
Entering edit mode

Hi Naresh, Can you let me know why you want to remove the duplciate reads at the fastq level ? Is this a miRNA dataset with majority of reads being identical and you don't want to waste time in mapping ? Thanks

1
Entering edit mode

Hi ashutoshmits,

I want to remove duplicate reads because while creating de novo assembly, it is using all of my cpu memory.

Naresh

0
Entering edit mode

Hi Naresh, Can you tell me if the header of the sequences in the input fastq file remain same in output fasta file.

0
Entering edit mode

0
Entering edit mode

What about text that succeeds "@" ? I was thinking if you can write a code that borrows quality sequence information from the original file and add it to the new fasta file.

0
Entering edit mode

I am not good with writing code.

0
Entering edit mode

It is now very well documented but Q33 is valid parameter for large group of Illumina reads.

0
Entering edit mode

Actually this reads are from Ion proton sequencer.

Naresh

0
Entering edit mode

Ion Torrent uses Sanger (Phred+33), so it is same as new Illumina http://en.wikipedia.org/wiki/FASTQ_format

0
Entering edit mode

Yap, you are right.

0
Entering edit mode
9.2 years ago

Yes, it is fasta file. Fastx collapser cannot produce fastq files.

Look at Picards Mark duplicates (http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates) or samtool's rmdup (http://samtools.sourceforge.net/samtools.shtml). These two however need .bam files.

0
Entering edit mode

Hi Noolean,

Thanks for suggestion. But I need some tool on front end which can remove duplicate before creating de novo assembly.

Naresh

0
Entering edit mode

See the edit to my answer. FastqMCF will do exactly what you're asking.

0
Entering edit mode

Hi Deedee,

But can it represts all of the reads present in fastq file?

Thanks, Naresh

1
Entering edit mode

If you run FastqMCF with the -0 flag and make the argument to your -D flag the number of cycles per read, then yes, what you should receive is a FASTQ file (or two files if paired-end) containing every unique sequence from the input FASTQ, with no repeats.

0
Entering edit mode

Hi Deedee,

When I try to install with make command.

After running through several step it gave me following error:

make[1]: Leaving directory /media/DATAPART3/ea-utils.1.1.2-537/samtools'
g++ -O3 -I.  samtools/*.o -lz -lpthread fastq-lib.cpp sam-stats.cpp -o sam-stats
g++ -O3 -I.  fastq-lib.cpp tidx/tidx-lib.cpp -o varcall varcall.cpp -lgsl -lgslcblas
varcall.cpp:33:29: fatal error: gsl/gsl_randist.h: No such file or directory
compilation terminated.
make: *** [varcall] Error 1


Any Suggestion to solve this error.

Thanks,
Naresh

0
Entering edit mode

You can do one of two things as far as I know:

install the next-oldest version, which should work perfectly for the functionality you need:

Per the issues page, you can install the GNU Scientific Library tools. How to do it depends on your OS but it's pretty easy:

0
Entering edit mode

Hi Deedee,

I install the older version, but unfortunately it does not have -D option.

So I try to find gsl-bin file.

Thanks a lot for your help. Naresh

0
Entering edit mode
wget ftp://ftp.gnu.org/gnu/gsl/gsl-1.9.tar.gz
tar xzf gsl-1.9.tar.gz
cd gsl-1.9
./configure
make
make install

0
Entering edit mode

Thanks a lot for link to download gsl. I am sudo user, but still is showing me permission denied, when I excecute ./configure command.

Any suggestion?

Thanks, Naresh

0
Entering edit mode

Does configure have an executable bit on it? And are you sure you're in the gsl-1.9 directory when you try to execute the command?

0
Entering edit mode

I am in gsl-1.9 directory. It does not have executable bit on it. It is script type not executable type.

0
Entering edit mode

OK to make it executable use the command chmod 0500 configure while you're in the gsl-1.9 directory. This pattern will allow only your user or root to execute the program. Then try it again via sudo. You can also try providing the absolute path to configure instead of prepending with ./ I'm not directly observing, so it's hard to be sure exactly why configure isn't running, but hopefully one of those two things will get you rolling.

0
Entering edit mode

Thanks for your quick response. I follow same step as you suggested. But still no luck.

0
Entering edit mode

Which Linux distribution are you using?

0
Entering edit mode
Linux version 3.6.11-4.fc16.x86_64 (mockbuild@bkernel02) (gcc version 4.6.3 20120306 (Red Hat 4.6.3-2) (GCC) ) #1 SMP Tue Jan 8 20:57:42 UTC 2013

1
Entering edit mode

what happens when you type sudo yum install gsl-devel?

0
Entering edit mode

Yum install below file:

Installed: gsl-devel.x86_64 0:1.15-3.fc16

Dependency Installed:

atlas.x86_64 0:3.8.4-1.fc16

gsl.x86_64 0:1.15-3.fc16

1
Entering edit mode

Score! Try installing the most recent version of ea-utils again and then try the command I originally showed you. Having the GNU Scientific Library installed will be useful for installing other tools down the road as well.

0
Entering edit mode

YaaaaaaaaaaaaaaaaaaaaaaaaaY! It work like a charm.

Thanks a lot for step by step guidance.

Naresh

0
Entering edit mode

Great! Please don't forget to mark the answer that worked as "accepted" so that future visitors will know what to do.

0
Entering edit mode

Sure. Thanks once again.

0
Entering edit mode

Hi Deedee,

Sorry for bothering you again.

I got a lot difference in term of output reads while using 2 different tool.

With fastq-mcf I am getting only 6 million reads from 25 million, after removing duplicate reads.

#fastq-mcf -0 -D 35 n/a Inputfile.fastq -o output.fastq


Any suggestion?

Thanks,
Naresh

0
Entering edit mode

The argument to -D should be your read length. Are you putting exactly what your read length is? Otherwise it's going to trim reads that have at least 35 bases in common, which will substantially reduce your population.

0
Entering edit mode

My fastq file read length is from 50-350bp. So which read length should I consider to put?

Thanks, Naresh

0
Entering edit mode
6.1 years ago

Albeit this is an old post, maybe some other will find this useful too: http://seqcluster.readthedocs.io/collapse.html

N.B. That it requires quite a few python dependencies, which, just as seqcluster itself, can be installed via pip.

[UPDATE] The code described at A: Is There A Fastq Alternative To Fastx_Collapser (Outputs Fasta)? by @brentp seems to fit my needs better as it keeps the quality values of the read with the highest quality rather than creating an average quality as in the case of seqcluster`.