Question: using NCBI SRA data for prinseq
10 months ago
mforthman30 wrote:

I have been trying to download SRA data from NCBI and convert it to FASTQ format using fastq-dump. A colleague and I are trying to figure out why the resulting FASTQ files cause errors when passed to prinseq-lite.

My collaborator has been using this fastq-dump command:

fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRR5040251

This is the resulting fastq file for read /1 (we also have the corresponding read /2 file):

+SRR5040251.1 FCC4LTMACXX:1:1101:2339:1998 length=91
+SRR5040251.2 FCC4LTMACXX:1:1101:3060:1995 length=91
+SRR5040251.3 FCC4LTMACXX:1:1101:3278:1996 length=91
+SRR5040251.4 FCC4LTMACXX:1:1101:4171:1998 length=91
+SRR5040251.5 FCC4LTMACXX:1:1101:5115:1991 length=91

We then run prinseq-lite with: -fastq SRR5040251_1.fastq -fastq2 SRR5040251_2.fastq -derep 12345

This produces the following error:

ERROR: input file for -fastq is in UNKNOWN format not in FASTQ format.

We have been searching all day and cannot find a solution to this.

10 months ago
United States
swbarnes24.9k wrote:

Just checking Wikipedia, which isn't necessarily the authoritative source: the + line has to be either a bare + with nothing after it, or a + followed by a copy of the identifier from the @ line.

So I'd fix that, or fix the Perl script so it doesn't mind, and try again.
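
A minimal sketch of the first option, assuming the files use strict four-line FASTQ records (no wrapped sequence or quality lines). The helper name fix_plus_lines is made up for illustration and is not part of prinseq or sra-tools:

```shell
# Reset every separator line (line 3 of each 4-line record) to a bare "+".
# Assumption: each record is exactly four lines; wrapped FASTQ would break this.
# fix_plus_lines is a hypothetical helper name used only for this example.
fix_plus_lines() {
    awk 'NR % 4 == 3 { print "+"; next } { print }' "$@"
}
```

It could be used as fix_plus_lines SRR5040251_1.fastq > SRR5040251_1.fixed.fastq, and likewise for the _2 file. Alternatively, fastq-dump has a --defline-qual option that, as far as I know, controls the + line at download time, so adding --defline-qual '+' to the original command may avoid the post-processing step entirely.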

Exactly, it should work fine once you fix the line starting with "+".

written 10 months ago by Sishuo Wang140
10 months ago
United States
genomax63k wrote:

I suggest that you use the -F option of fastq-dump to retrieve the original Illumina-format FASTQ headers, or get the FASTQ files directly from EBI-ENA.
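
As a rough sketch of the first suggestion, assuming sra-tools is installed and fastq-dump is on the PATH. The wrapper name dump_original_headers is hypothetical. The accession is the one from the question:

```shell
# -F (--origfmt) asks fastq-dump to put only the original sequence name in the defline.
# --split-files writes each read of a pair to its own file, as in the original command.
# dump_original_headers is a hypothetical wrapper name used only for this example.
dump_original_headers() {
    fastq-dump -F --split-files "$1"
}
```

Calling dump_original_headers SRR5040251 should write SRR5040251_1.fastq and SRR5040251_2.fastq. It is worth inspecting the first few records with head before rerunning prinseq-lite, to confirm that the @ and + lines now agree.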

