Using NCBI SRA data for prinseq
3.6 years ago
mforthman ▴ 40

I have been trying to download SRA data from NCBI and convert it to FASTQ format using fastq-dump. A colleague and I have been trying to figure out why the resulting FASTQ files cause errors when passed to prinseq-lite.

My collaborator has been using this fastq-dump command:

fastq-dump --defline-seq '@$sn[_$rn]/\$ri' --split-files SRR5040251


These are the first records of the resulting FASTQ file for read /1 (we also have the corresponding read /2 file):

@FCC4LTMACXX:1:1101:2339:1998/1
NGATAATTAGAACTATAACCCCCTTCCTGCTCTATAGATAAGATTTGATAATTCTGACCATATACCAGAACCCCCCATTCCGTATTATTAG
+SRR5040251.1 FCC4LTMACXX:1:1101:2339:1998 length=91
#1=DDDDDEDDDDIIIBE?CF@A)CBE>CBCD*:C@@?9**??*?B*?DD99D?B44*?DC@C###########################@
@FCC4LTMACXX:1:1101:3060:1995/1
NTGCTTCTCAAGGTGGCCATCAAATTGTTAAGTTGTTCCTTGTAAGAGGAAGATACGGTGGCGAAGCCACCACCCTTCTTTCCACGGCCAT
+SRR5040251.2 FCC4LTMACXX:1:1101:3060:1995 length=91
#1=DFFFFHHHHHEHIJJJJJJJJJJJJJJJJIJJGJJJJJJIGGJIJJIJJJHIFDEFHIGIGJHGGFFFFDDCACDDDDDDEDBDDDDC
@FCC4LTMACXX:1:1101:3278:1996/1
NTTATTTGTTCAAACTACTTCTGATTGGAGATTCTGGAGTAGGGAAATCGTGCTTATTGTTGAGATTTGCGGATGATGCTTATTCTGAAAG
+SRR5040251.3 FCC4LTMACXX:1:1101:3278:1996 length=91
#4BDFFFFHHHHHJJJIJJJJJJJJJJJJJIJJJJJJHJFHIJJHJIJJJHIJJIJJJJHIJIIHJJJJJJHHFFEEEEEEEEEEFEDDDC
@FCC4LTMACXX:1:1101:4171:1998/1
NGTCCCCAAACCCCAGATCAAATAGTACCGGACCGTTAAAACACTCTGTAATCATTTTTTGGTATAACTGTGTTTTATTTTGAAGACATGG
+SRR5040251.4 FCC4LTMACXX:1:1101:4171:1998 length=91
#1=DDFFFHHHHHJJJHJJJJJJJIIIIJJJJJIJGIIJJJJJJJJIJJJJIJGIJHHHFFDAEEDDDDDCDACDCDDDEEDDDDDDDDDC
@FCC4LTMACXX:1:1101:5115:1991/1
NGACCACAGACGCTTAGCTCTCCAGAGCCCGGTGAAGTTGAAGAGTCATTGGATGCGCCTTTCGCCATGAGCCAAACAGAATCACCAGCTC
+SRR5040251.5 FCC4LTMACXX:1:1101:5115:1991 length=91
#4=DFFFFHHHHHJJJJJJJJJJJJJJJJJJJDHHIJHIGIIJJICHIJJIJJJIJJHHHFFFFDDDDDDDDDDDCBDDDDDDDDDDDDDB


We then run prinseq-lite:

prinseq-lite.pl -fastq SRR5040251_1.fastq -fastq2 SRR5040251_2.fastq -derep 12345


This produces the following error:

ERROR: input file for -fastq is in UNKNOWN format not in FASTQ format.


We have been searching all day and cannot find a solution to this.

3.6 years ago

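Going by Wikipedia, which isn't necessarily the authoritative source, the + line has to be either a bare "+" or a "+" followed by a copy of the identifier from the @ line. In your file the + line carries a different string (the SRR accession, the read name and the length), so it is neither.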
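To make the rule concrete, here is a minimal Python sketch of that check. The function name `plus_line_ok` is made up for illustration, and the rule it encodes is only the one described in this answer, not necessarily the exact test prinseq-lite applies internally.

```python
def plus_line_ok(header: str, plus: str) -> bool:
    """Return True if the '+' line is bare or repeats the text of the '@' line.

    Hypothetical helper: encodes the rule as described in this answer only.
    """
    header = header.rstrip("\r\n")
    plus = plus.rstrip("\r\n")
    if not header.startswith("@") or not plus.startswith("+"):
        return False
    # Either nothing follows the '+', or it matches what follows the '@'.
    return plus == "+" or plus[1:] == header[1:]


# First record from the question: the '+' line does not match the '@' line.
print(plus_line_ok(
    "@FCC4LTMACXX:1:1101:2339:1998/1",
    "+SRR5040251.1 FCC4LTMACXX:1:1101:2339:1998 length=91",
))  # False
```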

So I'd fix the + lines, or patch the Perl script so it tolerates them, and try again.
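If you go the route of fixing the files, something like the following Python sketch would do it. It assumes strict four-line records (no wrapped sequence lines), which matches the fastq-dump output shown in the question, and it assumes the mismatched + line really is what prinseq-lite objects to. The function name `strip_plus_comments` is made up for illustration.

```python
def strip_plus_comments(lines):
    """Yield FASTQ lines with the third line of every 4-line record cut to '+'.

    Assumes strict 4-line records; raises if the separator line is not
    where it is expected, rather than silently corrupting the file.
    """
    for i, line in enumerate(lines):
        if i % 4 == 2:
            if not line.startswith("+"):
                raise ValueError(
                    f"line {i + 1}: expected '+' separator, got {line!r}"
                )
            # Keep the original line ending, drop everything after the '+'.
            yield "+\n" if line.endswith("\n") else "+"
        else:
            # Header, sequence and quality lines pass through unchanged,
            # even when a quality string happens to start with '+'.
            yield line
```

To apply it, open the input FASTQ for reading and an output file for writing, then call `dst.writelines(strip_plus_comments(src))` once for the /1 file and once for the /2 file.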

Exactly. It should work fine once you fix the line starting with the "+" symbol.

3.6 years ago
GenoMax 110k

I suggest that you use the -F option of fastq-dump to retrieve the original Illumina-format FASTQ headers, or get the FASTQ files directly from EBI-ENA.