Question: using NCBI SRA data for prinseq
mforthman30 wrote:

I have been trying to download SRA data from NCBI and convert it to FASTQ format using fastq-dump. A colleague and I are trying to figure out why the resulting FASTQ files cause an error when used as input to prinseq-lite.

My collaborator has been using this fastq-dump command:

fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRR5040251

This is the resulting fastq file for read /1 (we also have the corresponding read /2 file):

@FCC4LTMACXX:1:1101:2339:1998/1
NGATAATTAGAACTATAACCCCCTTCCTGCTCTATAGATAAGATTTGATAATTCTGACCATATACCAGAACCCCCCATTCCGTATTATTAG
+SRR5040251.1 FCC4LTMACXX:1:1101:2339:1998 length=91
#1=DDDDDEDDDDIIIBE?CF@A)CBE>CBCD*:C@@?9**??*?B*?DD99D?B44*?DC@C###########################@
@FCC4LTMACXX:1:1101:3060:1995/1
NTGCTTCTCAAGGTGGCCATCAAATTGTTAAGTTGTTCCTTGTAAGAGGAAGATACGGTGGCGAAGCCACCACCCTTCTTTCCACGGCCAT
+SRR5040251.2 FCC4LTMACXX:1:1101:3060:1995 length=91
#1=DFFFFHHHHHEHIJJJJJJJJJJJJJJJJIJJGJJJJJJIGGJIJJIJJJHIFDEFHIGIGJHGGFFFFDDCACDDDDDDEDBDDDDC
@FCC4LTMACXX:1:1101:3278:1996/1
NTTATTTGTTCAAACTACTTCTGATTGGAGATTCTGGAGTAGGGAAATCGTGCTTATTGTTGAGATTTGCGGATGATGCTTATTCTGAAAG
+SRR5040251.3 FCC4LTMACXX:1:1101:3278:1996 length=91
#4BDFFFFHHHHHJJJIJJJJJJJJJJJJJIJJJJJJHJFHIJJHJIJJJHIJJIJJJJHIJIIHJJJJJJHHFFEEEEEEEEEEFEDDDC
@FCC4LTMACXX:1:1101:4171:1998/1
NGTCCCCAAACCCCAGATCAAATAGTACCGGACCGTTAAAACACTCTGTAATCATTTTTTGGTATAACTGTGTTTTATTTTGAAGACATGG
+SRR5040251.4 FCC4LTMACXX:1:1101:4171:1998 length=91
#1=DDFFFHHHHHJJJHJJJJJJJIIIIJJJJJIJGIIJJJJJJJJIJJJJIJGIJHHHFFDAEEDDDDDCDACDCDDDEEDDDDDDDDDC
@FCC4LTMACXX:1:1101:5115:1991/1
NGACCACAGACGCTTAGCTCTCCAGAGCCCGGTGAAGTTGAAGAGTCATTGGATGCGCCTTTCGCCATGAGCCAAACAGAATCACCAGCTC
+SRR5040251.5 FCC4LTMACXX:1:1101:5115:1991 length=91
#4=DFFFFHHHHHJJJJJJJJJJJJJJJJJJJDHHIJHIGIIJJICHIJJIJJJIJJHHHFFFFDDDDDDDDDDDCBDDDDDDDDDDDDDB

We then run prinseq-lite on both files:

prinseq-lite.pl -fastq SRR5040251_1.fastq -fastq2 SRR5040251_2.fastq -derep 12345

This produces the following error:

ERROR: input file for -fastq is in UNKNOWN format not in FASTQ format.

We have been searching all day and cannot find a solution to this.

modified 10 months ago by swbarnes2 • written 10 months ago by mforthman
swbarnes24.9k wrote:

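According to Wikipedia, which isn't necessarily the authoritative source, the + line must be either a bare + or a + followed by a copy of the @ line's identifier. In your files the + line carries a different identifier (the SRR accession, the read name without /1, and a length field).

So I'd fix that, or patch the Perl script to tolerate it, and try again.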
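
One way to apply this fix to files that are already downloaded is to blank out the text after the + on every record. The sketch below assumes each record occupies exactly four lines, as in the excerpt in the question; the function name strip_plus_deflines is hypothetical, and whether prinseq-lite then accepts the files has not been confirmed in this thread.

```shell
# Hypothetical helper: print a FASTQ file with every "+..." separator line
# replaced by a bare "+".
# Assumes strictly 4 lines per record (no wrapped sequence or quality lines),
# so the separator is always line 3 of each group of 4.
strip_plus_deflines() {
    awk 'NR % 4 == 3 { print "+"; next } { print }' "$1"
}
```

For example, strip_plus_deflines SRR5040251_1.fastq > SRR5040251_1.fixed.fastq writes a corrected copy and leaves the original untouched. Selecting lines by position rather than by a leading + avoids touching quality lines that happen to begin with +.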

written 10 months ago by swbarnes2

Exactly. It should work once you fix the lines starting with the "+" symbol.

written 10 months ago by Sishuo Wang
genomax63k wrote:

I suggest using fastq-dump's -F option to retrieve the original Illumina-format FASTQ headers, or downloading the FASTQ files directly from EBI-ENA.
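
As a rough sketch of the first suggestion, the call could be wrapped as below. The function name fetch_original_headers is hypothetical. -F is the short form of fastq-dump's --origfmt, and --split-files is carried over from the command in the question. This has not been checked against prinseq-lite here.

```shell
# Hypothetical wrapper: download one SRA run with the original read names
# in the deflines (-F / --origfmt) and one output file per read in a pair.
# Requires fastq-dump from the SRA Toolkit on PATH, plus network access.
fetch_original_headers() {
    fastq-dump -F --split-files "$1"
}
```

Calling fetch_original_headers SRR5040251 would replace the custom --defline-seq format used in the question.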

written 10 months ago by genomax