Question

ABySS error: sequence and quality must be the same length near ...

7.1 years ago

jozs2019 ▴ 10

Hello!

I'm trying to use ABySS 1.9.0 to assemble a set of genomic paired-end reads. I'm running this on my university's HPC with parallel processing. In my PBS script, I use:

abyss-pe name=abyss_test1 k=63 in='SRR960028_1.fastq SRR960028_2.fastq' v=-v

Unfortunately, ABySS terminates quite quickly after commencing assembly, and the error message I get is:

SRR960028_1.fastq:745672: error: sequence and quality must be the same length near TGGGGACGGCAAGTATCACAGGTGACCCACTCACTGTTTCACCTCTCACCCTAATATGACCGTGTCTACAAGAAGTCAGTCAGCTGTTTCTGTTCCCCAGTGAGAGAGCAA$

CCCFFFFFHHHHHGHIIIIIIIDHHIIIIIIIIIIIIIGIEIIIIIIII

make: *** [abyss_test1-1.fa] Error 1

When I open the other file, I get:

/usr/local/openmpi/1.8.4-gcc/bin/mpirun -np 4 ABYSS-P -k63 -q3 -v --coverage-hist=coverage.hist -s $

ABySS 1.9.0

ABYSS-P -k63 -q3 -v --coverage-hist=coverage.hist -s abyss_test1-bubbles.fa -o abyss_test1-1.fa SRR96$

Running on 4 processors

0: Running on host hpc088

1: Running on host hpc088

2: Running on host hpc088

3: Running on host hpc088

0: Reading 'SRR960028_1.fastq'...

1: Reading 'SRR960028_2.fastq'...

1: Read 100000 reads.

1: Hash load: 3609195 / 268435456 = 0.0134 using 447 MB

0: Read 100000 reads.

0: Hash load: 3898410 / 268435456 = 0.0145 using 469 MB

1: Read 200000 reads.

1: Hash load: 7004611 / 268435456 = 0.0261 using 646 MB

Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted. mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[12781,1],0] Exit code: 1

Does anyone know what this error means and how I can fix it?

error genome abyss assembly • 2.7k views

updated 7.1 years ago by benv ▴ 730 • written 7.1 years ago by jozs2019 ▴ 10

score 1 · Answer 1 · 2017-03-20

FASTQ records consist of 4 lines each. The first line is the header; it starts with "@" and contains the read ID. The second line contains the sequence. The third line is a separator that starts with "+" (usually just the "+" on its own). The fourth line is the quality score string.

Each character in the quality score string encodes the quality score for the corresponding base in the sequence string. Thus the quality score line should be exactly the same length as the sequence line. For one of your FASTQ records the two lines have different lengths, and ABySS is telling you the approximate line in SRR960028_1.fastq where the problem record is located (line 745672).
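As a rough illustration of that check, the sketch below reads a FASTQ file four lines at a time and reports every record whose sequence and quality lines differ in length. It assumes an uncompressed file with strict 4-line records (no line-wrapped sequences). The function name `find_length_mismatches` is made up for this example and is not part of ABySS.

```python
from typing import Iterator, Tuple


def find_length_mismatches(path: str) -> Iterator[Tuple[int, str, int, int]]:
    """Yield (quality_line_number, header, seq_len, qual_len) for bad records.

    Assumes strict 4-line FASTQ records; line numbers are 1-based.
    """
    with open(path) as handle:
        line_no = 0
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            seq = handle.readline().rstrip("\r\n")
            handle.readline()  # '+' separator line, content ignored
            qual = handle.readline().rstrip("\r\n")
            line_no += 4  # line_no now points at the quality line
            if len(seq) != len(qual):
                yield line_no, header.strip(), len(seq), len(qual)
```

Calling `list(find_length_mismatches("SRR960028_1.fastq"))` would show whether the record near line 745672 is the only bad one or whether the problem is more widespread, which is a useful hint about what went wrong upstream.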

I would recommend first looking at the record in question manually (less is a good tool for that). Then you will need to either figure out which upstream processing step caused the line lengths to differ, or write a script (e.g. in sed, awk, perl, or python) to fix the FASTQ file so that the sequence and quality lines always have the same length.
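Both steps can be sketched in Python. This is one possible approach under stated assumptions, not a definitive fix: it assumes uncompressed files with strict 4-line records, and that the two paired files list mates in the same order. The function names `record_at_line` and `drop_bad_pairs` are hypothetical. With strict 4-line records, line 745672 is the fourth (quality) line of a record, so the record in question would span lines 745669 to 745672.

```python
from itertools import islice
from typing import List


def record_at_line(path: str, line_number: int) -> List[str]:
    """Return the 4 lines of the FASTQ record containing a 1-based line number."""
    first = (line_number - 1) // 4 * 4  # 0-based index of the record's header line
    with open(path) as handle:
        return [line.rstrip("\r\n") for line in islice(handle, first, first + 4)]


def _lengths_match(record: List[str]) -> bool:
    """True if the sequence (2nd) and quality (4th) lines have equal length."""
    return len(record[1].rstrip("\r\n")) == len(record[3].rstrip("\r\n"))


def drop_bad_pairs(in1: str, in2: str, out1: str, out2: str) -> int:
    """Copy two paired FASTQ files, skipping pairs where either mate is malformed.

    Both mates are dropped together so the output files stay in sync.
    Returns the number of pairs dropped.
    """
    dropped = 0
    with open(in1) as r1, open(in2) as r2, \
            open(out1, "w") as w1, open(out2, "w") as w2:
        while True:
            rec1 = [r1.readline() for _ in range(4)]
            rec2 = [r2.readline() for _ in range(4)]
            if not rec1[0] or not rec2[0]:
                break  # one of the files has ended
            if _lengths_match(rec1) and _lengths_match(rec2):
                w1.writelines(rec1)
                w2.writelines(rec2)
            else:
                dropped += 1
    return dropped
```

For example, `record_at_line("SRR960028_1.fastq", 745672)` returns the four lines of the suspect record for inspection. Dropping pairs only treats the symptom: if many records turn out to be affected, the file may have been damaged by an earlier step, and regenerating it is likely the safer route.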