5.0 years ago
Pals

Hi,

After removing rRNAs with sortmerna, 2 of 42 files failed FastQC.

I checked the fastq file with fastq_stats, which reports that the file is malformed:

Malformed fastq record (length mismatch) at line 69042283
Segmentation fault (core dumped)

Can anybody please tell me what is wrong with this file?

zcat 4590_END_sortmerna_R2.fastq.gz | sed -n '69042275,69042295p'

+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIFIIFIIIIIIFIIIIIFIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFBFFFFFF
@HWI-ST1072:210:C5LGGACXX:4:1309:19821:97176 2:N:0:AGTTCC
CTGGAATCCCTCGATCACTGAACAGGAAGGAAACCTGATGCAGAGACTCAGGACGCAGGCTCCAGAAGTCCCAGACCATGTCCGGATCCTTCAGGTGTGTT
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIFFIIIIIIIBFIIIIFFFIIIIIFFFFFFFFFBBBBBFFBFBFFFFFFBFFFFFFFFBBBBBBBBB
@HWI-ST1072:210:C5LGGACXX:4:1309:19994:97188 2:N:0:AGTTCC
CTCATGTACCGGGGGTTACTGGATGCTCTGCTGCAGACGGCTCGGACAGAGGGCATTTTTGGCATGTACAAGGGAATAGGTGCCTCCTACTTCCGCCTTGG
+
BBBFFFFFFFFFFIIFFIIIIIIIIIIIIIIIIIIIIIFFFFFFFBBBFB<BB7<BFFFFFFBBBBBFFFFFFFBBFBFBBBBFFFFFFFFFFFFFFFFBB
@HWI-ST1072:210:C5LGGACXX:4:1309:19837:97199 2:N:0:AGTTCC
CCTACTTCTCTTTTAATTAAAATGTAGCCACTATATAACAGATATGGTTTACGATCTGTTTCAAAAACATTTACCTAAGTGACTCTGTAAAAACCTCAGCA
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIIIFIIBFFIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFFB
@HWI-ST1072:210:C5LGGACXX:4:1309:19815:97203 2:N:0:AGTTCC
GGGCCATTCGGGGTTGGCTGGGGGTGAGTGGCAGAGCAAGCTTTCAGCACAGGGCGGCATGTGCCAAAAGCACGGTGGCCTGAGGCTGCCCCTCGTGAGCA
+
BBBFFFFFFFFFFFIIIIIIIIFIFFFIFIFIIIIIIFFFFFBFFFFFFF<BBBBBFFFFBBBBFFFFFFBFFFFFFFBFFFFFFF0BBFFFBFFBF<BB0
@HWI-ST1072:210:C5LGGACXX:4:1309:19975:97229 2:N:0:AGTTCC
AGAGACTGCATGGCTTTCATGACGTGAAGATTGGGCACATTCTTGTCTGCCAGCTCCGGGTGCTTGGGCATGTGGACATCCTTCTTGGCCACCATCACCCC
+


Thanks a lot!

I would like to report the same problem. I also got a corrupted fastq file from a SortMeRNA run, and it is also missing lines with quality scores. If someone has sorted out this issue, I would be grateful if you could share your solution. Thanks.

5.0 years ago
GenoMax

Looking purely at the lines in question, nothing seems to be wrong, but FastQC obviously reports an error. This appears to be a paired-end dataset, so you could use repair.sh from the BBMap suite to see if you can identify the malformed read pairs and remove them.

To verify proper pairing: reformat.sh in1=r1.fq.gz in2=r2.fq.gz vpair
To repair: repair.sh in1=r1.fq.gz in2=r2.fq.gz out1=fixed1.fq.gz out2=fixed2.fq.gz outsingle=singletons.fq.gz
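
For anyone wondering what a pairing check boils down to, here is a minimal Python sketch. It is not how reformat.sh or repair.sh work internally; it only compares read names record by record. It assumes strict 4-line fastq and Casava 1.8 style headers like the ones in this thread (name, a space, then `1:N:0:...`). The function names `read_names` and `first_pairing_mismatch` are made up for this example.

```python
import io
from itertools import zip_longest


def read_names(handle):
    """Yield the read name of every record in a strict 4-line FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 0:
            # "@NAME 1:N:0:IDX" -> "NAME" (drop '@' and the mate/index field)
            fields = line[1:].split()
            yield fields[0] if fields else ""


def first_pairing_mismatch(r1, r2):
    """Return (record_index, name_r1, name_r2) for the first disagreement.

    record_index is 1-based. A name of None means that file ran out of
    records first. Returns None when all names agree.
    """
    pairs = zip_longest(read_names(r1), read_names(r2))
    for idx, (a, b) in enumerate(pairs, start=1):
        if a != b:
            return idx, a, b
    return None


# Tiny in-memory demo: R2 has lost the record for read "b".
r1_demo = "@a 1:N:0:X\nAC\n+\nII\n@b 1:N:0:X\nAC\n+\nII\n@c 1:N:0:X\nAC\n+\nII\n"
r2_demo = "@a 2:N:0:X\nAC\n+\nII\n@c 2:N:0:X\nAC\n+\nII\n"
print(first_pairing_mismatch(io.StringIO(r1_demo), io.StringIO(r2_demo)))
```

For real files you would pass handles from `gzip.open(path, "rt")` instead of the in-memory strings. If a record is itself broken, the names go out of frame from that point on, so the first mismatch still points near the damage.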

After the following error, the output file size is 3.1 compared to 3.3 (original). It seems the sortmerna step created the problem in the reads.

Started output stream.
java.lang.AssertionError:
Error in 4563_END_sortmerna_R2.fastq.gz, line 194876135, with these 4 lines:
BBBFFFFFFFFFFIIIIIIIIIFIIIIIIIIIIIIIFIIFIIIIIIIIIIIIIIIIIIIIIFFFFFFFFF7<BFF<FFFFFFFFFFFFFFFFFFFFFFFFF
@HWI-ST1072:210:C5LGGACXX:2:2305:2793:35345 2:N:0:CAGATC
CGGGGATTCTCCATTCCGGATGTGTTTCGTGGAGTGCATCGATACCTGCGCAATGCCTATGCTAGGGAAGAGTTTGCTTCCACCTGTCCAGATGATGAGGA
+

at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)

Set cris2Active=false
java.lang.AssertionError:
Error in 4563_END_sortmerna_R1.fastq.gz, line 194876135, with these 4 lines:

@HWI-ST1072:210:C5LGGACXX:2:2305:2793:35345 1:N:0:CAGATC
ATTTTGGAGTGTGTCCGTTGGGTAGTATGTGGAAACCACCCAGGGCCTTTGTGGAGAAAATGGAGGGGGGTGCCGGGGGGCCCTAGGAAGGGCCTTATTTG
+

at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)

Set cris1Active=false
at jgi.SplitPairsAndSingles.process3_repair(SplitPairsAndSingles.java:562)
at jgi.SplitPairsAndSingles.process2(SplitPairsAndSingles.java:310)
at jgi.SplitPairsAndSingles.process(SplitPairsAndSingles.java:236)
at jgi.SplitPairsAndSingles.main(SplitPairsAndSingles.java:45)

What step is this error from? Before running repair.sh or while running repair.sh?

I thought you were able to repair the files before.

While running repair.sh.

Did you check the files with vpair first? Is the error from that step or actual repair?

I am going to tag Brian Bushnell. He should be along in 2-3 hours. It seems to me that some fastq records themselves may be broken (in addition to the pairing).

Here is the error from vpair step:

/homeappl/home/kipokh/appl_taito/mytools/bbmap/reformat.sh in1=4563_END_sortmerna_R1.fastq.gz in2=4563_END_sortmerna_R2.fastq.gz vpair

java -ea -Xmx200m -cp /homeappl/home/kipokh/appl_taito/mytools/bbmap/current/ jgi.ReformatReads in1=4563_END_sortmerna_R1.fastq.gz in2=4563_END_sortmerna_R2.fastq.gz vpair

No output stream specified.  To write to stdout, please specify 'out=stdout.fq' or similar.
Set INTERLEAVED to false
Input is being processed as paired
java.lang.AssertionError:
Error in 4563_END_sortmerna_R1.fastq.gz, line 194876135, with these 4 lines:

@HWI-ST1072:210:C5LGGACXX:2:2305:2793:35345 1:N:0:CAGATC
ATTTTGGAGTGTGTCCGTTGGGTAGTATGTGGAAACCACCCAGGGCCTTTGTGGAGAAAATGGAGGGGGGTGCCGGGGGGCCCTAGGAAGGGCCTTATTTG
+

at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
java.lang.AssertionError:
Error in 4563_END_sortmerna_R2.fastq.gz, line 194876135, with these 4 lines:
BBBFFFFFFFFFFIIIIIIIIIFIIIIIIIIIIIIIFIIFIIIIIIIIIIIIIIIIIIIIIFFFFFFFFF7<BFF<FFFFFFFFFFFFFFFFFFFFFFFFF
@HWI-ST1072:210:C5LGGACXX:2:2305:2793:35345 2:N:0:CAGATC
CGGGGATTCTCCATTCCGGATGTGTTTCGTGGAGTGCATCGATACCTGCGCAATGCCTATGCTAGGGAAGAGTTTGCTTCCACCTGTCCAGATGATGAGGA
+

at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
There appear to be different numbers of reads in the paired input files.
The pairing may have been corrupted by an upstream process.  It may be fixable by running repair.sh.

It appears that you have records with bases but no quality scores. Those are not valid fastq records. It looks like some of the records may have 2 newlines after the "+" instead of 1.
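
To locate such records yourself, something along these lines might help. It is a rough Python sketch, not part of BBTools or FastQC. It assumes strict 4-line fastq (no wrapped sequences) and only flags the first few suspicious records. The name `find_bad_records` is made up for this example. Once the first bad record throws the 4-line frame off, everything after it will usually be flagged too, so the first hit is the informative one.

```python
import io


def find_bad_records(handle, max_reports=10):
    """Return (line_number, reason) tuples for suspicious 4-line records.

    Line numbers are 1-based and point at the first line of each
    4-line block. Stops after max_reports problems.
    """
    problems = []
    line_no = 0
    while len(problems) < max_reports:
        block = [handle.readline() for _ in range(4)]
        if not block[0]:  # nothing left to read: end of file
            break
        header, seq, plus, qual = (s.rstrip("\r\n") for s in block)
        start = line_no + 1
        line_no += 4
        if not header.startswith("@"):
            problems.append((start, "header does not start with '@'"))
        elif not plus.startswith("+"):
            problems.append((start, "third line does not start with '+'"))
        elif not qual:
            problems.append((start, "empty quality line"))
        elif len(seq) != len(qual):
            problems.append((start, "sequence/quality length mismatch"))
    return problems


# In-memory demo: r2 has an empty quality line, r3 has a length mismatch.
demo = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n\n@r3\nACG\n+\nIIII\n"
print(find_bad_records(io.StringIO(demo)))
```

For a gzipped file, open it with `gzip.open(path, "rt")` and pass the handle in place of the `io.StringIO` object.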

Probably, sortmerna is doing something inappropriate that results in corrupted output.

@Brian: Is there something in BBMap to check and fix broken fastq records in files?

Well... I tried to add some capability to do so in some limited ways, but it doesn't work very well. The problem is that there are an infinite number of ways a file could be broken, so the software will always end up guessing. So it's impossible to write a program that can correctly fix any broken fastq file, or even that can ensure that the corrections it did were correct. You can only write a program that will fix very specific problems, like a correctly-formatted fastq file with interleaved reads out of order.
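
To illustrate what "fixing a very specific problem" could look like, here is a small Python sketch that handles exactly one case: it keeps records whose four lines look well formed and throws away everything else, moving forward one line at a time until the next plausible record. This is a guess-based heuristic of the kind described above, not a BBTools feature, and the function names are made up for this example. It reads the whole input into memory, so it is only meant for small excerpts; a real tool would stream.

```python
import io


def looks_like_record(lines, i):
    """True if lines[i:i+4] form a well-formed 4-line FASTQ record."""
    if i + 3 >= len(lines):
        return False
    header, seq, plus, qual = lines[i:i + 4]
    return (header.startswith("@") and plus.startswith("+")
            and len(seq) > 0 and len(seq) == len(qual))


def drop_broken_records(text):
    """Return (cleaned_text, dropped_line_count).

    Well-formed records are copied through unchanged. Any line that does
    not start a well-formed record is dropped (heuristic resync).
    """
    lines = text.splitlines()
    kept, dropped = [], 0
    i = 0
    while i < len(lines):
        if looks_like_record(lines, i):
            kept.extend(lines[i:i + 4])
            i += 4
        else:
            dropped += 1
            i += 1
    cleaned = "\n".join(kept) + ("\n" if kept else "")
    return cleaned, dropped


# In-memory demo: r2 has an empty quality line and is removed.
demo = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n\n@r3\nACGT\n+\nIIII\n"
cleaned, dropped = drop_broken_records(demo)
print(cleaned, dropped)
```

Note that a record with an extra blank line before an otherwise intact quality line is also dropped rather than repaired, because the sketch cannot know which interpretation is right. Dropping reads from one file also leaves the pair files out of sync, so the pairing would need to be restored afterwards (for example with repair.sh).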

Thanks a lot, Brian and genomax2. I also asked Jenya Kopylov (author of sortmerna) about this problem, and I have decided to skip the sortmerna step for now. By the way, I found BBTools really useful :)

Are you actually able to fix the broken pairing by running repair.sh (the second command)?

One additional thing to test would be to validate the individual files to see if the fastq records themselves are broken: http://genome.sph.umich.edu/wiki/FastQValidator