Question

Failure in generating fastq from human fasta

4.6 years ago

marongiu.luigi ▴ 710

Hello, I have created some mutated human sequences by modifying the GRCh38 fasta files. I then concatenated the files and generated the fastq files with:

$ art -1 .../art/Illumina_profiles/custom/HiSeq2k_0m1.txt -2 ...art/Illumina_profiles/custom/HiSeq2k_0m2.txt -p -f 100 -l 140 -m 300 -s 10  -i humanMut.fa -o Mut

This command worked for smaller genomes with coverage of 30-50; HiSeq2k_0m1|2.txt are the quality profiles. When I check the quality though:

$ fastqc Mut_1.fq
...
Approx 95% complete for sismi2N_1.fq
Failed to process file sismi2N_1.fq
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Ran out of data in the middle of a fastq entry.  Your file is probably truncated
    at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:179)
    at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:125)
    at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:76)
    at java.lang.Thread.run(Thread.java:745)

What could be the problem here? Why did the protocol work before? Is the file too large? Is the coverage too high? Thanks

genome next-gen fastqc art • 1.9k views

4.6 years ago by marongiu.luigi ▴ 710

The error is quite clear, isn't it?

Ran out of data in the middle of a fastq entry.  Your file is probably truncated

Check the fq file for truncated sequences.

4.6 years ago by ATpoint 82k

That is exactly what I don't understand: how can the fastq be truncated? I followed the same strategy, that is, converting a fasta into fastq. I did not touch the fastq, so how did it get truncated? And how can I check which sequences are truncated?

4.6 years ago by marongiu.luigi ▴ 710

I cannot tell you why this is; it could be a memory shortage, a premature kill of the job, a bug in the code...

I would start by validating the fastq files, e.g. with https://genome.sph.umich.edu/wiki/FastQValidator or a simple awk command that checks that SEQ and QUAL have the same length for all entries. As fastqc complained at > 95% complete, tail your.fastq would be a good start, as it could be the last entry that is odd. repair.sh from the BBMap suite might be worth looking at as well. Also check whether running fastqc with maximum verbosity helps narrow down the problem.
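A minimal sketch of such a check, written in Python rather than awk. It assumes plain 4-line FASTQ records (no wrapped sequence lines), and the function name first_fastq_problem is made up for illustration. It stops at the first record that is incomplete, has a bad header or separator, or has SEQ and QUAL of different lengths, and reports the line number:

```python
import io


def first_fastq_problem(handle):
    """Return (line_number, message) for the first bad record, or None if all pass.

    Assumption: every record is exactly 4 lines (header, SEQ, '+', QUAL).
    """
    line_no = 0  # lines consumed before the current record
    while True:
        lines = [handle.readline() for _ in range(4)]
        if lines[0] == "":
            return None  # clean end of file on a record boundary
        # readline() returns "" only at end of file, so this counts real lines
        present = sum(1 for line in lines if line != "")
        header, seq, plus, qual = (line.rstrip("\r\n") for line in lines)
        if not header.startswith("@"):
            return line_no + 1, "header does not start with '@'"
        if present < 4:
            return line_no + present, "file ends in the middle of a record"
        if not plus.startswith("+"):
            return line_no + 3, "separator line does not start with '+'"
        if len(seq) != len(qual):
            return line_no + 4, f"SEQ length {len(seq)} != QUAL length {len(qual)}"
        line_no += 4


# Toy input: the second record is cut off inside its sequence line.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC")
print(first_fastq_problem(demo))
```

For a real file, pass an open text handle instead of the toy input, e.g. `with open("your.fastq") as fq: print(first_fastq_problem(fq))`.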

4.6 years ago by ATpoint 82k

Thanks, I'll try that...

4.6 years ago by marongiu.luigi ▴ 710

Hi, I tried FastqValidator but it only told me the obvious:

$ fastQValidator --file ~/Downloads/sismi2Na_1.fq 
ERROR on Line 559644077: Incomplete Sequence.

Finished processing /home/gigiux/Downloads/sismi2Na_1.fq with 559644077 lines containing 139911020 sequences.
There were a total of 1 errors.
Returning: 1 : FASTQ_INVALID

Is there a way to pick out the entry that gave the error and see what the error is? Running the tool's test file I got, as expected:

$ fastQValidator --file ~/src/fastQValidator/test/testFile.txt.gz 
ERROR on Line 2: Invalid character ('.') in base sequence.
ERROR on Line 2: Invalid character ('0') in base sequence.
ERROR on Line 2: Invalid character ('1') in base sequence.
ERROR on Line 2: Invalid character ('2') in base sequence.
ERROR on Line 2: Invalid character ('3') in base sequence.
ERROR on Line 11: Invalid character ('1') in base sequence.
ERROR on Line 11: Invalid character ('2') in base sequence.
ERROR on Line 11: Invalid character ('3') in base sequence.
ERROR on Line 11: Invalid character ('.') in base sequence.
ERROR on Line 11: Invalid character ('0') in base sequence.
ERROR on Line 11: Invalid character ('3') in base sequence.
ERROR on Line 11: Invalid character ('2') in base sequence.
ERROR on Line 11: Invalid character ('1') in base sequence.
ERROR on Line 11: Invalid character ('.') in base sequence.
ERROR on Line 11: Invalid character ('0') in base sequence.
ERROR on Line 11: Invalid character ('1') in base sequence.
ERROR on Line 11: Invalid character ('1') in base sequence.
ERROR on Line 25: The sequence identifier line was too short.
ERROR on Line 29: First line of a sequence does not begin with @
ERROR on Line 33: No Sequence Identifier specified before the comment.
Finished processing /home/gigiux/src/fastQValidator/test/testFile.txt.gz with 95 lines containing 21 sequences.
There were a total of 48 errors.
Returning: 1 : FASTQ_INVALID

So it is worrisome that I did not get anything from my file...
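One way to read the reported line number, as a hedged sketch: assuming plain 4-line records, a 1-based line number maps to a record number and a field within that record. The helper names locate and last_lines are made up for illustration. Line 559644077 comes out as the header of record 139911020, and since the validator also reports 559644077 lines in total, that header would be the last line of the file, which is consistent with the output having been cut off right after it.

```python
import io
from collections import deque

FIELDS = ("header", "sequence", "separator", "quality")


def locate(line_number):
    """Map a 1-based line number to (1-based record number, field name).

    Assumption: every record is exactly 4 lines.
    """
    record, offset = divmod(line_number - 1, 4)
    return record + 1, FIELDS[offset]


def last_lines(handle, n=8):
    """Return the last n lines of an open text handle (reads the whole stream)."""
    return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


print(locate(559644077))  # (139911020, 'header')

# Toy input standing in for the real file: one full record plus a lone header.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\n")
print(last_lines(demo, n=2))
```

On the real file, `last_lines(open(path), n=8)` would show the final complete record followed by the incomplete one, much like tail does.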

4.6 years ago by marongiu.luigi ▴ 710