Question: Failure in generating fastq from human fasta
15 days ago, marongiu.luigi (Germany, Mannheim, UMM) wrote:

Hello, I have created some mutated human sequences by modifying the GRCh38 fasta files. I then concatenated the files and generated the fastq files with

$ art -1 .../art/Illumina_profiles/custom/HiSeq2k_0m1.txt -2 ...art/Illumina_profiles/custom/HiSeq2k_0m2.txt -p -f 100 -l 140 -m 300 -s 10  -i humanMut.fa -o Mut

This command worked for smaller genomes with coverage of 30-50; HiSeq2k_0m1|2.txt are the quality profiles. When I check the quality though:

$ fastqc Mut_1.fq
...
Approx 95% complete for sismi2N_1.fq
Failed to process file sismi2N_1.fq
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Ran out of data in the middle of a fastq entry.  Your file is probably truncated
    at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:179)
    at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:125)
    at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:76)
    at java.lang.Thread.run(Thread.java:745)

What could be the problem here? Why did the protocol work before? Is the file too large? Is the coverage too high? Thanks
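
On the question of whether the coverage is too high, a back-of-the-envelope estimate may help. The sketch below assumes that the total number of simulated reads is roughly coverage × genome length / read length, split evenly over the two paired files. The function name and the 3.1 Gbp genome size are hypothetical, since the actual size of humanMut.fa is not given.

```bash
# Rough size estimate for a paired-end simulation (a sketch, not ART's own accounting).
# Assumption: total reads ~ coverage * genome_length / read_length,
# split evenly between the _1 and _2 files.
estimate_reads_per_file() {
    genome_bp=$1
    coverage=$2
    read_len=$3
    echo $(( genome_bp * coverage / read_len / 2 ))
}

# Hypothetical input: a 3.1 Gbp genome with the -f 100 -l 140 settings used above.
estimate_reads_per_file 3100000000 100 140
```

Comparing such an estimate with the number of records actually present in the file (the output of `wc -l` divided by 4) gives a quick hint as to whether the run stopped early.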

Tags: fastqc, next-gen, art, genome

The error is quite clear, isn't it?

Ran out of data in the middle of a fastq entry.  Your file is probably truncated

Check the fq file for truncated sequences.

written 15 days ago by ATpoint

That is exactly what I don't understand: how can the fastq be truncated? I followed the same strategy, that is, converting a fasta into fastq, and I did not touch the fastq afterwards. How did the file get truncated? And how can I check which sequences are truncated?

written 15 days ago by marongiu.luigi

I cannot tell you why this is. It could be a memory shortage, a premature kill of the job, a bug in the code...

I would start by validating the fastq files, e.g. with https://genome.sph.umich.edu/wiki/FastQValidator or a simple awk command that checks whether SEQ and QUAL have the same length for every entry. As fastqc complained at > 95% complete, tail your.fastq would be a good start, as it could be the last entry that is odd. repair.sh from the BBMap suite might be worth looking at as well. Also check whether running fastqc with maximum verbosity helps narrow down the problem.
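
The awk check mentioned above could look roughly like this. It is only a sketch and assumes plain four-line FASTQ records (no line-wrapped sequences); the function name check_fastq is made up for illustration.

```bash
# Report records whose sequence and quality strings differ in length,
# and flag a file whose line count is not a multiple of four.
# Assumes every record is exactly four lines: header, SEQ, '+', QUAL.
check_fastq() {
    awk '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 0 && length($0) != seqlen {
            printf "record %d (%s): SEQ=%d QUAL=%d\n", NR / 4, header, seqlen, length($0)
            bad++
        }
        END {
            if (NR % 4 != 0) {
                printf "incomplete final record: %d trailing line(s) after record %d\n", NR % 4, int(NR / 4)
                bad++
            }
            # Exit status 1 if any problem was found, 0 otherwise.
            exit (bad > 0)
        }
    ' "$1"
}

# Usage: check_fastq your.fastq
```

It prints one line per problem found and returns a non-zero exit status if there was any, so it can also be used in a script.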

modified 15 days ago • written 15 days ago by ATpoint

Thanks, I'll try that...

written 15 days ago by marongiu.luigi

Hi, I tried FastqValidator but it only told me the obvious:

$ fastQValidator --file ~/Downloads/sismi2Na_1.fq 
ERROR on Line 559644077: Incomplete Sequence.

Finished processing /home/gigiux/Downloads/sismi2Na_1.fq with 559644077 lines containing 139911020 sequences.
There were a total of 1 errors.
Returning: 1 : FASTQ_INVALID

Is there a way to pick out the entry that gave the error and to see what the error actually is? Running the validator on its own test file I got, as expected:

$ fastQValidator --file ~/src/fastQValidator/test/testFile.txt.gz 
ERROR on Line 2: Invalid character ('.') in base sequence.
ERROR on Line 2: Invalid character ('0') in base sequence.
ERROR on Line 2: Invalid character ('1') in base sequence.
ERROR on Line 2: Invalid character ('2') in base sequence.
ERROR on Line 2: Invalid character ('3') in base sequence.
ERROR on Line 11: Invalid character ('1') in base sequence.
ERROR on Line 11: Invalid character ('2') in base sequence.
ERROR on Line 11: Invalid character ('3') in base sequence.
ERROR on Line 11: Invalid character ('.') in base sequence.
ERROR on Line 11: Invalid character ('0') in base sequence.
ERROR on Line 11: Invalid character ('3') in base sequence.
ERROR on Line 11: Invalid character ('2') in base sequence.
ERROR on Line 11: Invalid character ('1') in base sequence.
ERROR on Line 11: Invalid character ('.') in base sequence.
ERROR on Line 11: Invalid character ('0') in base sequence.
ERROR on Line 11: Invalid character ('1') in base sequence.
ERROR on Line 11: Invalid character ('1') in base sequence.
ERROR on Line 25: The sequence identifier line was too short.
ERROR on Line 29: First line of a sequence does not begin with @
ERROR on Line 33: No Sequence Identifier specified before the comment.
Finished processing /home/gigiux/src/fastQValidator/test/testFile.txt.gz with 95 lines containing 21 sequences.
There were a total of 48 errors.
Returning: 1 : FASTQ_INVALID

So it is worrisome that I did not get anything more specific for my own file...
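
For picking out the entry behind a reported line number, a little arithmetic plus sed may be enough. The sketch below assumes four-line records; both function names are made up for illustration.

```bash
# Translate a 1-based FASTQ line number into a record number and the
# position (1-4) of that line within its record. Assumes four-line records.
locate_fastq_line() {
    line=$1
    record=$(( (line - 1) / 4 + 1 ))
    position=$(( (line - 1) % 4 + 1 ))
    echo "record $record, line $position of 4"
}

# Print lines START..END of a file and stop reading afterwards.
print_fastq_lines() {
    sed -n "${2},${3}p;${3}q" "$1"
}

# The line number reported by fastQValidator above:
locate_fastq_line 559644077
# To look at the previous record plus the offending line:
# print_fastq_lines ~/Downloads/sismi2Na_1.fq 559644073 559644077
```

For line 559644077 this gives record 139911020, line 1 of 4. Since the validator also reports 559644077 lines in total, the offending line appears to be the very last one, i.e. the file seems to end right after the first line of its final record.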

modified 6 days ago • written 8 days ago by marongiu.luigi