6.3 years ago
mmitra ▴ 30

Hi all,

I ran tophat on my fastq file and I got the following error:

[2015-09-02 11:23:58] Beginning TopHat run (v2.1.0)
-----------------------------------------------
[2015-09-02 11:23:58] Checking for Bowtie
Bowtie version:     2.2.5.0
[2015-09-02 11:24:00] Checking for Bowtie index files (genome)..
[2015-09-02 11:24:00] Checking for reference FASTA file
[2015-09-02 11:24:06] Reading known junctions from GTF file
[FAILED]


Any suggestions for this? Thanks so much!

RNA-Seq fastq tophat • 2.4k views

Run grep -A4 @HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0: Input.fastq and copy what you get here.

Thanks for your help. I did the grep as you suggested and got the following:

grep -A4 @HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0: P_R3R4_filtered75.fastq
grep: 1:N:0:: No such file or directory
P_R3R4_filtered75.fastq:@HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0:
P_R3R4_filtered75.fastq-CGCACTCCTGCTCGGACAGCTCCAGGTACGTCTGGTGGTCAATCAGGCCCTTGCGGTA
P_R3R4_filtered75.fastq-@HWI-ST387:212:D1AA6ACXX:4:1102:2629:2187 1:N:0:
P_R3R4_filtered75.fastq-CAACACCACAGCCATTGCTGAGGCCTGGGCTCGCCTGGACCACAAGTTTGACCTGATGTATGCCAAACGTGCCTT
P_R3R4_filtered75.fastq-+


All the reads in this fastq file are of length 75. I created this file for running rMATS, using the awk command from the thread "Filtering Fastq Sequences Based On Lengths" to extract all reads of length 75.

I also ran tophat on the original fastq file (before extraction), and that ran fine.

Can you paste a cleaner version of the output? I doubt you would see something like grep: 1:N:0:: No such file or directory when you run grep. Also, why are we seeing a P_R3R4_filtered75.fastq: or P_R3R4_filtered75.fastq- prefix in front of every line? You know what the fastq format looks like, right? The awk solution that you used assumes that each fastq record is spread over exactly four lines. That may be the problem, but this is just my guess. I can't say much more until I see cleaner output. Try:

grep -A4 "^@HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0" P_R3R4_filtered75.fastq


and paste the output again.

Sorry, I forgot to put the search item in quotes. I ran the following command:

grep -A4 '@HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0:' P_R3R4_filtered75.fastq


I got the following:

@HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0:
CGCACTCCTGCTCGGACAGCTCCAGGTACGTCTGGTGGTCAATCAGGCCCTTGCGGTA
@HWI-ST387:212:D1AA6ACXX:4:1102:2629:2187 1:N:0:
CAACACCACAGCCATTGCTGAGGCCTGGGCTCGCCTGGACCACAAGTTTGACCTGATGTATGCCAAACGTGCCTT
+

The problem is that the read @HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0: has no quality information. A fastq entry takes 4 lines: the first line contains the header, the second line contains the sequence, the third line is usually just the + sign, and the fourth line contains the quality string. The entry that is throwing the error is missing its third and fourth lines. If you don't know why this happened, delete this entry, and make sure you delete all other such entries as well.
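To make the four-line rule concrete, here is a minimal Python sketch that checks one group of four lines and lists what is wrong with it. The function name record_problems and the three checks are my own choices for illustration, not part of tophat or any fastq library, and the length check assumes the sequence and quality lines are not wrapped.

```python
def record_problems(header, seq, plus, qual):
    """Return a list of problems found in one four-line FASTQ group.

    Hypothetical helper for illustration: an empty list means the group
    looks like a well-formed record.
    """
    problems = []
    if not header.startswith("@"):
        problems.append("line 1 does not start with '@'")
    if not plus.startswith("+"):
        problems.append("line 3 does not start with '+'")
    if len(seq) != len(qual):
        problems.append("sequence and quality lengths differ")
    return problems


# The four lines starting at the failing header in the grep output above.
group = [
    "@HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0:",
    "CGCACTCCTGCTCGGACAGCTCCAGGTACGTCTGGTGGTCAATCAGGCCCTTGCGGTA",
    "@HWI-ST387:212:D1AA6ACXX:4:1102:2629:2187 1:N:0:",
    "CAACACCACAGCCATTGCTGAGGCCTGGGCTCGCCTGGACCACAAGTTTGACCTGATGTATGCCAAACGTGCCTT",
]
for problem in record_problems(*group):
    print(problem)
```

Here the third line is the header of the next read rather than a + line, which is exactly what the check on line 3 flags.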

Thanks for the suggestion. I am wondering if there is a way to globally check whether the fastq entries of all the reads are fine and to remove the corrupt entries. I have several fastq files and that would be very useful.

6.3 years ago
cat Input.fastq | paste - - - - | awk ' $1 ~ /^@HWI/ &&$3 ~/^+/' |  sed 's/\t/\n/g'  > QC_filtered.fastq

This code keeps only those groups of four lines whose first line starts with "@HWI" (the matching string may differ between files, so change it accordingly) and whose third line starts with "+"; every other group is removed. It works on the assumption that each fastq entry spans exactly 4 lines.

Thanks for the code. I tried it, but it did not work: it gave empty output. I also confirmed that "@HWI" is present in my input file. I am assuming it worked for you. Any suggestions?

I am not sure why it didn't give you any output, but I can see a potential bug in my code. As soon as it meets the first malformed fastq entry, it will start failing for every entry after it, because it reads 4 lines at a time and the first bad entry has already shifted that framing. This is a good piece of code (https://scipher.wordpress.com/2010/05/06/simple-python-fastq-parser/) that will help you find the problem, but you may have to delete the bad fastq entries manually.
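One likely reason for the empty output: awk splits fields on any whitespace by default, and these headers contain a space (before 1:N:0:), so after paste the third field of a good record is the sequence rather than the + line, and the $3 test never matches. Separately, the framing problem described above can be avoided by re-synchronising after a bad entry instead of always stepping 4 lines. Below is a rough Python sketch of that idea, not a tested tool. The names clean_fastq and clean_file are hypothetical, and it assumes unwrapped records (one sequence line and one quality line each).

```python
from collections import deque
from itertools import islice


def clean_fastq(lines):
    """Yield (header, seq, plus, qual) for each well-formed record.

    A sliding window of four lines is checked. If it does not look like a
    record, only the first line is dropped and the window slides by one,
    so a single truncated entry cannot shift every record after it.
    """
    stream = (line.rstrip("\r\n") for line in lines)
    window = deque(islice(stream, 4))
    while len(window) == 4:
        header, seq, plus, qual = window
        ok = (header.startswith("@") and plus.startswith("+")
              and len(seq) > 0 and len(seq) == len(qual))
        if ok:
            yield header, seq, plus, qual
            window = deque(islice(stream, 4))  # jump to the next record
        else:
            window.popleft()                   # discard one line, re-sync
            window.extend(islice(stream, 1))


def clean_file(src, dst):
    """Copy the well-formed records from src to dst; return how many were kept."""
    kept = 0
    with open(src) as fin, open(dst, "w") as fout:
        for record in clean_fastq(fin):
            fout.write("\n".join(record) + "\n")
            kept += 1
    return kept


# Small in-memory demo: the second entry is truncated like the one above.
demo = [
    "@r1 1:N:0:", "ACGT", "+", "IIII",
    "@bad 1:N:0:", "CGCA",
    "@r2 1:N:0:", "GGCC", "+", "@III",
    "@r3 1:N:0:", "TTAA", "+", "IIII",
]
for header, seq, plus, qual in clean_fastq(demo):
    print(header)
```

For a real file, something like clean_file("P_R3R4_filtered75.fastq", "QC_filtered.fastq") would write the surviving records. Because the check looks at the + line and at matching lengths rather than at the header alone, a quality line that happens to begin with @ is less likely to be mistaken for a header. The same loop could also be used for the original length filtering by writing a record only when len(seq) == 75.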