Question: TopHat Error Qual length differs from seq length
13 months ago, williamsbrian5064 wrote:

I am getting this error when I try to run TopHat on some sequencing data. I was wondering if anyone had any solutions to the problem?

    ./tophat -p 1 -G dmel-all-r6.18.gtf -o test.bam dmel_genome_6.18  read_1.fastq read_2.fastq



    [2017-11-14 14:59:57] Beginning TopHat run (v2.1.0)
    -----------------------------------------------
    [2017-11-14 14:59:57] Checking for Bowtie
      Bowtie 2 not found, checking for older version..
      Bowtie version:   1.1.2.0
    [2017-11-14 14:59:57] Checking for Bowtie index files (genome)..
    [2017-11-14 14:59:57] Checking for reference FASTA file
    Warning: Could not find FASTA file dmel_genome_6.18.fa
    [2017-11-14 14:59:57] Reconstituting reference FASTA file from Bowtie index
      Executing: /Users/kmmeurs/Desktop/Programs/tophat-2.1.0.OSX_x86_64/bowtie-inspect dmel_genome_6.18 > test.bam/tmp/dmel_genome_6.18.fa
    [2017-11-14 15:00:07] Generating SAM header for dmel_genome_6.18
    [2017-11-14 15:00:07] Reading known junctions from GTF file
    [2017-11-14 15:00:12] Preparing reads
    [FAILED]
    Error running 'prep_reads'
    Error: qual length (95) differs from seq length (125) for fastq record !

Here is the header as well for one of the fastq files:

@HISEQ:249:C9MM3ANXX:7:1101:1733:2241 1:N:0:CTATAC
CGACAATCTTGCATGGCCGCGACTTCAGCNNNNNNNNNNNGTTTTTGCGCAATGCCGAACATTGCATGGGATAGGTCGTCGATGCGCCGGAATCCGTGGTCTCGAAATGATCGTCCAACTCAGCC
+
A=3BBGGGGGGGGGGGGGGGGDGGGGGGF###########==<EFGGEGG@GGGEDGGGGGGGCFCGGGD0ECBFGDGGGGGFGGGBGGG@AGG@CGGDEEB@D/6.C8EDEGGGD<EGGGGGGG
@HISEQ:249:C9MM3ANXX:7:1101:1803:2233 1:N:0:CTATAC
CTTAAAATAATTAATGTGTGTATTNNNNNNNNNNNNNNNNNNCACACACTAGAAATATACTTTGCCATCCATTAGGTGAAGGCCTAATCCAAGGCCTCCCTACCATGGATTGGCACAGATAAATT
+
CCCCCGGGGGGGGGGGGGGGGGGG##################===FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGFGEFGGGGGGGGGGEGGGGGGGGGGGGGGGGFGGGGGGGGGGGDGGG
@HISEQ:249:C9MM3ANXX:7:1101:1772:2234 1:N:0:CTATAC
TTCTCCTCCTCGGAGTCGCTGTAAANNNNNNNNNNNNNNNNTGACGGCTTTTGTTTACAATCCACCTTCTTTTTAATTTCTTCCTCATTGTAACCCGGAGGTGGAACGGGGGTAAGAGAGCGCCT

(The records above are the output of running head on A31P_MYBPC3_Female_1_week_CTATAC_L007_R1_C9MM3ANXX.fastq.)

I saw another post similar to this but I couldn't figure out what they did to fix the problem (https://www.biostars.org/p/110412/). Any help would be fantastic! Thanks!!

modified 3 months ago by h.mon • written 13 months ago by williamsbrian5064

The error indicates that something is wrong with your fastq file.

You should know that the old 'Tuxedo' pipeline of TopHat and Cufflinks is no longer the advisable tool for RNA-seq analysis. The software is deprecated or in low maintenance and should be replaced by HISAT2, StringTie and Ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. (If you can't get access to that publication, let me know and I'll -cough- help you.) There are also other alternatives, including alignment with STAR or BBMap, or pseudo-alignment using Salmon.

modified 13 months ago • written 13 months ago by WouterDeCoster

Is there any way to fix the fastq file? Thanks for the advice by the way! I would have been struggling with "Tuxedo" nonsense for days.

written 13 months ago by williamsbrian5064

You'll first have to figure out which file is corrupt, and then why. Do you have the original data available? Which steps were taken before this attempted alignment?
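One way to narrow down which file is affected is to scan each FASTQ file for records whose sequence and quality lines differ in length. The following is only a minimal sketch, assuming plain four-line FASTQ records (no wrapped sequence lines); the function names `find_length_mismatches` and `scan_fastq` are hypothetical helpers, not part of TopHat or any other tool mentioned in this thread.

```python
import gzip
import io


def find_length_mismatches(handle):
    """Yield (record_number, header, seq_len, qual_len) for bad records.

    Assumes each record is exactly four lines: header, sequence,
    '+' separator, quality string.
    """
    record_number = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = handle.readline().rstrip("\r\n")
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().rstrip("\r\n")
        record_number += 1
        if len(seq) != len(qual):
            yield record_number, header.strip(), len(seq), len(qual)


def scan_fastq(path):
    """Return all mismatched records in a (possibly gzipped) FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return list(find_length_mismatches(handle))


# Small in-memory demonstration: the second record has 8 bases but 5 scores.
demo = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGTACGT\n+\nIIIII\n"
)
demo_hits = list(find_length_mismatches(demo))
print(demo_hits)  # [(2, '@read2', 8, 5)]
```

Calling `scan_fastq("read_1.fastq")` and `scan_fastq("read_2.fastq")` in turn would show which of the two inputs contains the malformed record. A file that was truncated during transfer typically shows up as a mismatch in its last record.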

written 13 months ago by WouterDeCoster

I'm not entirely sure about that one. I am helping someone out on a project. They ran samples on an Illumina HiSeq so I'm assuming they got a large file that was then demultiplexed. It looks like the barcodes have been trimmed as well. The files were transferred to my external hard drive and I then transferred the files to my computer.

I could try getting the data again from my colleague?

written 13 months ago by williamsbrian5064

That's worth trying indeed.

written 13 months ago by WouterDeCoster

You were right about the file being corrupt. I took it out of the command line and TopHat started working. That is nice to know when I try running HISAT2. Thanks for all the help!

written 13 months ago by williamsbrian5064

I tried the HISAT, StringTie, and Ballgown method today but I got a bit stuck at the R portion of it. I can't find much about the method really. I was wondering if you had any links?

written 13 months ago by williamsbrian5064

The paper contains a lot of R code, is that helpful? Or did you already check that?

written 13 months ago by WouterDeCoster

I tried their R script, got to step 9, and got blocked. They even have a troubleshooting section that identifies the same error that I'm getting (the ballgown function results in an error that the first column of pData does not match the names of the folders containing the ballgown data). I couldn't get past it... I felt like RStudio was a bit more corporative, which could have given me a bit more problems?
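The pData error quoted above can be checked outside R by comparing the first column of the phenotype table with the sample folder names. This is a rough sketch in Python, not ballgown's own check: the helper names are hypothetical, the phenotype file is assumed to be a CSV with a header row, and whether row order must match the folder listing exactly is an assumption worth verifying against the ballgown documentation.

```python
import csv
import os


def read_first_column(csv_path):
    """Return the first column of a CSV file, skipping the header row."""
    with open(csv_path, newline="") as fh:
        rows = list(csv.reader(fh))
    return [row[0] for row in rows[1:] if row]


def list_sample_folders(data_dir):
    """Return the sorted names of the sub-folders of data_dir."""
    return sorted(
        name for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    )


def compare_ids(sample_ids, folder_names):
    """Return (ids without a folder, folders without an id, same order?)."""
    missing = [s for s in sample_ids if s not in folder_names]
    extra = [f for f in folder_names if f not in sample_ids]
    same_order = list(sample_ids) == list(folder_names)
    return missing, extra, same_order


# Illustration with made-up sample names (not from the paper's data set).
demo_ids = ["sample_B", "sample_A", "sample_C"]
demo_folders = ["sample_A", "sample_B", "sample_D"]
demo_result = compare_ids(demo_ids, demo_folders)
print(demo_result)  # (['sample_C'], ['sample_D'], False)
```

With real data, `compare_ids(read_first_column(...), list_sample_folders(...))` would be called with the phenotype CSV and the directory that holds the per-sample folders. A non-empty first or second list points to a naming mismatch, while two empty lists with a False flag point to an ordering difference.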

written 13 months ago by williamsbrian5064

I would suggest opening a separate question, containing your problem, the code you used and the errors you get. Please be as complete as possible.

written 13 months ago by WouterDeCoster

Try validateFiles from the Kent Utilities to find the broken FASTQ record.

written 13 months ago by genomax

Does it have to do with the index file? I had to generate my own?

written 13 months ago by williamsbrian5064

Hi

I am also getting a similar error when running TopHat:

Error: qual length (114) differs from seq length (126) for fastq record !

Please suggest a solution. Any help is much appreciated.

Thanks

modified 3 months ago • written 3 months ago by archana.bioinfo87100

Please do not use the SUBMIT ANSWER window unless you are providing an answer to the original question.

It looks like your FASTQ file has at least one malformed record (one where the number of bases and the number of Q scores don't match). I suggest that you run fastQValidator.

written 3 months ago by genomax