Question

Error while using the Cutadapt 2.6 output FASTQ file as input for alignment in Rsubread

0

4.5 years ago

kubano • 0

I trimmed the 3' adapters from my RNA-Seq reads with cutadapt 2.6. When I load the trimmed reads for alignment in Rsubread, the program aborts after several lines with the message "ERROR: a format issue @ is found on the 393884-th line in input file".

I did not find anything about this issue in the manuals for either cutadapt or Rsubread. I am also quite new to this, so I don't know whether I have simply overlooked something obvious...

Would anyone know how to proceed further, please?

RNA-Seq software error alignment • 1.7k views

ADD COMMENT • link updated 4.5 years ago by cpad0112 21k • written 4.5 years ago by kubano • 0

0

What is the output of head -n 393884 your.fastq | tail?

ADD REPLY • link 4.5 years ago by ATpoint 82k

0

That did not work for some reason, so I tried it in R:

x <- scan('my.fastq', '', skip = 393884, nlines = 1, sep = '\n')

Read 1 item

x

[1] "@7001425F:195:CDYHMANXX:3:1102:19436:35854 1:N:0:CACTCA"

ADD REPLY • link 4.5 years ago by kubano • 0

0

Can you show a few lines before and after that? The issue is probably somewhere there.

ADD REPLY • link 4.5 years ago by ATpoint 82k
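A window of lines around the reported position can also be printed directly in the shell. The following is only a minimal sketch and not something posted in this thread: show_lines is a hypothetical helper name, and the file name in the usage line is a placeholder. Every line is prefixed with its line number, so empty lines stay visible.

```shell
# show_lines FILE CENTER [CONTEXT]
# Print lines CENTER-CONTEXT .. CENTER+CONTEXT of FILE, each prefixed with
# its line number and a tab. Empty lines are printed too (number only).
show_lines() {
    file=$1
    center=$2
    context=${3:-8}   # default: 8 lines before and after
    awk -v c="$center" -v k="$context" '
        NR >= c - k && NR <= c + k { printf "%d\t%s\n", NR, $0 }
        NR > c + k { exit }   # stop reading once the window has passed
    ' "$file"
}
```

Usage, for the position named in the error message: show_lines my.fastq 393884 8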

0

It seems there is a missing line, for some reason:

x <- scan('file.fastq', '', skip = 393870, nlines = 20, sep = '\n')

Read 18 items

x

 [1] "+"
 [2] "BBBBBF/FFB///<FBFB/<FFB<B</<BB//FBFF<FFFF<FF<F/F/FBF///"
 [3] "@7001425F:195:CDYHMANXX:3:1102:19146:35958 1:N:0:CACTCA"
 [4] "CGTCCTCGCCCGCGCGAATGCGGCCCAGCTTGTTGAGCAAGGCAACGGCCGCCGT"
 [5] "+"                                                      
 [6] "BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
 [7] "@7001425F:195:CDYHMANXX:3:1102:19288:35764 1:N:0:CACTCA"
 [8] "CTTCGCGATCGATGTCGATGGTGCGCAGCAGTTCGCGCACGGCCTCGGCGCCCAT"
 [9] "+"                                                      
[10] "BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
[11] "@7001425F:195:CDYHMANXX:3:1102:19414:35790 1:N:0:CACTCA"
[12] "+"
[13] "@7001425F:195:CDYHMANXX:3:1102:19436:35854 1:N:0:CACTCA"
[14] "CGGGCTGCTGCACGCCGCGCAGGATGCCGTTGAGAGCCCCGGTCAGCAAGGAAGT"
[15] "+"                                                      
[16] "BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
[17] "@7001425F:195:CDYHMANXX:3:1102:19382:35879 1:N:0:CACTCA"
[18] "GGCAGTTTCTTAAGAGCGTCGATGTCGTAGGCATTCTGCAACGAGCCCGCGCCTT"

ADD REPLY • link updated 4.5 years ago by ATpoint 82k • written 4.5 years ago by kubano • 0

1

Yes, the file is corrupted: for the record "@7001425F:195:CDYHMANXX:3:1102:19414:35790 1:N:0:CACTCA" two lines (the sequence and the quality string) are missing. Consider using e.g. repair.sh from BBMap to try to fix the file, discarding the corrupted part.

ADD REPLY • link 4.5 years ago by ATpoint 82k
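One possible explanation for "Read 18 items" when 20 lines were requested: R's scan() skips blank lines by default (blank.lines.skip = TRUE), so the two lines may be present but empty rather than absent. Under that assumption, and assuming a plain four-line-per-record FASTQ without wrapped sequences, affected records could be listed with a sketch like the one below. fastq_empty_records is a hypothetical helper name, not part of any tool mentioned here.

```shell
# fastq_empty_records FILE
# Assumes four lines per record: header, sequence, "+", quality.
# Prints the line number and header of every record whose sequence line
# is empty.
fastq_empty_records() {
    awk '
        NR % 4 == 1 { header = $0; header_line = NR }
        NR % 4 == 2 && length($0) == 0 { printf "%d\t%s\n", header_line, header }
    ' "$1"
}
```

If the lines are truly missing rather than empty, the four-line framing no longer holds and the output of this sketch would not be meaningful past the first broken record.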

0

Thank you for your help. But as this is becoming far too "black box" for me, and as only a small fraction of the sequences contain adapters, I'll probably go for alignment without trimming, or try a different tool to cut the adapters. Is that legitimate?

ADD REPLY • link 4.5 years ago by kubano • 0

0

Filtering the reads by length (minimum length set to 10) solved the issue!

Thank you all for your help!

ADD REPLY • link 4.5 years ago by kubano • 0

score 3 · Accepted Answer · 2019-10-31

3

4.5 years ago

cpad0112 21k

My suspicion is that cutadapt trimmed the entire sequence (and the corresponding quality values) of some reads, based on the trimming parameters. Either filter out all trimmed reads with sequence length < 1, or set a minimum length to retain during trimming (e.g. -m for cutadapt). @kubano

ADD COMMENT • link 4.5 years ago by cpad0112 21k
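The post-hoc filter suggested above could be sketched with awk, again assuming four lines per record. fastq_min_length is a hypothetical helper name. At trimming time the equivalent is cutadapt's -m (--minimum-length) option, shown in the comment with placeholder adapter and file names.

```shell
# At trimming time (ADAPTER and the file names are placeholders):
#   cutadapt -a ADAPTER -m 10 -o trimmed.fastq input.fastq

# fastq_min_length FILE MINLEN
# Assumes four lines per record. Writes to stdout only those records whose
# sequence is at least MINLEN bases long.
fastq_min_length() {
    awk -v min="$2" '
        { rec[(NR - 1) % 4] = $0 }                 # buffer the current record
        NR % 4 == 0 && length(rec[1]) >= min {     # rec[1] is the sequence line
            print rec[0]; print rec[1]; print rec[2]; print rec[3]
        }
    ' "$1"
}
```

Usage: fastq_min_length trimmed.fastq 10 > filtered.fastq. This sketch handles a single file only; for paired-end data both files must be filtered together so that mates stay in sync.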

0

This is worth trying, thank you!

ADD REPLY • link 4.5 years ago by kubano • 0

0

@kubano, try seqkit to check for records/reads with no sequence:

$ seqkit seq -M 0 <input.fastq/fastq.gz>

To count the reads without sequences, try:

$ seqkit seq -M 0 <input.fastq/fastq.gz> | seqkit stats

The num_seqs column of the output gives the number of such reads.

ADD REPLY • link 4.5 years ago by cpad0112 21k