Error while using the Cutadapt 2.6 output FASTQ file as input for alignment in Rsubread
4.5 years ago
kubano • 0

I have cut the 3' adapters from my RNA-Seq reads with cutadapt 2.6. When I load the trimmed reads for alignment in Rsubread, the program aborts after several lines with "ERROR: a format issue @ is found on the 393884-th line in input file".

I did not find anything about this issue in the manuals for either cutadapt or Rsubread. I am also quite new to this, so I don't know whether I have just overlooked something obvious...

Would anyone know how to proceed further, please?

RNA-Seq software error alignment • 1.7k views

What is the output of head -n 393884 your.fastq | tail?


That did not work for some reason, so I tried in R:

x <- scan('my.fastq', '', skip = 393884, nlines = 1, sep = '\n')

Read 1 item

x

[1] "@7001425F:195:CDYHMANXX:3:1102:19436:35854 1:N:0:CACTCA"


Can you show a few lines before and after that? The issue is probably somewhere there.
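One way to look at the neighbourhood of the reported line without R is a small shell helper. This is only a sketch: show_context is a hypothetical name, it assumes an uncompressed FASTQ file, and the default of 8 context lines is arbitrary. It prints each line with its number and wraps the content in brackets so that empty lines stay visible.

```shell
#!/bin/sh
# show_context FILE LINE [CONTEXT]
# Print the lines around LINE, each prefixed with its line number.
# The content is wrapped in [ ] so that an empty line shows up as [].
# (show_context is a hypothetical helper, not part of any tool.)
show_context() {
    file=$1
    line=$2
    ctx=${3:-8}
    start=$((line - ctx))
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    end=$((line + ctx))
    awk -v s="$start" -v e="$end" '
        NR > e  { exit }
        NR >= s { printf "%d\t[%s]\n", NR, $0 }
    ' "$file"
}
```

For the case in this thread, a call such as show_context my.fastq 393884 8 would list lines 393876 to 393892.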


It seems there is a line missing, for some reason (around items [11] to [14]):

x <- scan('file.fastq', '', skip = 393870, nlines = 20, sep = '\n')

Read 18 items

x

 [1] "+"
 [2] "BBBBBF/FFB///<FBFB/<FFB<B</<BB//FBFF<FFFF<FF<F/F/FBF///"
 [3] "@7001425F:195:CDYHMANXX:3:1102:19146:35958 1:N:0:CACTCA"
 [4] "CGTCCTCGCCCGCGCGAATGCGGCCCAGCTTGTTGAGCAAGGCAACGGCCGCCGT"
 [5] "+"
 [6] "BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
 [7] "@7001425F:195:CDYHMANXX:3:1102:19288:35764 1:N:0:CACTCA"
 [8] "CTTCGCGATCGATGTCGATGGTGCGCAGCAGTTCGCGCACGGCCTCGGCGCCCAT"
 [9] "+"
[10] "BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
[11] "@7001425F:195:CDYHMANXX:3:1102:19414:35790 1:N:0:CACTCA"
[12] "+"
[13] "@7001425F:195:CDYHMANXX:3:1102:19436:35854 1:N:0:CACTCA"
[14] "CGGGCTGCTGCACGCCGCGCAGGATGCCGTTGAGAGCCCCGGTCAGCAAGGAAGT"
[15] "+"
[16] "BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
[17] "@7001425F:195:CDYHMANXX:3:1102:19382:35879 1:N:0:CACTCA"
[18] "GGCAGTTTCTTAAGAGCGTCGATGTCGTAGGCATTCTGCAACGAGCCCGCGCCTT"

Yes, the file is corrupted: for "@7001425F:195:CDYHMANXX:3:1102:19414:35790 1:N:0:CACTCA" the sequence line and the quality line are missing. Consider using a tool such as repair.sh from BBMap to try to fix the file, discarding the corrupted part.
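A possible caveat: R's scan() skips blank lines by default, which would explain why 20 lines gave only 18 items. If so, the two lines may be present but empty (a read trimmed to length zero) rather than physically missing. The sketch below is one way to check this directly. It assumes a plain, uncompressed FASTQ file with exactly four lines per record (no wrapped sequences), and check_fastq is a hypothetical helper name.

```shell
#!/bin/sh
# check_fastq FILE
# Walk a 4-line-per-record FASTQ file and report records whose sequence
# is empty or whose sequence and quality strings differ in length.
# Stops with a non-zero status at the first line that breaks the
# header / sequence / + / quality framing.
# (check_fastq is a hypothetical helper; it assumes unwrapped FASTQ.)
check_fastq() {
    awk '
        NR % 4 == 1 {
            header = $0
            if (substr($0, 1, 1) != "@") {
                printf "line %d: expected a header starting with @\n", NR
                exit 1
            }
        }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            printf "line %d: expected a + separator\n", NR
            exit 1
        }
        NR % 4 == 0 {
            if (length(seq) == 0)
                printf "line %d: empty sequence in %s\n", NR - 2, header
            else if (length(seq) != length($0))
                printf "line %d: sequence and quality lengths differ in %s\n", NR - 2, header
        }
    ' "$1"
}
```

Running check_fastq file.fastq prints one line per suspicious record. If it only reports empty sequences, the four-line framing is intact and the reads were most likely trimmed to nothing.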


Thank you for your help. But this is becoming far too much of a "black box" for me, and since only a small fraction of the sequences contain adapters, I'll probably go for alignment without trimming, or try a different tool to cut the adapters. Is that legitimate?


Filtering the reads by length (minimum length set to 10) solved the issue!

Thank you all for your help!

4.5 years ago

My suspicion is that cutadapt trimmed the entire sequence (and the corresponding quality values) of some reads, based on the trimming parameters. Either filter out all trimmed reads with sequence length < 1, or set a minimum length to retain during trimming (e.g. -m for cutadapt). @kubano
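If the file has already been trimmed, the length filter can also be applied afterwards. The following is a minimal sketch, assuming an uncompressed FASTQ file with four lines per record; filter_min_length is a hypothetical helper name, not part of cutadapt or Rsubread.

```shell
#!/bin/sh
# filter_min_length MIN < in.fastq > out.fastq
# Keep only records whose sequence has at least MIN bases.
# (filter_min_length is a hypothetical helper; it assumes unwrapped,
# 4-line-per-record FASTQ on standard input.)
filter_min_length() {
    awk -v min="$1" '
        # Buffer the four lines of the current record:
        # rec[1] header, rec[2] sequence, rec[3] separator, rec[0] quality.
        { rec[NR % 4] = $0 }
        NR % 4 == 0 && length(rec[2]) >= min {
            printf "%s\n%s\n%s\n%s\n", rec[1], rec[2], rec[3], rec[0]
        }
    '
}
```

For example, filter_min_length 10 < trimmed.fastq > filtered.fastq keeps reads of 10 bases or more. Doing it at trimming time is simpler: a call along the lines of cutadapt -a ADAPTER -m 10 -o trimmed.fastq reads.fastq, where ADAPTER and the file names are placeholders, discards reads shorter than 10 bases after trimming.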


This is worth trying, thank you!


@kubano, try seqkit to check for records/reads with no sequence:

$ seqkit seq -M 0 <input.fastq/fastq.gz>

To count the reads without sequence, try:

$ seqkit seq -M 0 <input.fastq/fastq.gz> | seqkit stats

The num_seqs column of the output then gives the number of such reads.
