problem about SHRiMP2 solving AB SOLID
1
0
Entering edit mode
7.2 years ago
laihui126cn ▴ 40

I want to do the mapping about AB SOLID data. They are paired-end sequencing. The format of .fastq files show below:

pair1:

@SRR586064.1ugc_357_358_MatePair_2x50bp_solid0032_20100528_MP_ugc_357_854_13_97/1
T30..01121.12.1032100213131122200031222022101302313
+
!AB!!?:>@<!@B!;AB?AA@@<2<@?@@<?AB>A?9?:;@@>;-=>>7=@


pair2:

@SRR586064.1ugc_357_358_MatePair_2x50bp_solid0032_20100528_MP_ugc_357_854_13_97/2
G10330000122222033220201000000220002000000000000000
+
!@BBABBBB@>?@@(.))35.%-.3((%1+%((82-'.*-*/3*14*'696


I try use the tool:SHRiMP2, but it is not friendly to solve my datasets. I run the command:

gmapper-cs -1 pair1.fastq -2 pair2.fastq -L ./test/Sscro --pair-mode opp-in -Q --trim-first --qv-offset 33 >pair12.sam 2>pair12.log


And I got the info in the end of file .log:

note: detected fastq format in input file [pair1.fastq]
- Processing read files [pair1.fastq , pair2.fastq]
note: quality value format not set explicitly; using PHRED+33
done r/hr r/core-hr
There has been a problem reading in the read "SRR586064.1", the quality length exceeds the sequence length!
Are you using the right executable? gmapper-cs for color space? and gmapper-ls for letter space?


I have tried some parameters of SHRiMP2, but failed. And I can't find any similar example from SHRiMP2 website. Does anybody get the same problem and have you solve it ? Or maybe I should use another tool, any suggestion?

alignment SHRiMP2 cs fastq SOLID • 2.7k views
0
Entering edit mode

grep -A4 SRR586064.1 (I think its the very fast read) from your fastq file and paste here. The error is pretty obvious and is due to inconsistency in length between the read sequence and its quality score string. Did you trim your reads before using some trimming tool ? Or may be you are not providing the arguments in the right manner. For example, you used --trim-first but didn't mention any integer value. Please read the manual carefully and try again.

0
Entering edit mode

Thank you, Pandey. I have solved this problem. Have you ever used the tool SHRiMP2? I have no idea to understand his arguments, and there is no example in SHRiMP2 README.

Could you help me?

0
Entering edit mode

Ya I have used it a lot when I used to work with color space reads. If you have no idea about the arguments I would advise you to go with default arguments. They have been well-tested and should produce optimal results unless you are trying to do something really different.

0
Entering edit mode

So happy to hear from you. I have tried default arguments, and got the terrible result:

reads matched(17.4191%)


I used the command:

gmapper-cs SRR_test.fastq \
--local \
-Q -N 15 \
--all-contigs \
--sigle-best-mapping >SRR_test.sam 2>SRR_test.log


Another better result：

reads matched (54.3101%)


I used the command:

gmapper-cs SRR_test.fastq \
--local \
-Q -N 15 \
-r 45% \
-v 20% \
-h 20% \
--all-contigs \
--sigle-best-mapping >SRR_test.sam 2>SRR_test.log


Can you give me some suggestion according to my command? Thanks.

0
Entering edit mode

Hi, I'm using gmapper-cs in SHRiMP 2.2.3, and facing the same problem the quality length exceeds the sequence length.

I only use the default parameter -N 28 -E but it still raising this error.

How you successfully solved this problem?

Thanks~

0
Entering edit mode
maybe my script could help u. I trimed the first QV ,but the first base isn't changing.you can write a script ,or I give it to u tomorrow morning. sorry to reply by phone
0
Entering edit mode
my($file1,$file2)=@ARGV;
open IN,"<$file1" or die "cant't open the input file"; open OUT,">>$file2" or die "Please enter the output filename";
while(<IN>){
print OUT;
$_=<IN>; print OUT;$_=<IN>;
print OUT;
$_=<IN>; my$cutTheFirstSubstr=substr($_,1); print OUT "$cutTheFirstSubstr";
}
close IN;
close OUT;

0
Entering edit mode

Thank you very much and sorry to reply late, I understand now.

But why you choose to trim the first QV but not the last? Will the mapping process make use of the QV ?

0
Entering edit mode

I could let you know that my cs-fastq file is looks like this:

@SRR_2x50bp_solid0032_20100528_MP_ugc_357_854_13_526/1
T11..02300.23.2232000120031323333021202133212323310
+
!%(!!;%&@1!'/!%7%&A?;&(792)'27@'%.@<%%4+<)'(),%%&*-

@SRR_MatePair_2x50bp_solid0032_20100528_MP_ugc_357_854_13_174/1
T12..11220.00.0210230201212023232220021300333211031
+
!AB!!@BBAB!>;!AA@?@B>AABA7>BBB:@AB?>>@?A>>ABB=AB6@;
@SRR_MatePair_2x50bp_solid0032_20100528_MP_ugc_357_854_13_207/1
T30..01011.01.0103220033000022030032232200322022112
+
!B9!!;>5+5!48!4:,??%%+=;6<4%>;2@&%<?)81=+5%)@92?&1=
@SRR_MatePair_2x50bp_solid0032_20100528_MP_ugc_357_854_13_282/1
T21..12213.31.1321000232210220312000223220220031003
+
!%%!!&*(&%!,&!&''&(('&&)'%*)&&()((&(*()%'&('&('(*&&


Significantly,all the first QV is "!" corresponding the first nucleic-color base.The character '!' represents the lowest quality. So this may be a symbol to differ from others.

0
Entering edit mode

Um,I understand. Differentially, in my files, the sequences are longer than QV, so I also trim them from tail. SOLID format is not intuitive.

0
Entering edit mode
7.0 years ago
laihui126cn ▴ 40

Could you paste some reads of yours? That is easy to see if our SOLID reads are similar.

0
Entering edit mode

I am having similar problems. I have downloaded the fastq files from dbGaP SRAtools.

This is what my reads look like:

@SRR957099.1 1_14_37 length=50
T201..133...01....1....2...........................
+SRR957099.1 1_14_37 length=50
!@@@!!@@@!!!@@!!!!@!!!!@!!!!!!!!!!!!!!!!!!!!!!!!!!!
@SRR957099.2 1_14_50 length=50
T311..320...00....0....2...........................
+SRR957099.2 1_14_50 length=50
!@@@!!@@@!!!@@!!!!@!!!!@!!!!!!!!!!!!!!!!!!!!!!!!!!!


Do I need to take out the ! at the beginning of each QV line?

0
Entering edit mode

maybe you need to filter your csfasta and qual files using the solid pipeline preprocessing (see QC pipeline https://github.com/fls-bioinformatics-core/genomics), the same convert your files in fastq.