Subsampling fastq.gz reads using reformat.sh changed all the quality scores?
2
0
Entering edit mode
4.4 years ago

I want to subset 80M paired-end reads from fastq.gz files that contain 120M paired-end reads. I used the reformat.sh package, and here is my code:

$REFORMAT/reformat.sh in1=B4337_4_R1.fastq.gz in2=B4337_4_R2.fastq.gzout1=$OUT/B4337_4_80M_R1.fastq.gz out2=$OUT/B4337_4_80M_R2.fastq.gz samplereadstarget=80000000

After running the script, I found that 80M reads were sampled, but all the quality scores have been changed:

In the old fastq.gz file, I had one read:

@K00208:YAP076:7:1101:18.42:12.06#0/1
NGTTAAGAGCATGAATCTTACTACTGAATGATCTTAAACAAGTTACAGCAGGTCCTCAAACGACATCAGTTTCATTCAACACTGTTTTCTTATGATTTTGA
+
@```[ieeeiiiiieieeiiiiiiiieiieeeiiiiiiiiiieiiiiiiie``eeiiiiieVeiee`iieeeeiieiii`ieee``iieieii`````iii

Whereas in the new fastq.gz file, I had the following:

@K00208:YAP076:7:1101:18.42:12.06#0/1
NGTTAAGAGCATGAATCTTACTACTGAATGATCTTAAACAAGTTACAGCAGGTCCTCAAACGACATCAGTTTCATTCAACACTGTTTTCTTATGATTTTGA
+
!AAA<JFFFJJJJJFJFFJJJJJJJJFJJFFFJJJJJJJJJJFJJJJJJJFAAFFJJJJJF7FJFFAJJFFFFJJFJJJAJFFFAAJJFJFJJAAAAAJJJ

This is very weird to me, and it interferes with my next step, which is to use NGmerge.sh to trim the adaptors. Does anyone know what may be wrong with my code? How may I fix the issue? Thanks, Jenny

sequence • 2.9k views
ADD COMMENT
0
Entering edit mode
4.4 years ago

This is not weird, this is what it was supposed to do, reformat.sh itself says:

Written by Brian Bushnell Last modified June 27, 2016 Description: Reformats reads to change ASCII quality encoding, interleaving, file format, or compression format. Optionally performs additional functions such as quality trimming, subsetting, and subsampling. Supports sam, fastq, fasta, fasta+qual, scarf, gzip, zip.

In your case, you had Phred-64 data, but after running reformat it is converted into it's equivalent Phred-33 value.

@ converted into ! ' converted into A [ converted into <

and so on.

enter image description here

If you want just want subset of 80M reads why don't you try this command (Only if you are on linux)

head -80000000 in.fq >out.fq

ADD COMMENT
0
Entering edit mode

While the in-line help does say what you quoted above, that text does NOT mean that that is what reformat.sh does by default. It just means that one is able to change the encoding if one wants to.

ADD REPLY
0
Entering edit mode
4.4 years ago
GenoMax 141k

yinjiani900129 : AFAIK reformat.sh should not change quality scores in your data. The default values for quality score handling options are set to auto:

qin=auto                ASCII offset for input quality.  May be 33 (Sanger), 64 (Illumina), or auto.
qout=auto               ASCII offset for output quality.  May be 33 (Sanger), 64 (Illumina), or auto (same as input).

Which means that quality scores should be left in the same encoding as before.

If for some reason reformat.sh is unable to identify original encoding in your data and you know that your data is in Sanger format then add qin=33 qout=33 to your command above to keep data in Sanger format. If the original data is in Illumina format then you could either leave in in that format by doing qin=64 qout=64 or change it to Sanger format by qin=64 qout=33.

Looking at the fastq headers you have, it is appears that your data is in Illumina (phread+64) format.

ADD COMMENT
0
Entering edit mode

I think what you said makes a lot of sense.

I tried:

$REFORMAT/reformat.sh in1=B4337_4_R1.fastq.gz in2=B4337_4_R2.fastq.gzout1=$OUT/B4337_4_80M_R1.fastq.gz out2=$OUT/B4337_4_80M_R2.fastq.gz samplereadstarget=80000000 qin=64 qout=64

But got errors:
Set INTERLEAVED to false
Exception in thread "main" java.lang.NoSuchMethodError: java.lang.Process.isAlive()Z
    at dna.Data.testExecute(Data.java:1769)
    at dna.Data.PIGZ(Data.java:1577)
    at fileIO.ReadWrite.getGZipInputStream(ReadWrite.java:1030)
    at fileIO.ReadWrite.getInputStream(ReadWrite.java:847)
    at fileIO.ByteFile1.open(ByteFile1.java:330)
    at fileIO.ByteFile1.<init>(ByteFile1.java:100)
    at fileIO.ByteFile2$BF1Thread.<init>(ByteFile2.java:237)
    at fileIO.ByteFile2.open(ByteFile2.java:215)
    at fileIO.ByteFile2.<init>(ByteFile2.java:86)
    at fileIO.ByteFile.makeByteFile(ByteFile.java:35)
    at fileIO.ByteFile.makeByteFile(ByteFile.java:27)
    at stream.FastqReadInputStream.<init>(FastqReadInputStream.java:59)
    at stream.ConcurrentReadInputStream.getReadInputStream(ConcurrentReadInputStream.java:111)
    at stream.ConcurrentReadInputStream.getReadInputStream(ConcurrentReadInputStream.java:64)
    at jgi.ReformatReads.countReads(ReformatReads.java:1360)
    at jgi.ReformatReads.process(ReformatReads.java:449)
    at jgi.ReformatReads.main(ReformatReads.java:51)
ADD REPLY
0
Entering edit mode

Did you move any of the files in bbmap software after you downloaded and uncompressed the archive?

ADD REPLY

Login before adding your answer.

Traffic: 1778 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6