Question

FASTQ- Reverse reads of paired end sequence worse than forward reads

0

Entering edit mode

6.8 years ago

dilawerkh4 • 0

Here is an interesting case and would appreciate feedback:

A 4000 year old human sample from Afghanistan was sequenced using Illumina MiSeq, using the paired end method. Modern human DNA contamination is around 10%, and the sample is subject to post mortem degradation.. The files were uploaded to SRA by Max Plank in Germany.

I fetched the file, SRR3970376 and formed split fastq files, forward and reverse reads. I have attached the output reports from FASTQC::

1-Forward reads

2-Reverse reads

Here are my questions:

1- Why are the reverse reads substantially lower quality than the forward ones. Flow cell overclustering? Issues with the sequencing machine? Degraded primer for the reverse reads? or something else?

2- It is odd that the phred quality scores on the forward reads are as high as 60 (reverse reads up to 35). I have never seen scores higher than 40.

3- Any other thoughts based on the totality of both outputs?

sequencing next-gen alignment Assembly • 5.7k views

ADD COMMENT • link 6.8 years ago by dilawerkh4 • 0

score 1 · Answer 1 · 2017-06-23

1

Entering edit mode

6.8 years ago

Brian Bushnell 20k

Read 2 is generally lower quality. Not sure why... maybe simply being blasted by lasers for days eventually degrades the DNA. But I think there are many reasons. But the quality is certainly not going to improve after days of exposure to radiation and reagents.

It sounds like your data is encoded in ASCII-64 (old Illumina quality scores scale). FastqQC probably has a flag to tell it the input quality format, but generally, I recommend converting it to ASCII-33 as soon as possible to prevent misprocessing the data somewhere. You can do that with the BBMap package:

reformat.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq qin=64 qout=33

ADD COMMENT • link 6.8 years ago by Brian Bushnell 20k

0

Entering edit mode

I suppose an option would be to process the forward reads only, but I will not have enough markers left for allele frequency based population history analysis.

I am inclined to think it may not be a good idea to quality trim the reverse reads only, and then align the 2 files together with bwa mem

ADD REPLY • link 6.8 years ago by dilawerkh4 • 0

1

Entering edit mode

It's best to use both reads together for mapping, even if the second read is low quality, for the best precision (human reference has a lot of repetitive stuff for which paired reads really help determine which repeat copy to map to). You can do quality trimming after alignment (though if you get a really low alignment rate you may want to lightly trim prior to alignment [to, say, Q10] to see if that helps).

ADD REPLY • link 6.8 years ago by Brian Bushnell 20k

0

Entering edit mode

Makes sense. I was also thinking in terms of file 1 ending up with more lines than file 2, which may be problematic.

ADD REPLY • link 6.8 years ago by dilawerkh4 • 0

0

Entering edit mode

@Brian: This data was uploaded in May 2017. It is highly unlikely to be in ASCII-64 format.

This is data from temporal bone and will likely need special handling. It may be best to follow methods published by the Max Planck lab who are experts in this type of fragmented data.

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

Entering edit mode

How strange... Illumina platforms never generate Q-scores over 41 in my experience. I wonder if something odd happened when converting it to SRA?

ADD REPLY • link 6.8 years ago by Brian Bushnell 20k

0

Entering edit mode

See Illumina tech support's answer quoted by @Paul in this thread. Scores as high as 45 are allowed.

ADD REPLY • link 6.8 years ago by GenoMax 141k

GenoMax · Answer 2 · 2017-06-23

0

Entering edit mode

6.8 years ago

dilawerkh4 • 0

Brian,

I am impressed with BBmap. While repairing with repair.sh in=broken.fq out=fixed.fq outs=singletons.fq repair

I got a message:

The ASCII quality encoding offset (64) is not set correctly, or the reads are corrupt; quality value below -5. Please re-run with the flag 'qin=33' or 'ignorebadquality'. Problematic read number 24:

@SRR3970376.49 49 length=20 CCACCGGGGGGCCATGCCGT +SRR3970376.49 49 length=20 @@@@@@@@@@@@@@@@@T@\

So I ran with qin=33 and piped to fastqc. The scores are now calibrated to under 40, which is reasonable.

I also learned the following with BBmap:

Input: 1653053 reads 85461620 bases. Result: 1635944 reads (98.97%) 85136549 bases (99.62%) Pairs: 0 reads (0.00%) 0 bases (0.00%) Singletons: 1635944 reads (98.97%) 85136549 bases (99.62%)

Time: 0.937 seconds. Reads Processed: 1653k 1764.41k reads/sec Bases Processed: 85461k 91.22m bases/sec

which basically tells me that the file has been improperly merged or formatted, and is a single read file. I suppose the command:

./fastq-dump --split-files SRR3970376.sra

split the files into 2 files regardless of whether the file was paired end encoded

ADD COMMENT • link 6.8 years ago by dilawerkh4 • 0

1

Entering edit mode

Note that repair.sh assumes original Illumina read names, or pairs in which read 1 and read 2 have identical names, which it uses to match pairs. These reads have been renamed, so it may not work correctly...

Specifically, if R1 and R2 have identical names (instead of ending with /1 and /2 or 1: and 2:), you need to add the flag "ain" (allowidenticalnames) for repair.sh to work properly. If they have neither identical names nor normal Illumina names, it won't work at all.

ADD REPLY • link 6.8 years ago by Brian Bushnell 20k

0

Entering edit mode

Hi Brian,

These are the 1st few lines from the fastq generated with fastq-dump from SRA. What do you think?

@SRR3970376.1 1 length=17

GCCGCGCCGGTGATGGT

+SRR3970376.1 1 length=17

]]]]]]]]]]]]]]]]]

@SRR3970376.2 2 length=10

AGTCGGCCGT

+SRR3970376.2 2 length=10

]]]]]]]]]]

@SRR3970376.3 3 length=18

GCCGGCGTCTCCGGTCTT

+SRR3970376.3 3 length=18

]]]]]]]]]]]]]]]]]]

@SRR3970376.4 4 length=15

ATTTCCACTGTCATC

+SRR3970376.4 4 length=15

]]]]]]]]]]]]]]]

@SRR3970376.5 5 length=52

TCCACCTCAGCCTCCCAAGTAGCTGGTACTACAGGCACATGCCACCAAACTT

+SRR3970376.5 5 length=52

]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]

@SRR3970376.6 6 length=22

TCGGGCTGACCGCGCTCACCGT

ADD REPLY • link updated 6.8 years ago by GenoMax 141k • written 6.8 years ago by dilawerkh4 • 0

0

Entering edit mode

To get the original illumina read ID's try fastq-dump with -F option.

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

Entering edit mode

This is what I get using the -F option:

@1

GCCGCGCCGGTGATGGT

+1

]]]]]]]]]]]]]]]]]

@2

AGTCGGCCGT

+2

]]]]]]]]]]

@3

GCCGGCGTCTCCGGTCTT

ADD REPLY • link 6.8 years ago by dilawerkh4 • 0

1

Entering edit mode

Sadly the original data submitters appear to have modified the default Illumina fastq headers, so there is not much you can do. Looking at the short length of reads this does not appear to be a paired-end dataset or if it is then the inserts must be very small.

Your best bet is to contact the original submitters to inquire, if you need to use this dataset.

ADD REPLY • link 6.8 years ago by GenoMax 141k