Question: FASTQ- Reverse reads of paired end sequence worse than forward reads
0
gravatar for dilawerkh4
2.4 years ago by
dilawerkh40
dilawerkh40 wrote:

Here is an interesting case and would appreciate feedback:

A 4000 year old human sample from Afghanistan was sequenced using Illumina MiSeq, using the paired end method. Modern human DNA contamination is around 10%, and the sample is subject to post mortem degradation.. The files were uploaded to SRA by Max Plank in Germany.

I fetched the file, SRR3970376 and formed split fastq files, forward and reverse reads. I have attached the output reports from FASTQC::

1-Forward reads

2-Reverse reads

Here are my questions:

1- Why are the reverse reads substantially lower quality than the forward ones. Flow cell overclustering? Issues with the sequencing machine? Degraded primer for the reverse reads? or something else?

2- It is odd that the phred quality scores on the forward reads are as high as 60 (reverse reads up to 35). I have never seen scores higher than 40.

3- Any other thoughts based on the totality of both outputs?

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by dilawerkh40
1
gravatar for Brian Bushnell
2.4 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

Read 2 is generally lower quality. Not sure why... maybe simply being blasted by lasers for days eventually degrades the DNA. But I think there are many reasons. But the quality is certainly not going to improve after days of exposure to radiation and reagents.

It sounds like your data is encoded in ASCII-64 (old Illumina quality scores scale). FastqQC probably has a flag to tell it the input quality format, but generally, I recommend converting it to ASCII-33 as soon as possible to prevent misprocessing the data somewhere. You can do that with the BBMap package:

reformat.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq qin=64 qout=33
ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Brian Bushnell16k

I suppose an option would be to process the forward reads only, but I will not have enough markers left for allele frequency based population history analysis.

I am inclined to think it may not be a good idea to quality trim the reverse reads only, and then align the 2 files together with bwa mem

ADD REPLYlink written 2.4 years ago by dilawerkh40
1

It's best to use both reads together for mapping, even if the second read is low quality, for the best precision (human reference has a lot of repetitive stuff for which paired reads really help determine which repeat copy to map to). You can do quality trimming after alignment (though if you get a really low alignment rate you may want to lightly trim prior to alignment [to, say, Q10] to see if that helps).

ADD REPLYlink written 2.4 years ago by Brian Bushnell16k

Makes sense. I was also thinking in terms of file 1 ending up with more lines than file 2, which may be problematic.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by dilawerkh40

@Brian: This data was uploaded in May 2017. It is highly unlikely to be in ASCII-64 format.

This is data from temporal bone and will likely need special handling. It may be best to follow methods published by the Max Planck lab who are experts in this type of fragmented data.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by genomax74k

How strange... Illumina platforms never generate Q-scores over 41 in my experience. I wonder if something odd happened when converting it to SRA?

ADD REPLYlink written 2.4 years ago by Brian Bushnell16k

See Illumina tech support's answer quoted by @Paul in this thread. Scores as high as 45 are allowed.

ADD REPLYlink written 2.4 years ago by genomax74k
0
gravatar for dilawerkh4
2.4 years ago by
dilawerkh40
dilawerkh40 wrote:

Brian,

I am impressed with BBmap. While repairing with repair.sh in=broken.fq out=fixed.fq outs=singletons.fq repair

I got a message:

The ASCII quality encoding offset (64) is not set correctly, or the reads are corrupt; quality value below -5. Please re-run with the flag 'qin=33' or 'ignorebadquality'. Problematic read number 24:

@SRR3970376.49 49 length=20 CCACCGGGGGGCCATGCCGT +SRR3970376.49 49 length=20 @@@@@@@@@@@@@@@@@T@\

So I ran with qin=33 and piped to fastqc. The scores are now calibrated to under 40, which is reasonable.

I also learned the following with BBmap:

Input: 1653053 reads 85461620 bases. Result: 1635944 reads (98.97%) 85136549 bases (99.62%) Pairs: 0 reads (0.00%) 0 bases (0.00%) Singletons: 1635944 reads (98.97%) 85136549 bases (99.62%)

Time: 0.937 seconds. Reads Processed: 1653k 1764.41k reads/sec Bases Processed: 85461k 91.22m bases/sec

which basically tells me that the file has been improperly merged or formatted, and is a single read file. I suppose the command:

./fastq-dump --split-files SRR3970376.sra

split the files into 2 files regardless of whether the file was paired end encoded

ADD COMMENTlink written 2.4 years ago by dilawerkh40
1

Note that repair.sh assumes original Illumina read names, or pairs in which read 1 and read 2 have identical names, which it uses to match pairs. These reads have been renamed, so it may not work correctly...

Specifically, if R1 and R2 have identical names (instead of ending with /1 and /2 or 1: and 2:), you need to add the flag "ain" (allowidenticalnames) for repair.sh to work properly. If they have neither identical names nor normal Illumina names, it won't work at all.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Brian Bushnell16k

Hi Brian,

These are the 1st few lines from the fastq generated with fastq-dump from SRA. What do you think?

@SRR3970376.1 1 length=17

GCCGCGCCGGTGATGGT

+SRR3970376.1 1 length=17

]]]]]]]]]]]]]]]]]

@SRR3970376.2 2 length=10

AGTCGGCCGT

+SRR3970376.2 2 length=10

]]]]]]]]]]

@SRR3970376.3 3 length=18

GCCGGCGTCTCCGGTCTT

+SRR3970376.3 3 length=18

]]]]]]]]]]]]]]]]]]

@SRR3970376.4 4 length=15

ATTTCCACTGTCATC

+SRR3970376.4 4 length=15

]]]]]]]]]]]]]]]

@SRR3970376.5 5 length=52

TCCACCTCAGCCTCCCAAGTAGCTGGTACTACAGGCACATGCCACCAAACTT

+SRR3970376.5 5 length=52

]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]

@SRR3970376.6 6 length=22

TCGGGCTGACCGCGCTCACCGT
ADD REPLYlink modified 2.4 years ago by genomax74k • written 2.4 years ago by dilawerkh40

To get the original illumina read ID's try fastq-dump with -F option.

ADD REPLYlink written 2.4 years ago by genomax74k

This is what I get using the -F option:

@1

GCCGCGCCGGTGATGGT

+1

]]]]]]]]]]]]]]]]]

@2

AGTCGGCCGT

+2

]]]]]]]]]]

@3

GCCGGCGTCTCCGGTCTT

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by dilawerkh40
1

Sadly the original data submitters appear to have modified the default Illumina fastq headers, so there is not much you can do. Looking at the short length of reads this does not appear to be a paired-end dataset or if it is then the inserts must be very small.

Your best bet is to contact the original submitters to inquire, if you need to use this dataset.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by genomax74k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1777 users visited in the last hour