Here is an interesting case; I would appreciate feedback:
A 4000-year-old human sample from Afghanistan was sequenced on an Illumina MiSeq using paired-end sequencing. Modern human DNA contamination is around 10%, and the sample is subject to post-mortem degradation. The files were uploaded to SRA by Max Planck in Germany.
I fetched the run, SRR3970376, and split it into forward and reverse FASTQ files. I have attached the FastQC output reports:
Here are my questions:
1- Why are the reverse reads of substantially lower quality than the forward ones? Flow cell overclustering? Issues with the sequencing machine? A degraded primer for the reverse reads? Or something else?
2- It is odd that the Phred quality scores on the forward reads are as high as 60 (the reverse reads go up to 35). I have never seen scores higher than 40.
3- Any other thoughts based on the totality of both outputs?
I suppose one option would be to process the forward reads only, but then I would not have enough markers left for allele-frequency-based population history analysis.
I am inclined to think it may not be a good idea to quality trim only the reverse reads and then align the two files together with bwa mem.
It's best to use both reads together for mapping, even if the second read is low quality, for the best precision (the human reference has a lot of repetitive sequence, and paired reads really help determine which repeat copy to map to). You can do quality trimming after alignment (though if you get a really low alignment rate you may want to lightly trim prior to alignment [to, say, Q10] to see if that helps).
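To make the "lightly trim to Q10" idea concrete, here is a minimal sketch of 3'-end trimming on a single read. The function name `trim_tail` and its parameters are made up for illustration, and it assumes Phred+33 quality encoding; a real run would use an established trimmer rather than this.

```python
def trim_tail(seq, qual, min_q=10, offset=33):
    """Drop trailing bases whose Phred score is below min_q.

    Assumes Phred+33 encoding (offset=33); hypothetical helper, not a
    replacement for a dedicated trimming tool.
    """
    end = len(qual)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0 under Phred+33.
print(trim_tail("ACGTACGT", "IIIII#!!"))
```

If both mates are trimmed this way, a read that ends up empty should be handled together with its mate so that the two files keep the same number of records.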
Makes sense. I was also thinking in terms of file 1 ending up with more lines than file 2, which may be problematic.
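If unequal record counts are a worry, a quick sanity check is to walk both files in parallel and compare read names. This is only a sketch: `read_names` and `pairs_in_sync` are hypothetical helpers, and it assumes plain four-line FASTQ records with the read name as the first whitespace-separated token of the header.

```python
from itertools import zip_longest


def read_names(lines):
    """Yield the read name from every 4th line (the FASTQ header line)."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            fields = line.split()
            name = fields[0] if fields else ""
            # Drop a trailing /1 or /2 mate suffix if one is present.
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def pairs_in_sync(lines1, lines2):
    """Return (in_sync, n_matching_pairs) for two iterables of FASTQ lines."""
    n = 0
    for a, b in zip_longest(read_names(lines1), read_names(lines2)):
        if a != b:  # a name mismatch, or one file ended before the other
            return False, n
        n += 1
    return True, n
```

Open file handles can be passed directly, for example `pairs_in_sync(open("reads_1.fastq"), open("reads_2.fastq"))`, where the file names are placeholders.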
@Brian: This data was uploaded in May 2017. It is highly unlikely to be in Phred+64 (ASCII offset 64) format.
This is data from temporal bone and will likely need special handling. It may be best to follow methods published by the Max Planck lab who are experts in this type of fragmented data.
How strange... Illumina platforms never generate Q-scores over 41 in my experience. I wonder if something odd happened when converting it to SRA?
See Illumina tech support's answer quoted by @Paul in this thread. Scores as high as 45 are allowed.
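For anyone who wants to check the encoding question on the actual files, here is a rough sketch that reports the score range implied by a given ASCII offset. `phred_range` and `guess_offset` are hypothetical names, and the offset guess is a heuristic only: characters below ASCII 59 cannot occur in offset-64 data, but the absence of such characters proves nothing either way.

```python
def phred_range(quality_strings, offset=33):
    """Return (min, max) quality score implied by the given ASCII offset."""
    codes = [ord(c) for q in quality_strings for c in q]
    return min(codes) - offset, max(codes) - offset


def guess_offset(quality_strings):
    """Return 33 if any character rules out offset 64, else None (ambiguous)."""
    lowest = min(ord(c) for q in quality_strings for c in q)
    # Offset-64 encodings never use characters below ';' (ASCII 59).
    return 33 if lowest < 59 else None


quals = ["II5", "!]"]  # toy quality strings, not taken from SRR3970376
print(phred_range(quals), guess_offset(quals))
```

Under offset 33, a reported score of 60 corresponds to the character `]` (ASCII 93), so looking at which characters actually appear in the forward-read quality lines should show whether the high scores are present in the file itself.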