Question: reduced quality in paired-end R2 reads
1
2.9 years ago by
Assa Yeroslaviz1.2k
Munich
Assa Yeroslaviz1.2k wrote:

Hi,

We are having some trouble with our ChIP-seq experiment. We're sequencing yeast in paired-end mode (2x76 bp); we have eight samples with four biological replicates. The samples were barcoded and multiplexed over the four lanes of a flow cell.

The quality of the data is not as good as we would like it to be, but what puzzles me most is that the reads from the reverse strand (R2) show much lower quality than the forward strand (R1) reads. This is consistent across all eight samples.

I don't think this is a lane-specific problem. As you can see in the attached pictures below, the forward strand (R1) looks better than the reverse strand (R2) independent of the lane and/or sample.

As I can't think of any biological explanation for the problem (is there one?), I would appreciate hearing from anyone who has already encountered this kind of problem with their data and can share the experience. Is there a technical problem here?

thanks

Assa

attachment:

fastqc rna-seq chip-seq quality qc • 2.6k views
ADD COMMENTlink modified 2.9 years ago by Brian Bushnell16k • written 2.9 years ago by Assa Yeroslaviz1.2k
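
For readers who want to check the R1/R2 difference numerically rather than by eye, here is a minimal Python sketch that computes the mean Phred quality per cycle from FASTQ records. It assumes Phred+33 encoded qualities (the usual encoding for current Illumina output); the function name and the toy records are made up for illustration and are not taken from the data in this thread.

```python
import io
from itertools import islice


def mean_quality_per_cycle(handle, offset=33):
    """Return the mean Phred score at each read position of a FASTQ stream."""
    sums, counts = [], []
    while True:
        record = list(islice(handle, 4))  # one FASTQ record = 4 lines
        if len(record) < 4:
            break
        qual = record[3].rstrip("\n")
        for i, ch in enumerate(qual):
            if i == len(sums):  # first time we see this cycle
                sums.append(0)
                counts.append(0)
            sums[i] += ord(ch) - offset  # assumes Phred+33 encoding
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


# Toy records (hypothetical): 'I' = Q40, '5' = Q20, '#' = Q2
r1 = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nACGT\n+\nIIII\n")
r2 = io.StringIO("@a/2\nACGT\n+\nII55\n@b/2\nACGT\n+\nI5##\n")

q1 = mean_quality_per_cycle(r1)
q2 = mean_quality_per_cycle(r2)
for cycle, (a, b) in enumerate(zip(q1, q2), start=1):
    print(f"cycle {cycle}: R1 mean Q{a:.1f}, R2 mean Q{b:.1f}")
```

For real data, the in-memory strings could be replaced with `gzip.open(path, "rt")` handles for the R1 and R2 files of one sample.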

What sequencer was this done on (I am guessing a NextSeq) and what was the cluster density?
Have you tried a scan/trim program to rule out the possibility that you have short inserts and are just sequencing adapters at the R2 end?

ADD REPLYlink written 2.9 years ago by genomax65k
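
As a rough illustration of the short-insert check suggested above, the sketch below looks for the start of the adapter inside each read. It assumes TruSeq-style adapters, whose read-through sequence begins with `AGATCGGAAGAGC`; other library kits use different adapters. The function name and the toy reads are hypothetical, and only exact matches are found, whereas real trimmers also handle mismatches and partial adapters at the read end.

```python
ADAPTER_PREFIX = "AGATCGGAAGAGC"  # assumed: TruSeq-style adapter start


def adapter_positions(sequences, adapter=ADAPTER_PREFIX):
    """Yield the 0-based adapter start in each read, or None if absent."""
    for seq in sequences:
        pos = seq.find(adapter)  # exact match only; no mismatches allowed
        yield pos if pos >= 0 else None


# Toy reads (hypothetical)
reads = [
    "ACGTACGTAC" + ADAPTER_PREFIX + "ACAC",  # 10 bp insert, then adapter
    "ACGTACGTACGTACGTACGTACGTACG",           # no adapter
    ADAPTER_PREFIX + "ACGTACGTACGTAA",       # adapter dimer (0 bp insert)
    "TTGACCATGACCATGACCATGACCATG",           # no adapter
]

positions = list(adapter_positions(reads))
with_adapter = [p for p in positions if p is not None]
fraction = len(with_adapter) / len(reads)
print(f"{fraction:.0%} of reads contain the adapter; insert lengths: {with_adapter}")
```

The position of the adapter equals the insert length for that read, so a large fraction of small values in R2 would point to short inserts or adapter dimers.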

I haven't tried that yet for this data set. For an earlier data set I ran both sickle and trim_galore to remove low-quality and over-represented reads. The data looked better after that, but it reduced the total library size dramatically (sometimes by more than 50%).

Where can I see the cluster density? Is it something I can get from the sequencer, or do I need to calculate it on my own?

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by Assa Yeroslaviz1.2k

You can download Sequencing Analysis Viewer (SAV) from Illumina (Windows only) and then point it to the folder containing the raw data. You will be able to see a lot more detail from the run. One of the things you can see is the cluster density. Since this is a NextSeq run, that should be between 150-200 K/mm^2.

If this is a new sequencer then these could just be "teething problems" as your techs get used to the instrument, refine concentration estimation, run procedures etc. For any run that looks less than optimal you should contact Illumina tech support and have them remote in (or if your sequencer is disconnected from the network then you will need to send some files in) to look at the run. This helps eliminate hardware/software/reagent issues.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax65k

"for an earlier data set I did run both sickle and trim_galore to remove low quality and over-represented reads. The data looked better after that, but it reduced the total library size dramatically (sometimes more than 50%)."

You should be very careful about low quality and over-represented reads. They may be there for a reason and unless the submitter wants you to remove them there is no reason for a core-facility to even look at them. If the run fails your overall criteria of average quality for a "good run" then that is a different issue.

The data size reduction seems to indicate that you must have a large amount of primer dimers (or short inserts) in that dataset.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax65k

You wrote ChIP-seq but tagged it 'rna-seq'; this is DNA sequencing, right?

ADD REPLYlink written 2.9 years ago by WouterDeCoster38k
0
2.9 years ago by
igor7.6k
United States
igor7.6k wrote:

I would guess it might be a fluidics issue. You should be looking at all the SAV/BaseSpace plots to get a better idea.

Regardless, I would call Illumina. If there is any sequencing-related problem, it's easiest to call them. They may not solve your issue, but they will eliminate a lot of possibilities.

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by igor7.6k

It is bad practice by the lab that did sequencing if they knowingly released data from a run that had hardware issues. No one should do this.

That said this may have been conveyed to @Frymor. We should wait to get clarification from @Frymor before drawing any conclusions.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax65k

I don't think they knowingly released bad data, but they may not have noticed that anything was wrong. Sure, the quality was lower than normal, but it's not a complete failure.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by igor7.6k

I am sure this was not done knowingly. Our sequencing centre did not check the data quality, as they left that for me to do. We do need to change this in the future, mainly to save time, but also so that they only hand over data sets of good quality.

And as @Brian said, the data quality is not a total bust. One can still work with the data and all in all, it is a start. But it would be best to try and understand and solve the problem.

ADD REPLYlink written 2.9 years ago by Assa Yeroslaviz1.2k

If one is a core facility then this is not how you should be looking at the situation. If you were the customer and someone else gave you bad data, I don't think you would appreciate that down the road.

It looks like this is a new in-house instrument and you are the data custodian/guardian of quality :-) You should start getting a QC plan together. Use the NextSeq Spec Sheet as a starting point to decide what a good data set should look like (in terms of yield/quality) and what you would consider a less-than-optimal run.

The ultimate quality of the data depends a lot on what happens on the experimental side of things (as you have already discovered, getting library yields right requires careful concentration/insert size estimation), so expect this process to take a few months before things settle down.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax65k

Hi all, and thanks a lot for the answers, ideas and comments. I think it would be best to explain a bit more before going into the details of your comments.

Our institute is just starting with sequencing (until now we always out-sourced it). We bought a NextSeq 500 (@genomax2 is correct :D ). I am not doing the sequencing, but as the institute's bioinformatician I am responsible for analyzing the data coming out of the machine. The samples were barcoded and multiplexed.

So far we haven't done many runs, but the two runs I have received show exactly the same problem: the R1 reads show better quality than the R2 reads. Another problem occurring in the sequencing runs is the fluctuation in library sizes. We aimed for 20M reads per sample, but we got some with almost 30M and some with as low as 3M reads.

ADD REPLYlink written 2.9 years ago by Assa Yeroslaviz1.2k
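
To put a number on the library size fluctuation described above, something like the following Python sketch could flag samples that are far from the target. The 20M target and the 3M/30M extremes come from the post; the other counts, the sample names, the 25% tolerance and the function name are assumptions for illustration only.

```python
from statistics import mean, pstdev

TARGET = 20_000_000  # reads per sample aimed for in the thread

# Hypothetical per-sample read counts; only the 3M and 30M extremes echo the thread
counts = {
    "sample1": 30_000_000,
    "sample2": 21_000_000,
    "sample3": 3_000_000,
    "sample4": 18_000_000,
}


def off_target(counts, target, tolerance=0.25):
    """Return {sample: count/target} for samples deviating by more than tolerance."""
    return {
        name: n / target
        for name, n in counts.items()
        if abs(n - target) / target > tolerance
    }


flagged = off_target(counts, TARGET)
values = list(counts.values())
cv = pstdev(values) / mean(values)  # spread relative to the mean library size

for name, ratio in flagged.items():
    print(f"{name}: {ratio:.2f}x target")
print(f"coefficient of variation across samples: {cv:.2f}")
```

A high coefficient of variation across samples pooled in the same run would be consistent with uneven pooling, which fits the later advice about careful concentration estimation.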

Maybe you have a faulty instrument. As I mentioned previously, call Illumina and this can be resolved in 15 minutes. They can also access the instrument to get all the necessary data, which none of us can.

Also, learn about SAV and/or BaseSpace. Those are crucial in troubleshooting sequencing issues.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by igor7.6k
0
2.9 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

Read 2 is usually lower quality than read 1. But in this case, for chip-seq, why does it even matter? You have sufficient quality for mapping, assuming the quality scores are accurate (hint: they aren't). You could do the experiment with just read 1 anyway, and even a low-quality read 2 will just increase the accuracy of the mapping. So, don't worry about it for this experiment.

P.S. Have you measured your library diversity?

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by Brian Bushnell16k

Quality scores are fairly accurate for current Illumina instruments. If you do base quality score recalibration, the difference is minimal.

However, the quality is not the main issue. It's just one piece of information we have. Something clearly went wrong. It's important to figure out what. The sequences could be biased.

ADD REPLYlink written 2.9 years ago by igor7.6k

It looks to me like a fluidics issue. And certainly it's prudent to fix such things as soon as possible, but I don't see any reason to think the experiment was compromised.

ADD REPLYlink written 2.9 years ago by Brian Bushnell16k
1

It would be interesting to see the rest of the FastQC report, to rule out possible sequence bias.

ADD REPLYlink written 2.9 years ago by WouterDeCoster38k
2

If this data is from a run that had hardware issues I would not spend any more time on this and insist that the lab re-run this sample.

Illumina replaces sequencing kits for free when there is a hardware issue (provided the lab has a maintenance contract) so the only thing the lab (and you) are out of is time.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax65k

Which part of the FastQC report do you want to see? Unfortunately I am not sure what to look for.

If interested, you can download the FastQC reports from the following folder: In there are the reports for the two samples I posted above. These reports are from the concatenated fastq files.

As mentioned above, the samples were multiplexed and sequenced on four different lanes of the flow cell. I then got four different fastq files for each sample and read (R1/R2), which I concatenated before proceeding. If you think it would be best to see the FastQC reports of the original, non-concatenated fastq files, let me know and I will upload them as well. Thanks, Assa

ADD REPLYlink written 2.9 years ago by Assa Yeroslaviz1.2k
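
For reference, the lane concatenation step described above can be sketched in Python as below. This is only an illustration, not the poster's actual procedure: it assumes bcl2fastq-style file names (`<sample>_L00<lane>_R<1|2>_001.fastq.gz`), and the function name, directory layout and output naming are hypothetical. It relies on the fact that gzip files can be joined byte-wise, and it processes lanes in sorted order so that R1 and R2 stay in sync.

```python
import gzip
import re
import shutil
import tempfile
from collections import defaultdict
from pathlib import Path

# Assumed bcl2fastq-style names: <sample>_L<lane>_R<read>_001.fastq.gz
LANE_PATTERN = re.compile(r"^(?P<sample>.+)_L\d{3}_(?P<read>R[12])_001\.fastq\.gz$")


def concatenate_lanes(fastq_dir, out_dir):
    """Merge per-lane gzip FASTQ files into one file per sample and read."""
    groups = defaultdict(list)
    for path in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        match = LANE_PATTERN.match(path.name)
        if match:
            groups[(match["sample"], match["read"])].append(path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    merged = {}
    for (sample, read), paths in groups.items():
        target = out_dir / f"{sample}_{read}.fastq.gz"
        with open(target, "wb") as out:
            for path in paths:  # sorted, so R1 and R2 share the same lane order
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)  # gzip members can be appended
        merged[(sample, read)] = target
    return merged


# Toy demonstration with two lanes of a hypothetical sample
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    for lane in (1, 2):
        for read in ("R1", "R2"):
            name = f"yeastA_S1_L00{lane}_{read}_001.fastq.gz"
            with gzip.open(raw / name, "wt") as fh:
                fh.write(f"@read{lane}/{read}\nACGT\n+\nIIII\n")
    merged = concatenate_lanes(raw, Path(tmp) / "merged")
    merged_keys = sorted(merged)
    with gzip.open(merged[("yeastA_S1", "R1")], "rt") as fh:
        merged_r1_headers = [line.strip() for line in fh if line.startswith("@")]
    print(merged_r1_headers)
```

Running FastQC on the per-lane files as well as on the merged files would show whether the R2 drop is the same in every lane.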

What does a fluidics issue mean? Is it a technical problem with the sequencer?

I wasn't sure it was the sequencer, as the problem seems very specific to the R2 reads, which makes me think it has something to do with the library preparation. But as I am not an expert, I can't really pinpoint a specific step in the preparation.

ADD REPLYlink written 2.9 years ago by Assa Yeroslaviz1.2k

It matters to us because we need to know what causes this problem. At the moment we are just starting up, but if the problem continues and also happens in other paired-end experiments (such as RNA-seq), we could get artefacts or mismapped reads. In general it is not a good idea, IMHO, to knowingly work with data of lower quality than you could get. The problem with the low-quality read 2 is that a lot of the reads are not mapped as pairs.

How do I calculate the library diversity? Do you mean library complexity? For the first data set, I ran preseq on the raw file before and after removing the low-quality and duplicated reads. I also ran dupRadar to calculate the complexity of the libraries.

Below you can see the preseq summary plot: [image: preseq]

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by Assa Yeroslaviz1.2k
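
As a side note on the "not mapped as pairs" observation, the fraction of properly paired reads can be read off the SAM FLAG field. The sketch below is a minimal illustration using the FLAG bits defined in the SAM specification; the function name and the toy alignment lines are hypothetical. In practice `samtools flagstat` reports the same kind of summary.

```python
PAIRED = 0x1           # read is part of a pair
PROPER_PAIR = 0x2      # both mates aligned as a proper pair
UNMAPPED = 0x4         # this read is unmapped
SECONDARY = 0x100      # secondary alignment
SUPPLEMENTARY = 0x800  # supplementary alignment


def pairing_summary(sam_lines):
    """Count primary paired reads and how many are flagged as properly paired."""
    total = proper = unmapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        flag = int(line.split("\t", 2)[1])
        if not flag & PAIRED or flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each sequenced read only once
        total += 1
        proper += bool(flag & PROPER_PAIR)
        unmapped += bool(flag & UNMAPPED)
    return {"paired_reads": total, "properly_paired": proper, "unmapped": unmapped}


# Toy SAM records (hypothetical)
sam = [
    "@HD\tVN:1.6",
    "r1\t99\tchrI\t100\t60\t4M\t=\t300\t204\tACGT\tIIII",   # proper pair, first mate
    "r1\t147\tchrI\t300\t60\t4M\t=\t100\t-204\tACGT\tII55",  # proper pair, second mate
    "r2\t73\tchrI\t500\t60\t4M\t=\t500\t0\tACGT\tIIII",      # mapped, mate unmapped
    "r2\t133\tchrI\t500\t0\t*\t=\t500\t0\tACGT\tI5##",       # unmapped second mate
    "r1\t355\tchrII\t900\t0\t4M\t=\t300\t0\t*\t*",           # secondary, skipped
]

summary = pairing_summary(sam)
rate = summary["properly_paired"] / summary["paired_reads"]
print(summary, f"properly paired: {rate:.0%}")
```

Comparing this rate before and after trimming R2 would indicate whether the low-quality tail of read 2 is what prevents pairs from mapping.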