Question

Extracting high-quality reads from a partially failed 10X single-cell RNA-seq and ATAC-seq sequencing run

0

5.4 years ago

camerond ▴ 190

I'm looking for advice on the best strategy for salvaging high-quality reads from a 10X single-cell RNA-seq and ATAC-seq experiment that partially failed on the HiSeq 4000.

The issue is mainly with the scRNA-seq data.

On the flow cell we ran 5 lanes for RNA and 3 lanes for ATAC. The failure occurred because each modality requires a different run configuration: scRNA-seq uses a single index, whereas scATAC-seq is dual-indexed, and (we now know!) 10X do not recommend mixing single- and dual-indexed samples on the same flow cell.

Hindsight aside, we ran with dual-index parameters. The ATAC-seq data look great, but for the RNA lanes the quality scores for reads from the entire top half of the flow cell were abysmal. Why? Although 10X themselves have not been able to replicate the issue in house, it appears to be caused by a loss of focus on the upper surface of the flow cell after the i5 read. Others have mentioned the same issue elsewhere.

Our current strategy for salvaging the high-quality reads is to extract the raw data from the sequencer for the good half of one of the RNA-seq lanes and create a new FASTQ file, to see whether Cell Ranger accepts it. But I'm wondering: is this the best strategy? Or is there a way to extract the reads with high quality scores from the FASTQ files that we have already generated?

If the answer is the latter, I'm unsure how to do this considering the forward and reverse reads for single cell data contain different information. This may be a trivial issue.
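One possible way to work from the existing FASTQ files is sketched below, under stated assumptions. Illumina read names carry the tile number in the fifth colon-separated field (`1101` in the reads shown in the edit below), and on HiSeq flow cells the first digit of that four-digit tile number is generally taken to denote the surface (1 = top, 2 = bottom). If the failed "top half" corresponds to the top surface, read pairs could be selected by surface while keeping R1 and R2 in step. The function names are illustrative and not part of Cell Ranger or any other tool.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, qualities


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Group text lines into 4-line FASTQ records.

    An incomplete trailing record is silently dropped.
    """
    stripped = (line.rstrip("\r\n") for line in lines)
    # Passing the same generator four times makes zip consume 4 lines per record.
    return zip(stripped, stripped, stripped, stripped)


def tile_surface(header: str) -> int:
    """Return the surface digit (assumed 1 = top, 2 = bottom) from a read name.

    Assumes the layout @instrument:run:flowcell:lane:tile:x:y and a
    four-digit tile number whose first digit encodes the surface.
    """
    read_name = header.split()[0]
    tile = read_name.split(":")[4]
    return int(tile[0])


def keep_surface(r1_lines: Iterable[str], r2_lines: Iterable[str],
                 surface: int = 2) -> Iterator[Tuple[Record, Record]]:
    """Yield (R1, R2) record pairs whose tile lies on the requested surface."""
    for rec1, rec2 in zip(parse_fastq(r1_lines), parse_fastq(r2_lines)):
        if tile_surface(rec1[0]) == surface:
            yield rec1, rec2
```

In practice `r1_lines` and `r2_lines` would be open file handles, for example from `gzip.open(path, "rt")`, and each kept record would be written back out as four lines to a new pair of FASTQ files.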

Any advice on this issue would be greatly appreciated.

---- Edit in response to ATpoint ----

As you can see the problem is with R2 reads, rather than R1.

Read 1:

@K00267:334:HFH3JBBXY:1:1101:1164:1156 1:N:0:AACCGGAA
NGAGAAGGTTACGATCACCTGGAAGGTC
+
#AAF-FAJJJJFFJJAJF7AAJ7F-<JJ
@K00267:334:HFH3JBBXY:1:1101:1225:1156 1:N:0:GGTTTACT
NGAGCAGGTTGCATCAAGCTGTCCGCCA
+
#AAA7FFJJJJJAJJFFJJJJJJJJJJF
@K00267:334:HFH3JBBXY:1:1101:1265:1156 1:N:0:TCGGCGTC
NGGATGTCAGCTACATATTGACCGTCTT
+
#AAAFJJJJJJJJJJJJJJJJJJJAFFF
@K00267:334:HFH3JBBXY:1:1101:1326:1156 1:N:0:AACCGAAA
NTCTCTAAGCATTTGCAAGCTGTAAGAC
+
#AAAFJFJJJJJJFFAJJFJJJJJJJJF

Read 2:

@K00267:334:HFH3JBBXY:1:1101:1164:1156 2:N:0:AACCGGAA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
###########################################################################################
@K00267:334:HFH3JBBXY:1:1101:1225:1156 2:N:0:GGTTTACT
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
###########################################################################################
@K00267:334:HFH3JBBXY:1:1101:1265:1156 2:N:0:TCGGCGTC
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
###########################################################################################
@K00267:334:HFH3JBBXY:1:1101:1326:1156 2:N:0:AACCGAAA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
###########################################################################################

single cell RNA-seq 10X sequencing • 2.3k views

ADD COMMENT • link 5.4 years ago by camerond ▴ 190

1

You should contact Illumina rather than 10x, they may refund part of the cost. We routinely mix dual and single barcodes on the same flowcells (including with 10x samples) and don't have these sorts of issues.

ADD REPLY • link 5.4 years ago by Devon Ryan 105k

0

We have contacted them, but unfortunately our service contract for the HiSeq 4000 has just ended (we are moving over to the NovaSeq for future runs), so I'm not sure we'll get anything back from them, or whether it's worth trying. We were running this experiment as a test to see whether the samples are good enough for further experiments. It's interesting that you haven't seen this problem before, particularly as 10X themselves can't replicate it either. It may be a combination of factors that causes this (dodgy flow cell, dodgy reagents, single/dual indexing etc.). Just out of interest, what run configuration do you use when running dual- and single-index samples on the same flow cell? Have you run 10X scRNA and scATAC on the same flow cell before?

ADD REPLY • link 5.4 years ago by camerond ▴ 190

1

Ah, losing the service contract limits things a bit. I don't know the exact settings our sequencing core used. We've only recently had scATAC running, so I don't think it was mixed with scRNA-seq.

ADD REPLY • link 5.4 years ago by Devon Ryan 105k

0

No probs. Many Thanks.

ADD REPLY • link 5.4 years ago by camerond ▴ 190

score 3 · Accepted Answer · 2020-02-28

3

5.4 years ago

ATpoint 88k

You can filter by mean base quality; this is probably the simplest option. See, for example, the thread "How to extract reads passing a threshold for Mean Sequence Quality". If you do that in paired-end mode (which you should, to keep R1 and R2 synchronized), be sure to first clip the bases in R1 that lie beyond the barcode/UMI part, i.e. (if this is V3 chemistry) everything beyond the first 28 bp. Otherwise the low-quality bases that come after the BC/UMI will drag down the mean base quality of R1.
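As a rough illustration of that kind of paired filtering (a minimal sketch, not the method from the linked thread), the Python below scores R1 only over its first 28 bases rather than physically clipping it, scores R2 over its full length, and keeps a pair only if both mates pass. It assumes Phred+33 quality encoding and 28 bp of barcode plus UMI; the threshold of 20 and all function names are illustrative choices.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, qualities


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Group text lines into 4-line FASTQ records."""
    stripped = (line.rstrip("\r\n") for line in lines)
    return zip(stripped, stripped, stripped, stripped)


def mean_quality(qual: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores in a quality string (Phred+33 assumed)."""
    if not qual:
        return 0.0
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def filter_pairs(r1_lines: Iterable[str], r2_lines: Iterable[str],
                 min_mean_q: float = 20.0,
                 r1_scored_bases: int = 28) -> Iterator[Tuple[Record, Record]]:
    """Yield read pairs in which both mates reach the mean-quality threshold.

    R1 is scored only over its first `r1_scored_bases` bases (barcode + UMI).
    """
    for rec1, rec2 in zip(parse_fastq(r1_lines), parse_fastq(r2_lines)):
        # Guard against desynchronised inputs by comparing read names.
        if rec1[0].split()[0] != rec2[0].split()[0]:
            raise ValueError(f"R1/R2 out of sync at {rec1[0]!r}")
        q1 = mean_quality(rec1[3][:r1_scored_bases])
        q2 = mean_quality(rec2[3])
        if q1 >= min_mean_q and q2 >= min_mean_q:
            yield rec1, rec2
```

Because pairs are kept or dropped together, the two output files stay in the same order, which downstream tools that read R1 and R2 in lockstep rely on.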

ADD COMMENT • link 5.4 years ago by ATpoint 88k

0

Many thanks @ATpoint. I will try this on Monday. The problem is actually with the R2 reads. The R1 reads are consistently 28 bp long. I tried running the files through Cell Ranger to see what would happen but, in their current state, the software does not recognise the 10X chemistry used in the experiment, even if I state it explicitly (it is V3, btw).

ADD REPLY • link 5.4 years ago by camerond ▴ 190

1

If all R2 reads have N's in them then there is nothing you can do to salvage this data. You are going to need to re-run these samples per 10x recommendations.

If not all reads have N's, then you could filter the reads using bbduk.sh from the BBMap suite. Set maxns=<number> to discard all reads that contain more than that number of N's.
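For readers who want to see the logic of such a filter spelled out, here is a minimal Python sketch of an N-count filter on read pairs. It is not BBMap's implementation, the function names are illustrative, and dropping the whole pair when either mate exceeds the limit is an assumption made here to keep R1 and R2 synchronized.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, qualities


def count_ns(seq: str) -> int:
    """Number of ambiguous (N) base calls in a sequence."""
    return seq.upper().count("N")


def drop_n_rich_pairs(pairs: Iterable[Tuple[Record, Record]],
                      max_ns: int = 1) -> Iterator[Tuple[Record, Record]]:
    """Yield pairs in which neither mate contains more than `max_ns` N's."""
    for rec1, rec2 in pairs:
        if count_ns(rec1[1]) <= max_ns and count_ns(rec2[1]) <= max_ns:
            yield rec1, rec2
```

The default of `max_ns=1` is chosen only because every R1 read in the example above starts with a single N; a limit of 0 would discard those pairs as well.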

ADD REPLY • link 5.4 years ago by GenoMax 152k

0

I'm hoping only half the reads will have Ns in them. I'll def filter these out first. Many thanks for the suggestion.

ADD REPLY • link 5.4 years ago by camerond ▴ 190