Question

NovaSeq read coverage needed for demultiplexing

0

20 months ago

cassie.bishop • 0

Hi,

I am a new graduate student in biology and am relatively new to sequencing in general. I am planning a genome-wide CRISPR screen over many days. I plan to extract genomic DNA from each condition and amplify the sgRNA region with primers that include the Illumina adaptors and a barcode on the reverse primer. I will pool all of these barcoded samples together and run them on a NovaSeq with paired-end 100 bp sequencing. The barcode is most certainly within the first 100 bp of the reverse read, and with paired-end data I should be able to tell which forward read it corresponds to.

I recently sent the library off for sequencing to determine the representation of each sgRNA in the library, using the exact same sequencing parameters. Unfortunately, most of the primer (except the part that annealed to the backbone of the plasmid), including the barcode, was not present in the reverse read. However, I told the core that sequenced my samples which index I used, and that index was in the information line of each record in the fastq file. The core informed me they did no preprocessing of the reads.

So my question: is there any way for the Illumina sequencing machine to know which index is present if it isn't present in the read? Also, my PCR product size corresponds with the whole P7 primer being present in the product, so why isn't some of the reverse primer present in the sequence reads? Do I need to increase the read length when I sequence my screen in order to demultiplex?

Thanks in advance for any help!

barcode NGS demultiplexing sequencing i7 • 1.2k views

updated 20 months ago by Brian Bushnell 20k • written 20 months ago by cassie.bishop • 0

0

is there any way for the Illumina sequencing machine to know which index is present if it isn't present in the read?

No, there isn't, if the index is not in the main reads or the index reads. It sounds like you did not use the standard Illumina indexing scheme, where the indexes are read as separate reads (they are not part of the main sequence for any type of run).

Did you have your constructs validated by your sequencing core, or by someone else knowledgeable about Illumina sequencing, before you went through with the experiment?

It may be possible to salvage this data but we would need to know specific details about your construct, locations of the indexes, sequencing primer and how the sequencing was done.
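
As a first check on what the header lines actually report, here is a minimal sketch that tallies the index strings in a FASTQ file. It assumes the standard Illumina header layout, where the index sequence is the last colon-separated field of each record's first line, and an uncompressed four-line-per-record file. The function name count_header_indexes is made up for illustration.

```shell
# Hypothetical helper: count how often each index string appears in FASTQ headers.
# Assumes 4 lines per record and that the index is the last ':'-separated field
# of the header line (standard Illumina layout).
count_header_indexes() {
  awk 'NR % 4 == 1 { n = split($0, f, ":"); print f[n] }' "$1" \
    | sort | uniq -c | sort -rn
}

# Example use (file name is a placeholder):
#   count_header_indexes r1.fastq
```

If every record carries the same index string, that would be consistent with the index having come from a separate index read (or from the sample sheet) and not from the main reads.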

20 months ago by GenoMax 152k

0

Yes, the constructs were validated; I got them from Addgene. The backbone is the commonly used lentiGuide-Puro plasmid, with sgRNAs inserted at known relative levels based on previous validation. I amplified the library and was trying to determine the new distribution of sgRNAs within the library. I am pretty sure the primers and scheme we are using are widely used (we used a protocol published on Addgene and from the Broad Institute). I only received the fastq files, no index sequence files. A screenshot of the P5 and P7 primers I used is below. The final PCR product is 354 bp and each P5 and P7 primer is about 100 bp.

[image: P5 and P7 primer sequences]

20 months ago by cassie.bishop • 0

1

Good to know that this is not a fully custom design but a standard commercial protocol. Based on the info above, you know where the sequencing primer is, so your reads should start at the base immediately after the sequencing primer site. Once you confirm that is the case, you could separate your reads using the stagger sequences (you should be able to use bbduk.sh from BBTools in filter mode).

Show us a couple of sequence examples from R1 and R2. A link to the Broad/Addgene protocol may also be helpful.
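
A minimal sketch for pulling those examples, assuming uncompressed FASTQ files with four lines per record (the function name first_records and the file names in the comment are placeholders):

```shell
# Hypothetical helper: print the first N records (default 2) of a FASTQ file.
# Assumes the standard 4-lines-per-record layout.
first_records() {
  head -n "$(( ${2:-2} * 4 ))" "$1"
}

# Example use (file names are placeholders):
#   first_records r1.fastq
#   first_records r2.fastq
```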

20 months ago by GenoMax 152k

score 0 · Answer 1 · 2023-11-19

Actually, in this case I'd recommend using Seal (also in BBTools), so it can all be done in one pass:

#Trim the primers so you only get the part where a 31-mer spans the variable junction
reformat.sh in=primers.fa out=trimmed_primers.fa ftl=34

#Demultiplex the reads containing the primer
seal.sh in=r#.fastq ref=trimmed_primers.fa k=31 hdist=1 pattern=out_%_r#.fastq outu=unmatched_r#.fastq

These commands assume you have everything in twin files named r1.fastq and r2.fastq; Seal replaces the # symbol with a 1 and 2. The % symbol gets replaced by the primer sequence name, so you would get output files like:

out_P5_0nt_stagger_r1.fastq
out_P5_0nt_stagger_r2.fastq
out_P5_1nt_stagger_r1.fastq
out_P5_1nt_stagger_r2.fastq

...etc. As for why your primers are not showing up in all of your reads, I don't know. You might want to try generating an insert-size histogram by merging, if you expect the paired reads to overlap:

bbmerge.sh in=r#.fastq ihist=ihist.txt

If I'm understanding your protocol correctly (which is doubtful), you are expecting an insert of 154 bp (354 - 100 - 100), so 2x100 bp reads should overlap by roughly 46 bp.
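
To compare the histogram against that expectation, here is a rough sketch that reports what fraction of merged pairs fall near a given insert size. It assumes ihist.txt has header lines starting with # followed by tab-separated "insert size, count" rows; check your own file before relying on it. The function name ihist_fraction_near is hypothetical.

```shell
# Hypothetical helper: fraction of merged pairs whose insert size is within
# +/- window of a center value.
# Assumes '#'-prefixed header lines and tab-separated "size<TAB>count" data rows.
# $1 = histogram file, $2 = center (e.g. 154), $3 = window (e.g. 5)
ihist_fraction_near() {
  awk -F'\t' -v c="$2" -v w="$3" '
    /^#/ || NF < 2 { next }                 # skip header and malformed lines
    {
      total += $2
      d = $1 - c; if (d < 0) d = -d         # absolute distance from center
      if (d <= w) near += $2
    }
    END { if (total > 0) printf "%.3f\n", near / total }
  ' "$1"
}

# Example use:
#   ihist_fraction_near ihist.txt 154 5
```

A value close to 1 would suggest the library matches the expected construct; a much lower value would point to a different product size than assumed.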