Aligning short NGS reads to multiple (100,000) references
2.3 years ago
x_ma_x

Hi all. Apologies if my question is basic or my terminology isn't 100% right, as I'm fairly new to the world of NGS.

Basically, I will have a library of, let's say, 100,000 clones, each differing by a 100 bp sequence. This 100 bp section of the library will be randomly mutagenized so that each clone carries a single random mutation (obviously this won't be perfectly achievable: many clones will remain WT, and many will have more than one mutation).

What I'll then need to do is run this on a MiSeq and analyse it in a way that tells me both the coverage of the new mutated library and what percentage of the new library is mutated to the extent that I want (i.e. a single mutation). I am not sure what tool I can use to align a fastq to 100,000 references, not to mention any analysis down the line.

Thanks!

alignment miseq reference

Sounds like a classical mutation-calling workflow: just use a single reference with your insert and perform mutation calling for each sample. You won't know the reference until you've sequenced them anyway, right?
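For illustration, a minimal sketch of that kind of workflow, assuming a single reference FASTA (insert_ref.fa) and single-end reads (reads.fq); the file names are placeholders and the tool choice (bwa, samtools, bcftools) is just one common combination:

    # align the reads to the single insert reference
    bwa index insert_ref.fa
    bwa mem insert_ref.fa reads.fq | samtools sort -o aligned.bam
    samtools index aligned.bam

    # call the mutations relative to that reference
    bcftools mpileup -f insert_ref.fa aligned.bam | bcftools call -mv -Oz -o calls.vcf.gz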

I will actually know the reference - the 100,000 inserts are cloned oligos with specific sequences.

Not sure I understand the other part though - as far as I understand, mutation calling works for a single reference, but in my case there will be 100,000 short reference sequences. The whole sequencing will be done with just one sample - the full library PCRed with a single set of MiSeq adapters.

2.3 years ago
GenoMax

Even though the MiSeq is a sequencing champ and can handle difficult templates, this sort of low-nucleotide-diversity sample can be a potential run killer. Be sure to include an ample phiX spike-in at run time.

Once you get the data, you may need to approach this a different way. I suggest the following:

  1. Find and remove the common sequence (use bbduk.sh to trim) so that you keep just the 100 bp of interest (unless you need to keep that data, i.e. there is a barcode in it). If you are going to use paired-end reads, then you may want to merge them first using bbmerge.sh.
  2. Use a program like clumpify.sh from the BBMap suite (to work with fastq data) or CD-HIT (if you convert the sequences to fasta) to reduce the dataset to unique representatives (a command sketch follows below).
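As a rough sketch of those two steps with BBTools, assuming paired-end reads (R1.fq.gz, R2.fq.gz) and a primers.fa file containing the common handle sequences; the file names and k-mer trimming parameters are placeholders to adjust for your design:

    # merge overlapping read pairs into single full-length inserts
    bbmerge.sh in1=R1.fq.gz in2=R2.fq.gz out=merged.fq.gz outu=unmerged.fq.gz

    # trim the common handle sequence from the left end, then from the right end
    bbduk.sh in=merged.fq.gz out=ltrimmed.fq.gz ref=primers.fa ktrim=l k=21 mink=11 hdist=1
    bbduk.sh in=ltrimmed.fq.gz out=insert_only.fq.gz ref=primers.fa ktrim=r k=21 mink=11 hdist=1

    # reduce to unique representatives
    clumpify.sh in=insert_only.fq.gz out=unique.fq.gz dedupe=t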

You could then do a multiple sequence alignment of the unique representatives against the reference, or a read alignment against the references, to find the mutations.
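A sketch of the read-alignment route, assuming the 100,000 references are concatenated into one multi-FASTA (all_references.fa) and the trimmed reads are in insert_only.fq.gz; the NM tag used below is the standard edit-distance tag written by most aligners, but it is worth checking your aligner's output:

    # align against all references at once (BBMap builds its index in memory with nodisk=t)
    bbmap.sh ref=all_references.fa in=insert_only.fq.gz out=mapped.sam nodisk=t

    # coverage of the library: reads assigned to each reference
    samtools view -F 4 mapped.sam | cut -f3 | sort | uniq -c > coverage_per_reference.txt

    # mutation load: histogram of edit distance per read (NM:i:1 = a single mutation)
    samtools view -F 4 mapped.sam | grep -o 'NM:i:[0-9]*' | cut -d: -f3 | sort -n | uniq -c > mutations_per_read.txt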

Why would a library of 100,000 different sequences be considered low nucleotide diversity? Those sequences will all be completely different, so I expect an approximately equal distribution of the four bases at each cycle. I actually did some trial runs already (to optimise the random mutagenesis process) with just a single plasmid as the PCR template - in some cases 95% of the sequences were exactly the same, but overall the sequencing ran fine (that was with 10% PhiX).

As for the tips, step 1 of course makes sense, but I'm not sure step 2 will be a good idea in my case - keeping only unique representatives will not allow me to do any sort of quantitative analysis of each mutation. I might try it either way.

What I still don't understand, however, is how I can easily do an alignment against 100,000 reference sequences at once.

"Why would a library of 100,000 different sequences be considered low nucleotide diversity?"

You had said that each clone differs by a 100 bp sequence, which indicates that there is a common leading or trailing sequence. I had imagined that common section occupying a fixed part of the read, and my comment was made with reference to that. If you have already done a test run, then that is obviously not applicable in your case.

clumpify.sh has an option to count the reads of each type (i.e. anything that differs by at least one bp). You can keep a single representative with the count added to the header, or simply clump the sequences without removing duplicates. That would be your choice.
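For example (assuming the trimmed reads from the steps above are in insert_only.fq.gz; addcount=t is the flag that appends the copy count to the read name, though the exact header format can differ between BBTools versions):

    # keep one representative per unique sequence, recording how many copies were seen
    clumpify.sh in=insert_only.fq.gz out=unique_counted.fq.gz dedupe=t addcount=t

    # pull the per-sequence copy counts out of the fastq headers for quantitative analysis
    zcat unique_counted.fq.gz | awk 'NR % 4 == 1' | grep -o 'copies=[0-9]*' > copy_counts.txt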

Most aligners should not care whether you have one or 100K references. As long as the sequences have unique headers and you build the aligner index, you should be able to align the data.
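For example, a quick header check and index build, assuming the references are concatenated into all_references.fa; bwa is shown here as one common choice, but any short-read aligner would work the same way:

    # confirm that every reference header is unique (this should print nothing)
    grep '^>' all_references.fa | sort | uniq -d

    # build the index once, then align the library fastq against all 100,000 references
    bwa index all_references.fa
    bwa mem all_references.fa insert_only.fq.gz > mapped_bwa.sam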

Ah my bad, I was not thinking straight when I made that reply! Indeed the sequences have universal primer handles, so yes, about 25% of the sequence will be common to every item in the library (minus the random mutations). Of course, I also have to make the read a bit longer than 100 cycles to account for possible insertions.

By the way, what kind of PhiX spike should I be aiming for in this case? As I mentioned I used 10% in my trial experiments and it worked reasonably well, but maybe that's too low?

Thanks for the helpful tips, will try to have a go once I have the data next week.

If 10% phiX has worked in the test then you can stick with that.
