Hi,
I received a FASTQ file from long-range PCR followed by long-read sequencing (read lengths up to 64 kb). The longest reads appear to be two or more amplicons that were ligated together during library prep. I'm looking for an efficient way to scan every read in the FASTQ, split each read at its ligation points, and write a new file containing only single amplicons. Single amplicons look like this:
index1-Fprimer-sequence-Rprimer-index2,
so ligated amplicons look something like:
index1-primer-sequence-primer-index2-index3-primer-sequence-primer-index4-
(and so on, depending on the number of ligated amplicons). The primers are the same throughout the FASTQ (one gene, long-read sequencing across the whole gene), but multiple samples were pooled, hence the different indexes. The output file doesn't necessarily have to bin reads by sample, although that would be nice. I can always run minibar demultiplexing on the output file once the amplicons are separated.
Does anyone know of a tool that can do this efficiently, or should I write something myself? If the latter, any ideas on strategy? I was thinking of making a list of possible index-primer combinations (allowing for mismatches, of course) and using str.split or something along those lines. Any help appreciated. :)
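To make the idea concrete, here is a rough Python sketch of the strategy I have in mind. It is not a tested tool. Since str.split can't tolerate mismatches, it scans each read for approximate copies of the forward primer and cuts a fixed number of bases upstream of every copy after the first. All names here (hamming_hits, split_read, split_fastq) are hypothetical, and it assumes several things that may not hold for my data: indexes have a known fixed length (index_len), every amplicon is in forward orientation, sequencing errors are substitutions only, and the FASTQ uses plain four-line records. Long reads usually contain indels, so a real version would probably need an edit-distance search instead of the Hamming comparison used here, plus a second pass for reverse-complemented amplicons.

```python
def hamming_hits(seq, primer, max_mismatches):
    """Start positions where primer matches seq with at most max_mismatches
    substitutions. Hits overlapping a previous hit are skipped."""
    hits = []
    plen = len(primer)
    for i in range(len(seq) - plen + 1):
        if hits and i < hits[-1] + plen:
            continue  # still inside the previous hit
        mismatches = 0
        for a, b in zip(seq[i:i + plen], primer):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            hits.append(i)  # inner loop finished without exceeding the limit
    return hits


def split_read(seq, qual, fprimer, index_len, max_mismatches=2):
    """Split one read into (sequence, quality) pieces, one per amplicon.
    Assumes each amplicon starts index_len bases before a forward primer."""
    starts = hamming_hits(seq, fprimer, max_mismatches)
    # The first primer belongs to the first amplicon, so cut only before later ones.
    cuts = [max(0, s - index_len) for s in starts[1:]]
    bounds = [0] + cuts + [len(seq)]
    return [(seq[a:b], qual[a:b]) for a, b in zip(bounds, bounds[1:]) if b > a]


def split_fastq(in_path, out_path, fprimer, index_len, max_mismatches=2):
    """Read a four-line-per-record FASTQ and write the split pieces."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            header = fin.readline().rstrip()
            if not header:
                break
            seq = fin.readline().rstrip()
            fin.readline()  # '+' separator line
            qual = fin.readline().rstrip()
            name = header.split()[0]
            pieces = split_read(seq, qual, fprimer, index_len, max_mismatches)
            for n, (s, q) in enumerate(pieces, 1):
                fout.write(f"{name}_part{n}\n{s}\n+\n{q}\n")
```

The resulting file could then go through minibar for demultiplexing as usual. The brute-force scan is probably too slow for 64 kb reads at scale, so I would still prefer an existing tool if one does this properly.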