Hello everyone!
I am faced with something I haven't tried before, and I am unsure how to proceed.
I am working with a Long-PCR product, which has been sequenced on a MiSeq instrument with a PCR-Free kit.
The amplicon is about 15 kb and contains regions of high homology to other parts of the genome, so aligning to the whole genome produced a big mess.
I am currently aligning only to my 15 kb reference (a custom FASTA file of my region of interest) with Bowtie2. I have processed the alignments with Samtools/Picard and called variants with both GATK and Samtools to compare. There are a few differences, but I think my biggest problem is the indels: GATK calls no deletions, while Samtools calls deletions but few insertions.
So I tried GATK's local realignment, but I get a blank file. I could feed it the known indels file (golden_indels.vcf), but because I am only doing a targeted alignment, its genome coordinates do not match my reference. Is there a way around this? (I have been converting my coordinates before annotating VCF files, but I think there must be a better way.)
I was thinking: could I align to the chromosome of interest and specify a region? I can't seem to find such an option (GATK can restrict variant calling to a certain region, but I see no equivalent for alignment with Bowtie).
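For reference, the coordinate conversion mentioned above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the custom reference is a contiguous, forward-strand slice of one chromosome. The function name, the chromosome name and the offset are hypothetical. The offset is the 1-based genomic position of the amplicon's first base minus 1. Header lines (including any ##contig entries) are passed through unchanged, so they would still need fixing by hand.

```python
def shift_vcf_line(line, chrom, offset):
    """Map one VCF record from amplicon coordinates to genome coordinates.

    Assumes the amplicon reference is a forward-strand slice of `chrom`
    and that `offset` = (genomic start of the amplicon, 1-based) - 1.
    """
    if line.startswith("#"):
        # Header lines are returned untouched (##contig lines are NOT updated).
        return line
    fields = line.rstrip("\n").split("\t")
    fields[0] = chrom                           # CHROM column
    fields[1] = str(int(fields[1]) + offset)    # POS column (1-based)
    return "\t".join(fields) + "\n"


# Hypothetical example: amplicon starting at position 1,000,001 of "chr7".
record = "amplicon\t150\t.\tA\tG\t50\tPASS\t.\n"
print(shift_vcf_line(record, "chr7", 1_000_000), end="")
```

Because REF and ALT are left as they are, this only works when the amplicon sequence is identical to, and in the same orientation as, the genome reference over that interval.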
I am also interested to know whether anyone has tried analyzing data from a similar set-up:
What are the caveats?
Which workflow would you suggest?
What has been your experience with analyzing NGS results for Long-PCR products in general?
How did the mapping/coverage/quality/duplications look?
I know it is a lot to ask, so any partial answer is also very much appreciated.
Best regards,
LR
Thank you for your answer! Yes, I was considering this, but wasn't sure if it was the right thing to do. And my PCR product, well, it is probably not 100% specific, which was apparent when the alignment was made to the whole genome. But once I aligned only to my target region, the nonspecific calls were very much reduced (I am also doing Sanger sequencing, to compare).
Seems like a reasonable approach. If you're not already doing so, you might be a bit more stringent than normal on filtering by MAPQ. That should further decrease the false positive rate due to off-target reads misaligning to the amplicon area. I guess the Sanger sequencing will end up telling you how well this ended up working :)
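To make the MAPQ suggestion concrete, here is a rough sketch that filters SAM-format text lines on the MAPQ column (the fifth tab-separated field). The function name and the threshold of 30 are assumptions for illustration, not a recommended value. In practice `samtools view -q` applies the same kind of cutoff directly to SAM/BAM files.

```python
def filter_by_mapq(sam_lines, min_mapq=30):
    """Yield SAM header lines plus alignments with MAPQ >= min_mapq.

    `min_mapq=30` is an illustrative default, not a recommendation.
    """
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are always kept.
            yield line
            continue
        fields = line.split("\t")
        if int(fields[4]) >= min_mapq:   # column 5 of a SAM record is MAPQ
            yield line


# Tiny made-up example: one header line and two reads.
example = [
    "@HD\tVN:1.6\n",
    "read1\t0\tamplicon\t100\t42\t50M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t0\tamplicon\t200\t5\t50M\t*\t0\t0\tACGT\tIIII\n",
]
for kept in filter_by_mapq(example, min_mapq=30):
    print(kept, end="")
```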
Thanks for the tip! My alignment has a very high depth, so throwing some stuff out is not a big deal :)
I did throw out duplicates even though it is a PCR-free sequencing kit. I am not sure whether I am just throwing data away or whether it actually improves anything.
PS. I am very thankful for your comments, but I hope it is ok I leave the subject open and see if anyone else has any input.
With higher depth, it's likely that reads flagged as PCR duplicates aren't actually duplicates (especially if you only have single-end reads). But marking them is unlikely to hurt in your case unless you're looking for rare changes in a heterogeneous population (e.g., complex cancer samples).
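A back-of-the-envelope sketch of why this happens on a short target: with single-end reads, a duplicate is typically identified by start position and strand, so a 15 kb amplicon offers only about 30,000 distinct "slots". The function below is a simplified model, assuming fragment starts are uniformly random across the amplicon. The function name and the read count in the example are hypothetical.

```python
def expected_chance_duplicate_fraction(n_reads, amplicon_length, strands=2):
    """Expected fraction of reads sharing a start/strand with another read
    purely by chance, assuming uniformly random start positions."""
    slots = amplicon_length * strands
    # Expected number of distinct slots hit by n_reads random draws.
    distinct = slots * (1.0 - (1.0 - 1.0 / slots) ** n_reads)
    # Every read beyond the first in a slot would be flagged as a duplicate.
    return 1.0 - distinct / n_reads


# Hypothetical run: 300,000 single-end reads on a 15,000 bp amplicon.
print(expected_chance_duplicate_fraction(300_000, 15_000))
```

Under these assumptions, once the read count is about ten times the number of slots, roughly 90% of reads would be flagged even without any PCR duplication. Paired-end data add the mate position to the duplicate key, which greatly increases the number of slots.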
Absolutely keep the thread open! The more replies the merrier :)