Analysis of NGS data from a Long-PCR product
9.5 years ago
dra.explota ▴ 40

Hello everyone!

I am faced with something I haven't tried before, and I am unsure how to proceed.

I am working with a Long-PCR product, which has been sequenced on a MiSeq instrument with a PCR-Free kit.

The amplicon is about 15 kb and contains regions of high homology to other areas of the genome, so aligning to the whole genome resulted in a big mess.

I am currently trying to align only to my 15 kb reference (I made a custom FASTA file of my region of interest) with Bowtie2, and I have used Samtools/Picard and called variants with both GATK and Samtools to compare. There are a few differences, but I think my biggest problem is with some indels: GATK calls no deletions, while Samtools calls deletions but few insertions. So I tried the local realignment of GATK, but I get a blank file. Of course, I could feed it the known indels file (golden_indels.vcf), but as I am only doing a targeted alignment, the coordinates do not match. Is there a way to get around this? (I have been converting my coordinates before annotating VCF files, but I am thinking there must be a better way.)

I was thinking, could I align to the chromosome of interest and specify a region? I can't seem to find such an option (GATK has one for calling variants only in a certain region, but I can't find one for alignment with Bowtie2).

I am also interested to know if anyone has tried analyzing data from a similar set-up:

What are the caveats?

Which workflow would you suggest?

What has been your experience with analyzing NGS results for Long-PCR products in general?

How did the mapping/coverage/quality/duplications look?

I know it is a lot to ask, so any partial answer is also very much appreciated.

Best regards,
LR

Bowtie2 next-gen GATK PCR-Free Long-PCR • 2.8k views
9.5 years ago

Bowtie2 can't be told to align only to a specific part of the genome. If you want to align to just that part and keep the coordinates of the resulting alignments correct, you'll have to hard-mask (i.e., replace with Ns) all the sequence not covered by your amplicon. This is probably easier to do with a single chromosome: (1) write a run of Ns up to the amplicon start, (2) use samtools faidx to get the amplicon reference sequence, and (3) write more Ns until the sequence is the right length.
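
The three masking steps can be sketched in Python roughly as follows. This is only an illustration, assuming the chromosome sequence is already loaded as a plain string; the function names, the toy sequence, and the coordinates are all made up for the example, and `start`/`end` are taken to be 1-based and inclusive, like a samtools region string.

```python
def hard_mask_outside(chrom_seq: str, start: int, end: int) -> str:
    """Return chrom_seq with every base outside [start, end] replaced by N.

    start and end are 1-based and inclusive (an assumption of this sketch).
    """
    if not 1 <= start <= end <= len(chrom_seq):
        raise ValueError("amplicon coordinates fall outside the chromosome")
    left = "N" * (start - 1)                # (1) Ns up to the amplicon start
    amplicon = chrom_seq[start - 1:end]     # (2) the amplicon sequence itself
    right = "N" * (len(chrom_seq) - end)    # (3) Ns out to the chromosome end
    return left + amplicon + right


def write_fasta(name: str, seq: str, path: str, width: int = 60) -> None:
    """Write a single-record FASTA file with fixed-width sequence lines."""
    with open(path, "w") as handle:
        handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Toy example: a 20 bp "chromosome" with a hypothetical amplicon at 5-12.
toy_chrom = "ACGTACGTACGTACGTACGT"
masked = hard_mask_outside(toy_chrom, 5, 12)
print(masked)
```

Because the masked sequence has the same length as the original chromosome, alignment positions stay in chromosome coordinates. The masked sequence would then be written out with something like `write_fasta` and indexed with Bowtie2 as usual.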

I've not done this, so I can't give any caveats aside from you needing to be sure that the cleaned PCR product is really only of the target region (otherwise, you'll get false-positive alignments and wrong variant calls!).


Thank you for your answer! Yes, I was considering this, but wasn't sure if it was the right thing to do. And my PCR product, well, it is probably not 100% specific, which was apparent when the alignment was made to the whole genome. But once I align only to my target region, the unspecific calls were very much reduced (I am doing Sanger also, to compare).


Seems like a reasonable approach. If you're not already doing so, you might be a bit more stringent than normal on filtering by MAPQ. That should further decrease the false positive rate due to off-target reads misaligning to the amplicon area. I guess the Sanger sequencing will end up telling you how well this ended up working :)
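
As a rough illustration of what MAPQ filtering does, here is a small Python sketch that works on SAM text lines, where MAPQ is the fifth tab-separated column. The function name, the toy records, and the threshold of 30 are all assumptions for the example, not a recommendation; on a real BAM file the usual route is `samtools view -q`.

```python
def filter_sam_by_mapq(sam_lines, min_mapq=30):
    """Yield header lines plus alignments whose MAPQ is at least min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines carry no MAPQ, so pass them through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:  # column 5 of a SAM record is MAPQ
            yield line


# Toy SAM records (hypothetical reads on a reference named "amplicon").
toy_sam = [
    "@SQ\tSN:amplicon\tLN:15000",
    "read1\t0\tamplicon\t100\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t0\tamplicon\t200\t3\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
kept = list(filter_sam_by_mapq(toy_sam, min_mapq=30))
for line in kept:
    print(line.split("\t")[0])
```

Here the low-MAPQ read is dropped while the header and the confidently placed read are kept. What threshold is sensible depends on the aligner's MAPQ scale and on how similar the off-target regions are.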


Thanks for the tip! My alignment has a very high depth, so throwing some stuff out is not a big deal :)

I did throw out duplicates, although it is a PCR-free sequencing kit. I am not sure whether I am just throwing data away for no reason or whether it makes any improvement.

PS. I am very thankful for your comments, but I hope it is ok I leave the subject open and see if anyone else has any input.


With higher depth, it's likely that reads flagged as PCR duplicates aren't actually duplicates (especially if you only have single-end reads). But marking them is unlikely to hurt things in your case unless you're looking for rare changes in a heterogeneous population (e.g., complex cancer samples).
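
To put a rough number on that, here is a back-of-the-envelope sketch in Python. It assumes single-end reads whose duplicate status depends only on start position and strand, and that start sites are spread uniformly over a 15 kb amplicon; both are simplifications, and the function name and read counts are made up for the illustration.

```python
def expected_coincidental_duplicate_fraction(n_reads: int, n_start_sites: int) -> float:
    """Expected fraction of reads sharing a (start, strand) with an earlier read.

    Assumes every read picks one of n_start_sites uniformly and independently,
    so any read landing on an already-used site would be marked as a duplicate.
    """
    if n_reads <= 0 or n_start_sites <= 0:
        raise ValueError("n_reads and n_start_sites must be positive")
    # Expected number of distinct sites hit by n_reads uniform draws.
    expected_unique = n_start_sites * (1.0 - (1.0 - 1.0 / n_start_sites) ** n_reads)
    return 1.0 - expected_unique / n_reads


# 15 kb amplicon, two strands: about 30,000 possible (start, strand) pairs.
sites = 15_000 * 2
for n in (1_000, 30_000, 1_000_000):
    fraction = expected_coincidental_duplicate_fraction(n, sites)
    print(f"{n} reads: {fraction:.3f} expected to be flagged by chance")
```

Under these assumptions, once the number of reads approaches the number of possible start sites, a large share of reads gets flagged as duplicates purely by chance, which is why duplicate marking on a very deep, small target can discard genuine data.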

Absolutely keep the thread open! The more replies the merrier :)
