Hi everyone,
Have a millions of SNPs that were called using sequences aligned against an early version of a genome assembly.
My task is to figure out the position of these SNPs within a newer version of the assembly, and compare SNPs called using the old assembly to SNPs called using the newer assembly.
I realise this is unlikely to capture everything, but what I’ve done so far is BLAST the SNPs using their flanking sequences (500bp either side of the variant = 1001 bp query) against a custom database using the following options:
blastn -db ${db_folder}/Db -query $blast_input -outfmt 6 -perc_identity 95 -num_threads 4 -out ${blast_input}.out
After filtering out the short read matches I’m fairly happy with the output (i.e. I don’t have thousands of hits per query), but placing the SNPs within the new assembly with any degree of accuracy has me stumped.
Example output:
query1 chrom6 98.312 948 4 10 1 939 404 1348 0.0 1652
The length of qstart/qend doesn’t align with sstart/send, due to the mismatches and gap openings.
Is there any way of deducing from this information alone where I can expect the SNP to be in the subject assembly?
Alternatively, is there a way of forcing BLAST to only compare the entire sequence length, but allow some mismatches?
For instance, I tried using the -word_size option with -perc_identity set to 95, however it only returned 100% hits.
I used vmatch on a much smaller dataset and found the output of that programme easier to interpret, as it compared the query and subject lengths on a like-for-like basis, but with mismatches allowed, however it’s very slow compared to BLAST.
I realise I could also search the SNPs called using the newer assembly for the regions given in the BLAST output, however that often returns several candidate SNPs and I'd like to narrow the search as much as possible.
Any advice would be much appreciated!
Is this a common genome, such as human/mouse for which liftOver chain files exist? https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/picard_vcf_LiftoverVcf.php
bwa-sw is another tried and tested option for a SNP remapping approach. Warning - remapping SNVs and their flanks is a lossy procedure.
I have I believe read a couple of papers suggesting only a complete remapping of reads will lead to the desired result. Is there some reason you can't do this ? SNVs from the biased old reference sequence which were not found back then cannot obviously be remapped.
Oh dear, this sounds a bit beyond my experience for the time frame I have (days). Can you give me any pointers on where to start?
Unfortunately not, and the newer version isn't publicly available yet.