Question

Placing SNPs within updated genome assembly

0

Entering edit mode

4.7 years ago

pigsnstuff • 0

Hi everyone,

Have a millions of SNPs that were called using sequences aligned against an early version of a genome assembly.

My task is to figure out the position of these SNPs within a newer version of the assembly, and compare SNPs called using the old assembly to SNPs called using the newer assembly.

I realise this is unlikely to capture everything, but what I’ve done so far is BLAST the SNPs using their flanking sequences (500bp either side of the variant = 1001 bp query) against a custom database using the following options:

blastn -db ${db_folder}/Db -query $blast_input -outfmt 6 -perc_identity 95 -num_threads 4 -out ${blast_input}.out

After filtering out the short read matches I’m fairly happy with the output (i.e. I don’t have thousands of hits per query), but placing the SNPs within the new assembly with any degree of accuracy has me stumped.

Example output:

query1  chrom6  98.312  948     4       10      1       939     404     1348    0.0     1652

The length of qstart/qend doesn’t align with sstart/send, due to the mismatches and gap openings.

Is there any way of deducing from this information alone where I can expect the SNP to be in the subject assembly?

Alternatively, is there a way of forcing BLAST to only compare the entire sequence length, but allow some mismatches?

For instance, I tried using the -word_size option with -perc_identity set to 95, however it only returned 100% hits.

I used vmatch on a much smaller dataset and found the output of that programme easier to interpret, as it compared the query and subject lengths on a like-for-like basis, but with mismatches allowed, however it’s very slow compared to BLAST.

I realise I could also search the SNPs called using the newer assembly for the regions given in the BLAST output, however that often returns several candidate SNPs and I'd like to narrow the search as much as possible.

Any advice would be much appreciated!

BLAST vmatch snps genome assembly • 984 views

ADD COMMENT • link 4.7 years ago by pigsnstuff • 0

1

Entering edit mode

Is this a common genome, such as human/mouse for which liftOver chain files exist? https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/picard_vcf_LiftoverVcf.php

ADD REPLY • link 4.7 years ago by ATpoint 81k

0

Entering edit mode

bwa-sw is another tried and tested option for a SNP remapping approach. Warning - remapping SNVs and their flanks is a lossy procedure.

I have I believe read a couple of papers suggesting only a complete remapping of reads will lead to the desired result. Is there some reason you can't do this ? SNVs from the biased old reference sequence which were not found back then cannot obviously be remapped.

ADD REPLY • link 4.7 years ago by colindaven 6.3k

0

Entering edit mode

Oh dear, this sounds a bit beyond my experience for the time frame I have (days). Can you give me any pointers on where to start?