Question: Placing SNPs within updated genome assembly
gravatar for pigsnstuff
6 months ago by
pigsnstuff0 wrote:

Hi everyone,

Have a millions of SNPs that were called using sequences aligned against an early version of a genome assembly.

My task is to figure out the position of these SNPs within a newer version of the assembly, and compare SNPs called using the old assembly to SNPs called using the newer assembly.

I realise this is unlikely to capture everything, but what I’ve done so far is BLAST the SNPs using their flanking sequences (500bp either side of the variant = 1001 bp query) against a custom database using the following options:

blastn -db ${db_folder}/Db -query $blast_input -outfmt 6 -perc_identity 95 -num_threads 4 -out ${blast_input}.out

After filtering out the short read matches I’m fairly happy with the output (i.e. I don’t have thousands of hits per query), but placing the SNPs within the new assembly with any degree of accuracy has me stumped.

Example output:

query1  chrom6  98.312  948     4       10      1       939     404     1348    0.0     1652

The length of qstart/qend doesn’t align with sstart/send, due to the mismatches and gap openings.

Is there any way of deducing from this information alone where I can expect the SNP to be in the subject assembly?

Alternatively, is there a way of forcing BLAST to only compare the entire sequence length, but allow some mismatches?

For instance, I tried using the -word_size option with -perc_identity set to 95, however it only returned 100% hits.

I used vmatch on a much smaller dataset and found the output of that programme easier to interpret, as it compared the query and subject lengths on a like-for-like basis, but with mismatches allowed, however it’s very slow compared to BLAST.

I realise I could also search the SNPs called using the newer assembly for the regions given in the BLAST output, however that often returns several candidate SNPs and I'd like to narrow the search as much as possible.

Any advice would be much appreciated!

vmatch blast snps assembly genome • 177 views
ADD COMMENTlink written 6 months ago by pigsnstuff0

Is this a common genome, such as human/mouse for which liftOver chain files exist?

ADD REPLYlink written 6 months ago by ATpoint28k

bwa-sw is another tried and tested option for a SNP remapping approach. Warning - remapping SNVs and their flanks is a lossy procedure.

I have I believe read a couple of papers suggesting only a complete remapping of reads will lead to the desired result. Is there some reason you can't do this ? SNVs from the biased old reference sequence which were not found back then cannot obviously be remapped.

ADD REPLYlink written 6 months ago by colindaven1.9k

Oh dear, this sounds a bit beyond my experience for the time frame I have (days). Can you give me any pointers on where to start?

ADD REPLYlink written 6 months ago by pigsnstuff0

Unfortunately not, and the newer version isn't publicly available yet.

ADD REPLYlink written 6 months ago by pigsnstuff0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 652 users visited in the last hour