how to extract REF and ALT DNA string around SNV
1
0
Entering edit mode
23 months ago

Hi, everyone

I want to reproduce the operation about Motif identification at positions of SNVs, which is the part of "The chromatin accessibility landscape of primary human cancers"

The detailed method is below:

To identify motifs whose binding affinity is predicted to be most perturbed by a given SNV, we constructed a REF and VAR string (+/- 10 bp from the SNV, 21 bp in length) using the hg38 BSGenome reference sequence. We then used motifmatchr with CIS-BP motifs to calculate the motif positions and motif scores (representing the significance of the motif match) using motifmatchr “matchMotifs(positions = “out”, p.cutoff=0.01)” in both the REF and VAR string. Then, we intersected the motif positions with the SNV in both the REF and VAR string to ensure the motifs overlap the SNV. For each motif present, the motif’s score differential was computed if the motif was present at the same position in both the REF and VAR. If there was no motif present at the same position in both the REF and VAR, the differential was compared to a motif score of “0”.

For now, I just wan to know how to constructed a REF and VAR string (+/- 10 bp from the SNV, 21 bp in length). I try to use injectSNPs from BSgenome, but it need the SNPlocs-class, which is hard to get for me. So I am wondering whether some one can give me some other advices.

PS: the bioconductor support link I posted about SNPlocs-class is here

Best wishes

Guandong Shang

SNV motif • 926 views
1
Entering edit mode

You may be able to use GATK FastaAlternateReferenceMaker (LINK) for this.

0
Entering edit mode

Thanks! this look good :) I will try it.

2
Entering edit mode
23 months ago
David Parry ▴ 120

With a simple python script:

import sys
import pysam
from pyfaidx import Fasta

def main(vcf, fasta):
fa = Fasta(fasta, as_raw=True)
with pysam.VariantFile(vcf) as variants:
for record in variants:
if len(record.alleles) != 2: continue
if len(record.ref) != 1 or len(record.alts[0]) != 1: continue
ref_string = fa[record.chrom][record.start - 10:record.stop + 10]
alt_string = ref_string[:10] + record.alts[0] + ref_string[11:]
print("\t".join((record.chrom, str(record.pos), ref_string, alt_string)))

if __name__ == '__main__':
if len(sys.argv) != 3:
sys.exit("Usage: {} input.vcf reference.fa".format(sys.argv[0]))
main(*sys.argv[1:])


You will need to install pysam and pyfaidx (e.g. python3 -m pip install pysam pyfaidx). The output gives you chromosome, snp position, reference and alternate strings separated by tabs.

0
Entering edit mode

Thanks! I will try it.