Question: How to center SNP position from a tag pair and add flanking sequence using reference genome
0
gravatar for grmunjal
3.4 years ago by
grmunjal0
United States
grmunjal0 wrote:

I have a fasta file containing tag pairs (63 identical nucleotides + 1 SNP) like so:

>TP2_query

CAGCAAAAAAAAAAAAAAAAAAATGAAACTTCTGAGAGGGAAAAATTACTCAACCTTATATTCT

>TP2_hit

CAGCAAAAAAAAAAAAAAAAAAATGTAACTTCTGAGAGGGAAAAATTACTCAACCTTATATTCT

I would like to produce 1 sequence per tag pair that is centered on the SNP (can be either allele) and flanked by 50 bp (coming from the reference genome) on either side. I'm thinking of this as a two step process:

(1) to determine the SNP position in the tag pair and the corresponding coordinates in the reference genome

(2) adding flanking sequence from the reference using the coordinates from (1)

Any ideas about (1) would be greatly appreciated.

I found bedtools flank is able to do (2) in the easiest possible way (maybe samtools faidx too). The command needs a "genome" file for the species (Medicago in my case). If I understand correctly, this file is simply tab delimited maximum numeric values of coordinates for each chromosome? I can figure out how to do this with some cheap python or R code but wondering if there's a neat unix solution for this.

snp • 1.3k views
ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by grmunjal0
0
gravatar for grmunjal
3.4 years ago by
grmunjal0
United States
grmunjal0 wrote:

Figured out (2) using http://kevin-gattaca.blogspot.com/2011/07/genomecoveragebed-to-look-at-coverage.html and discussion here https://groups.google.com/forum/#!topic/bedtools-discuss/E4spPI5rBQ8

ADD COMMENTlink written 3.4 years ago by grmunjal0
0
gravatar for grmunjal
3.4 years ago by
grmunjal0
United States
grmunjal0 wrote:

For 1 thinking of adding mapping start position coordinate to within read position using something like (from https://github.com/enormandeau/Scripts/blob/master/uneak_locate_snp_positions.py ):

def find_snp_position(s1, s2):
  assert len(s1) == len(s2), "Sequences have different lengths"
  snps = []
  for i in xrange(len(s1)):
  if s1[i] != s2[i]:
  snps.append(i + 1)
  assert len(snps) > 0, "No SNP found"
  assert len(snps) <= 1, "More than one SNP found"
  return snps[0]

 

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by grmunjal0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2502 users visited in the last hour