Question: Getting BasePairs/SNPs/Variants from Chromosome Positions
gravatar for dnlsmy
4.2 years ago by
United States
dnlsmy10 wrote:

Hi, I need some help automating and speeding up my data analysis. 

For now, I track variants from BAM files using mpileup (samtools) and obtain a CSV (converted from VCF using BCFTools) with a structure like this:


So it looks like this 



What I want to do is add two more fields, one for CONTEXT (which would give me the BP before and after the original chromosome position) and whether or not the original BP is a Common SNP.

So it would look like this 




My problem is that for now, I have to lookup each position manually using the USCS Genome Browser (, zoom out 3x, and then manually copy the leading and trailing BP and check if it registers as a Common SNP. This gets me the Context and if the original chromosome position was a SNP or not. I want this part automated, but I don't know what the best way to go about this would be. I have my reference sequence (GRCh37-lite.fa) but I'm not sure how to go about extracting individual BP from it. 

snp sequence genome • 1.8k views
ADD COMMENTlink modified 17 months ago by Biostar ♦♦ 20 • written 4.2 years ago by dnlsmy10
gravatar for Alex Reynolds
4.2 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

Maybe start with your VCF. You could convert to BED, do a bedmap operation on a table of common SNPs, and then do a nucleotide lookup. With that information, you can then use awk and paste make your end product in the format you need.

1. Convert the VCF file to a seven-column BED file:

$ vcf2bed < variants.vcf | cut -f1-7 - > variants.bed7

2. Get the dbSNPs into a sorted BED file:

$ mysql -h -A -u genome -D hg19 -e 'select chrom,chromStart,chromEnd,name from snp141' | awk '{if ($2==$3) $3++; print $0}'| sort-bed - > snps.bed

3. Map variants to SNPs with bedmap:

$ bedmap --indicator variants.bed7 snps.bed > mapped_variants.indicator

The file mapped_variants.indicator is a single-column text file with a 0 or 1, depending on if there is no overlap (0) or overlap (1) with a dbSNP annotation. Each row lines up with the original variant in variants.bed7.

For convenience, we can convert this to a YES/NO result:

$ awk '{if ($1=="0") print "NO"; else print "YES";}' mapped_variants.indicator > mapped_variants.yesno

4. Retrieve 3 nt centered on the variant position.

First, pad the variant coordinates by one base on both sides:

$ bedops --everything --range 1 variants.bed7 > padded_variants.bed7

Second, run the padded variants through bed2fasta or similar to get the nucleotides:

$ --genome-file hg19 padded_variants.bed7 > padded_variants.fasta
$ grep -v '^>' padded_variants.fasta > padded_variants.nt

5. You now have everything you need to make your final product:

* The file variants.bed7 has the ALT allele in the seventh column (along with the chromosome and position in the first and second columns).
* The file mapped_variants.yesno indicates if the variant overlaps a common SNP.
* The file padded_variants.nt has the three nucleotides centered on the variant position.

We can use awk to preprocess the BED file, then pipe this to a paste command, which glues this result to the nucleotides and yes/no call:

$ awk '{print $1":"$2","$7}' variants.bed7 | paste -d, - padded_variants.nt mapped_variants.yesno > answer.txt

The file answer.txt should be in your specified format.

ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by Alex Reynolds28k

Wow, this is an awesome response and a bit over my head in terms of new tools. 

ADD REPLYlink written 4.2 years ago by dnlsmy10

Do each step on its own and then run head to look at the first few lines of the resulting file. Then you can see what these tools are doing.

Once you're a bit more familiar with the tools, you could glue these steps together into a script to automate (other than the dbSNP download step, which you could just do once).

ADD REPLYlink written 4.2 years ago by Alex Reynolds28k
gravatar for Chris Miller
4.2 years ago by
Chris Miller20k
Washington University in St. Louis, MO
Chris Miller20k wrote:

- 'samtools faidx' is the command you want for pulling flanking sequence.

- To annotate dbSNP positions, you can download the dbSNP vcf file then use a tool like bedtools or joinx to intersect them.

ADD COMMENTlink written 4.2 years ago by Chris Miller20k


This is a good simple solution that I was looking for. I just didn't know if Samtools would be the best tool to use in this case, but since it spits out the BPs so cleanly, it's certainly viable. 

ADD REPLYlink written 4.2 years ago by dnlsmy10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 926 users visited in the last hour