Question

Getting BasePairs/SNPs/Variants from Chromosome Positions

0

Entering edit mode

9.3 years ago

dnlsmy ▴ 10

Hi, I need some help automating and speeding up my data analysis.

For now, I track variants from BAM files using mpileup (samtools) and obtain a CSV (converted from VCF using BCFTools) with a structure like this:

CHROM:POS,ORIGINALBP/VARIANTBP

So it looks like this

5:112043282,C/G

What I want to do is add two more fields, one for CONTEXT (which would give me the BP before and after the original chromosome position) and whether or not the original BP is a Common SNP.

So it would look like this

CHROM:POS,ORIGINALBP/VARIANTBP,CONTEXT,SNP
5:112043282,C/G,TCG,YES

My problem is that for now, I have to lookup each position manually using the USCS Genome Browser (https://genome.ucsc.edu), zoom out 3x, and then manually copy the leading and trailing BP and check if it registers as a Common SNP. This gets me the Context and if the original chromosome position was a SNP or not. I want this part automated, but I don't know what the best way to go about this would be. I have my reference sequence (GRCh37-lite.fa) but I'm not sure how to go about extracting individual BP from it.

SNP genome sequence • 3.5k views

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by dnlsmy ▴ 10

1

Entering edit mode

9.3 years ago

Chris Miller 22k

- 'samtools faidx' is the command you want for pulling flanking sequence.

- To annotate dbSNP positions, you can download the dbSNP vcf file then use a tool like bedtools or joinx to intersect them.

ADD COMMENT • link 9.3 years ago by Chris Miller 22k

0

Entering edit mode

Thanks!

This is a good simple solution that I was looking for. I just didn't know if Samtools would be the best tool to use in this case, but since it spits out the BPs so cleanly, it's certainly viable.

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by dnlsmy ▴ 10

Ram · Accepted Answer · 2015-02-03

Maybe start with your VCF. You could convert to BED, do a bedmap operation on a table of common SNPs, and then do a nucleotide lookup. With that information, you can then use awk and paste make your end product in the format you need.

1. Convert the VCF file to a seven-column BED file:

$ vcf2bed < variants.vcf | cut -f1-7 - > variants.bed7

2. Get the dbSNPs into a sorted BED file:

$ mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg19 -e 'select chrom,chromStart,chromEnd,name from snp141' | awk '{if ($2==$3) $3++; print $0}'| sort-bed - > snps.bed

3. Map variants to SNPs with bedmap:

$ bedmap --indicator variants.bed7 snps.bed > mapped_variants.indicator

The file mapped_variants.indicator is a single-column text file with a 0 or 1, depending on if there is no overlap (0) or overlap (1) with a dbSNP annotation. Each row lines up with the original variant in variants.bed7.

For convenience, we can convert this to a YES/NO result:

$ awk '{if ($1=="0") print "NO"; else print "YES";}' mapped_variants.indicator > mapped_variants.yesno

4. Retrieve 3 nt centered on the variant position.

First, pad the variant coordinates by one base on both sides:

$ bedops --everything --range 1 variants.bed7 > padded_variants.bed7

Second, run the padded variants through bed2fasta or similar to get the nucleotides:

$ bed2fasta.py --genome-file hg19 padded_variants.bed7 > padded_variants.fasta
$ grep -v '^>' padded_variants.fasta > padded_variants.nt

5. You now have everything you need to make your final product:

The file variants.bed7 has the ALT allele in the seventh column (along with the chromosome and position in the first and second columns).
The file mapped_variants.yesno indicates if the variant overlaps a common SNP.
The file padded_variants.nt has the three nucleotides centered on the variant position.

We can use awk to preprocess the BED file, then pipe this to a paste command, which glues this result to the nucleotides and yes/no call:

$ awk '{print $1":"$2","$7}' variants.bed7 | paste -d, - padded_variants.nt mapped_variants.yesno > answer.txt

The file answer.txt should be in your specified format.