Getting BasePairs/SNPs/Variants from Chromosome Positions
2
0
Entering edit mode
9.3 years ago
dnlsmy ▴ 10

Hi, I need some help automating and speeding up my data analysis.

For now, I track variants from BAM files using mpileup (samtools) and obtain a CSV (converted from VCF using BCFTools) with a structure like this:

CHROM:POS,ORIGINALBP/VARIANTBP

So it looks like this

5:112043282,C/G

What I want to do is add two more fields, one for CONTEXT (which would give me the BP before and after the original chromosome position) and whether or not the original BP is a Common SNP.

So it would look like this

CHROM:POS,ORIGINALBP/VARIANTBP,CONTEXT,SNP
5:112043282,C/G,TCG,YES

My problem is that for now, I have to lookup each position manually using the USCS Genome Browser (https://genome.ucsc.edu), zoom out 3x, and then manually copy the leading and trailing BP and check if it registers as a Common SNP. This gets me the Context and if the original chromosome position was a SNP or not. I want this part automated, but I don't know what the best way to go about this would be. I have my reference sequence (GRCh37-lite.fa) but I'm not sure how to go about extracting individual BP from it.

SNP genome sequence • 3.5k views
ADD COMMENT
5
Entering edit mode
9.3 years ago

Maybe start with your VCF. You could convert to BED, do a bedmap operation on a table of common SNPs, and then do a nucleotide lookup. With that information, you can then use awk and paste make your end product in the format you need.

1. Convert the VCF file to a seven-column BED file:

$ vcf2bed < variants.vcf | cut -f1-7 - > variants.bed7

2. Get the dbSNPs into a sorted BED file:

$ mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg19 -e 'select chrom,chromStart,chromEnd,name from snp141' | awk '{if ($2==$3) $3++; print $0}'| sort-bed - > snps.bed

3. Map variants to SNPs with bedmap:

$ bedmap --indicator variants.bed7 snps.bed > mapped_variants.indicator

The file mapped_variants.indicator is a single-column text file with a 0 or 1, depending on if there is no overlap (0) or overlap (1) with a dbSNP annotation. Each row lines up with the original variant in variants.bed7.

For convenience, we can convert this to a YES/NO result:

$ awk '{if ($1=="0") print "NO"; else print "YES";}' mapped_variants.indicator > mapped_variants.yesno

4. Retrieve 3 nt centered on the variant position.

First, pad the variant coordinates by one base on both sides:

$ bedops --everything --range 1 variants.bed7 > padded_variants.bed7

Second, run the padded variants through bed2fasta or similar to get the nucleotides:

$ bed2fasta.py --genome-file hg19 padded_variants.bed7 > padded_variants.fasta
$ grep -v '^>' padded_variants.fasta > padded_variants.nt

5. You now have everything you need to make your final product:

  • The file variants.bed7 has the ALT allele in the seventh column (along with the chromosome and position in the first and second columns).
  • The file mapped_variants.yesno indicates if the variant overlaps a common SNP.
  • The file padded_variants.nt has the three nucleotides centered on the variant position.

We can use awk to preprocess the BED file, then pipe this to a paste command, which glues this result to the nucleotides and yes/no call:

$ awk '{print $1":"$2","$7}' variants.bed7 | paste -d, - padded_variants.nt mapped_variants.yesno > answer.txt

The file answer.txt should be in your specified format.

ADD COMMENT
0
Entering edit mode

Wow, this is an awesome response and a bit over my head in terms of new tools.

ADD REPLY
0
Entering edit mode

Do each step on its own and then run head to look at the first few lines of the resulting file. Then you can see what these tools are doing.

Once you're a bit more familiar with the tools, you could glue these steps together into a script to automate (other than the dbSNP download step, which you could just do once).

ADD REPLY
1
Entering edit mode
9.3 years ago

- 'samtools faidx' is the command you want for pulling flanking sequence.

- To annotate dbSNP positions, you can download the dbSNP vcf file then use a tool like bedtools or joinx to intersect them.

ADD COMMENT
0
Entering edit mode

Thanks!

This is a good simple solution that I was looking for. I just didn't know if Samtools would be the best tool to use in this case, but since it spits out the BPs so cleanly, it's certainly viable.

ADD REPLY

Login before adding your answer.

Traffic: 1793 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6