Dbsnp132 1000Genomes Vcf File Info Field: Bitfield Structure
4
5
Entering edit mode
11.7 years ago
Biomed 4.8k

The new 1000Genomes+dbSNP132 resource (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/v4.0/ByChromosomeNoGeno/00-All.vcf.gz) has dbSNP byte structured information about snps.

One such example is VP=050100000a0105051a000100

Considering the information found here ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf can someone please help me decypher this byte coded information about a snp. The goal is to parse this data (using python) and use it in evaluating different snps in down stream variant annotations. Thanks

dbsnp vcf genome • 6.3k views
4
Entering edit mode
9.3 years ago
John St. John ★ 1.2k

Ok I deserve some props. I think I got this to work in python without the need for bitshifting.

The trick is that you need to first convert the hex string in the VP=... field into *bytes, which involves two hex characters at a time. Next I noticed that although the bytes are arranged and read left to right, as the documentation states, the bits are arranged right-to-left!! Blaah! That made my life kinda tough for a while. The following python function does two things:

1. It converts the pairs of hex digits into bytes
2. It reverses the order of each byte string, so that it follows the same numbering format as the documentation for easy array based lookup

This also negates the need to do bitshifting, and is probably a significantly slower than it could be as a result.

def getBitsFromVP2(infoarr):
for item in infoarr:
if item.startswith("VP="):
infoParts = item[3:]
F0 = infoParts[0:2]
F1_1 = infoParts[2:4]
F1_2 = infoParts[4:6]
F2_1 = infoParts[6:8]
F2_2 = infoParts[8:10]
F3 = infoParts[10:12]
F4 = infoParts[12:14]
F5 = infoParts[14:16]
F6 = infoParts[16:18]
F7 = infoParts[18:20]
F8 = infoParts[20:22]
F9 = infoParts[22:24]
return "".join([bin(int(F0,16))[2:].rjust(8,"0")[::-1],
bin(int(F1_1,16))[2:].rjust(8,"0")[::-1],
bin(int(F1_2,16))[2:].rjust(8,"0")[::-1],
bin(int(F2_1,16))[2:].rjust(8,"0")[::-1],
bin(int(F2_2,16))[2:].rjust(8,"0")[::-1],
bin(int(F3,16))[2:].rjust(8,"0")[::-1],
bin(int(F4,16))[2:].rjust(8,"0")[::-1],
bin(int(F5,16))[2:].rjust(8,"0")[::-1],
bin(int(F6,16))[2:].rjust(8,"0")[::-1],
bin(int(F7,16))[2:].rjust(8,"0")[::-1],
bin(int(F8,16))[2:].rjust(8,"0")[::-1],
bin(int(F9,16))[2:].rjust(8,"0")[::-1]])
return ""


So for example you could do the following:

vpBits = getBitsFromVP2(["VP=050060000a01000002110100"])


And if you want to examine this by byte, you could do that as well:

bytesArray = [vpBits[i:i+8] for i in range(0,96,8)]


Now each of the bytes in the above array corresponds to a byte in NCBI's documentation, and additionally the bits within these bytes can be accessed with string array lookups on a particular position in a particular byte corresponding to NCBI's documentation rather than necessitating bitshifting on that byte: ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_v5.5.pdf

1
Entering edit mode

I don't understand why you're reversing the bin representation of each field. Did you find that documented somewhere?

1
Entering edit mode

I also wonder why the answer's author feels the need to reverse the bits. My best guess is that he is unfamiliar with the convention that, these days, bit ordering is typically most significant bit first. This is the convention followed in Python; for example, try

for i in range(9):
print(i, bin(i))


This is why, in the PDFs linked here, the bits are indexed in decreasing order from top to bottom (a 90-degree rotation of how they would be indexed from right to left). I can see how this would be confusing, and dbSNP should provide more explicit documentation.

3
Entering edit mode
11.7 years ago

The bitField is handled in the NCBI C++ toolkit sources, in the snp_bitfield* files from http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/gui/objutils

You'll find here all the methods and functions to handle this field.

Update: From http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/gui/objutils/snp_bitfield_factory.cpp#L49, you can see a factory for a bitfield object. Its returns an implementation for CSnpBitfield::IEncoding. This implementation is function of the size+version(=first charchter) of the bitField. e.g. it could be a CSnpBitfield2 http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/gui/objutils/snp_bitfield_2.cpp#L80. In the constructor for this class , the string is parsed and an array of bytes is created.

Extracting the data from an array of bytes is a common task with the shift operator: see http://www.tutorialspoint.com/python/python_basic_operators.htm

1
Entering edit mode

Thanks for the link Pierre but I need to parse this bitfield data (using python) that is found in the vcf file's VP field and make sense out of it. Frankly this ftp site got me more confused about how to do this. Given the example above can you or someone else please elaborate a little bit more on how to do this practically? Thanks again.

1
Entering edit mode

Does anyone have info on decoding this field in perl, or bioperl?

1
Entering edit mode
2
Entering edit mode
8.1 years ago
cmdcolin ★ 2.3k

I was working on the dbsnp bitfield recently in javascript for a custom jbrowse implementation, and what I realized is that this isn't a true bitfield. It's actually 24 characters long, containing 12 different bitfields, each 2 character chunk corresponding to a byte encoded as hex. Therefore, I made a very similar answer to the python answer above, but I thought I'd just reiterate that for anyone in the future:

var ncbi_encoding=parseInt(name.substr(0,2),16);
var gene_function1=parseInt(name.substr(6,2),16);
var gene_function2=parseInt(name.substr(8,2),16);
var mapping=parseInt(name.substr(10,2),16);
var frequency=parseInt(name.substr(12,2),16);
var genotype=parseInt(name.substr(14,2),16);
var hapmap=parseInt(name.substr(16,2),16);
var phenotype=parseInt(name.substr(18,2),16);
var var_class=parseInt(name.substr(20,2),16);
var quality=parseInt(name.substr(22,2),16);


Edit 2014-08-19: I did not need to reverse bits as I originally posted. You can just use regular "numerical" bit operations on the variables above. The python answer above converts the variables into an 8 character string in bytesArray so for example, bytesArray[0] would be a string of length 8 representing binary bits going from left to right (which is pretty unnatural IMO). In mine, but you can just instead use bitwise operations on the variables in my answer.

For example, in the python answer for the VP string 050060000a01000002110100, you get

bytesArray= ['10100000', '00000000', '00000110', '00000000', '01010000', '10000000', '00000000', '00000000', '01000000', '10001000', '10000000', '00000000']


In mine, you have

[ncbi_encoding, ncbi_links_1, ncbi_links_2, gene_function1, gene_function2, mapping, frequency, genotype, hapmap, phenotype, var_class, quality]


which is

[5, 0, 96, 0, 10, 1, 0, 0, 2, 17, 1, 0]

1
Entering edit mode
11.6 years ago
apfejes ▴ 160

Actually, I'm not sure it's worth extracting the information from that field in the first place. All of the information in that field is ALSO written in plain text formatting right next to the binary field, which makes it entirely redundant - and not worth the effort of parsing it.

eg:

1 10327 rs112750067 T C . . dbSNPBuildID=132;VP=050000020005000000000100;WGT=1;VC=SNP;R5;ASP


In the above case, the binary field, when deciphered should only give you back the WGT, VC, R5 and ASP flags.

So, yes, it can be done, and it's not hard, but for now, there's no point in dealing with the VP field at all.

1
Entering edit mode

Thanks for the useful answer.

1
Entering edit mode

that's not true is it? the VP field tells the function (STOP, frameshift, missense, etc), and a number of other things that arent in the text values.

1
Entering edit mode

At the point I wrote it, they were using text flags that mirrored what was in the VP field, so it was redundant. That no longer appears to be the case.