Question

dbSNP132 1000Genomes VCF File INFO Field: Bitfield Structure

14.6 years ago

Biomed 5.0k

The new 1000Genomes+dbSNP132 resource (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/v4.0/ByChromosomeNoGeno/00-All.vcf.gz) carries dbSNP's byte-structured information about SNPs in the VP entry of the INFO field.

One such example is VP=050100000a0105051a000100

Considering the information found here ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf can someone please help me decipher this byte-coded information about a SNP? The goal is to parse this data (using Python) and use it to evaluate different SNPs in downstream variant annotation. Thanks

dbsnp vcf genome • 8.6k views

updated 10.9 years ago by cmdcolin ★ 4.2k • written 14.6 years ago by Biomed 5.0k

Ram · Answer 1 · 2013-05-01

Ok I deserve some props. I think I got this to work in python without the need for bitshifting.

The trick is that you first need to convert the hex string in the VP=... field into bytes, taking two hex characters at a time. Next I noticed that although the bytes are arranged and read left to right, as the documentation states, the bits within each byte are arranged right to left!! Blaah! That made my life kinda tough for a while. The following Python function does two things:

It converts the pairs of hex digits into bytes
It reverses the order of each byte string, so that it follows the same numbering format as the documentation for easy array based lookup

This also removes the need for bitshifting, though it is probably significantly slower than it could be as a result.

def getBitsFromVP2(infoarr):
    """Return the VP bits as one string, with the bits of each byte reversed."""
    for item in infoarr:
        if item.startswith("VP="):
            infoParts = item[3:]
            # Split the 24 hex characters into 12 two-character bytes
            # (F0, F1_1, F1_2, F2_1, F2_2, F3 ... F9 in NCBI's document).
            hexBytes = [infoParts[i:i+2] for i in range(0, 24, 2)]
            # Render each byte as 8 binary digits, then reverse the string
            # so that index 0 of each byte is its lowest bit.
            return "".join(bin(int(h, 16))[2:].rjust(8, "0")[::-1]
                           for h in hexBytes)
    # No VP= entry was found in the INFO items.
    return ""

So for example you could do the following:

vpBits = getBitsFromVP2(["VP=050060000a01000002110100"])

And if you want to examine this by byte, you could do that as well:

bytesArray = [vpBits[i:i+8] for i in range(0,96,8)]

Each string in the above array now corresponds to a byte in NCBI's documentation, and any bit within a byte can be read by indexing that string at the position given in the documentation, rather than by bitshifting the byte: ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_v5.5.pdf
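
As a usage sketch of that kind of lookup: the helper names `vp_to_bit_strings` and `bit_from_string` below are made up for illustration, and no meaning is assigned to any individual bit here (that has to come from the NCBI PDF). It only assumes the layout described above, i.e. two hex characters per byte and each byte's bit string reversed so that index 0 is the lowest bit.

```python
def vp_to_bit_strings(vp_hex):
    """Split a VP hex string into per-byte bit strings, lowest bit first."""
    return [format(int(vp_hex[i:i + 2], 16), "08b")[::-1]
            for i in range(0, len(vp_hex), 2)]


def bit_from_string(bit_strings, byte_index, bit_index):
    """Return True if bit `bit_index` (0 = lowest bit) of the given byte is set."""
    return bit_strings[byte_index][bit_index] == "1"


bytes_array = vp_to_bit_strings("050060000a01000002110100")
print(bytes_array[0])                      # 10100000 (byte 0x05, reversed)
print(bit_from_string(bytes_array, 0, 0))  # True
print(bit_from_string(bytes_array, 0, 1))  # False
```

Which byte and bit index to pass for a given property is exactly what the bitfield document tabulates.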

Ram · Answer 2 · 2010-11-24

14.6 years ago

Pierre Lindenbaum 166k

The bitField is handled in the NCBI C++ toolkit sources, in the snp_bitfield* files from http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/gui/objutils

You'll find here all the methods and functions to handle this field.

Update: From http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/gui/objutils/snp_bitfield_factory.cpp#L49, you can see a factory for a bitfield object. It returns an implementation of CSnpBitfield::IEncoding. Which implementation you get depends on the size and version (= first character) of the bitfield, e.g. it could be a CSnpBitfield2: http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/gui/objutils/snp_bitfield_2.cpp#L80. In the constructor for this class, the string is parsed and an array of bytes is created.

Extracting the data from an array of bytes is a common task with the shift operator: see http://www.tutorialspoint.com/python/python_basic_operators.htm
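
A minimal Python sketch of that shift-and-mask approach, using the VP value from the question. The helper names `vp_bytes` and `bit_is_set` are hypothetical and are not taken from the NCBI toolkit, and the byte index used is arbitrary; what each bit means still has to be looked up in the bitfield specification.

```python
def vp_bytes(vp_hex):
    """Decode the VP hex string into raw bytes (two hex characters per byte)."""
    return bytes.fromhex(vp_hex)


def bit_is_set(data, byte_index, bit_index):
    """Shift the wanted bit down to position 0 and mask it; 0 = lowest bit."""
    return bool((data[byte_index] >> bit_index) & 1)


data = vp_bytes("050100000a0105051a000100")
print(len(data))   # 12
print(data[8])     # 26 (0x1a)

# List which bits are set in the byte at index 8.
flags = [bit for bit in range(8) if bit_is_set(data, 8, bit)]
print(flags)       # [1, 3, 4]
```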

updated 5.8 years ago by Ram 45k • written 14.6 years ago by Pierre Lindenbaum 166k

Thanks for the link Pierre but I need to parse this bitfield data (using python) that is found in the vcf file's VP field and make sense out of it. Frankly this ftp site got me more confused about how to do this. Given the example above can you or someone else please elaborate a little bit more on how to do this practically? Thanks again.

14.6 years ago by Biomed 5.0k

Does anyone have info on decoding this field in Perl or BioPerl?

14.6 years ago by Krisr ▴ 470

Here is an up-to-date link for snp_bitfield_2.cpp: http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objtools/snputil/snp_bitfield_2.cpp

12.2 years ago by John St. John ★ 1.2k

Ram · Answer 3 · 2014-07-30

I was working on the dbSNP bitfield recently in JavaScript for a custom JBrowse implementation, and what I realized is that this isn't a single true bitfield. It's a 24-character string containing 12 different bitfields, each 2-character chunk being one byte encoded as hex. So my answer is very similar to the Python answer above, but I thought I'd reiterate it for anyone in the future:

// name holds the 24-character VP hex string, e.g. "050060000a01000002110100"
var ncbi_encoding=parseInt(name.substr(0,2),16);
var ncbi_links_1=parseInt(name.substr(2,2),16);
var ncbi_links_2=parseInt(name.substr(4,2),16);
var gene_function1=parseInt(name.substr(6,2),16);
var gene_function2=parseInt(name.substr(8,2),16);
var mapping=parseInt(name.substr(10,2),16);
var frequency=parseInt(name.substr(12,2),16);
var genotype=parseInt(name.substr(14,2),16);
var hapmap=parseInt(name.substr(16,2),16);
var phenotype=parseInt(name.substr(18,2),16);
var var_class=parseInt(name.substr(20,2),16);
var quality=parseInt(name.substr(22,2),16);

Edit 2014-08-19: I did not need to reverse bits as I originally posted. You can just use regular "numerical" bit operations on the variables above. The Python answer above converts each byte into an 8-character string in bytesArray, so for example bytesArray[0] is a string of length 8 holding the binary digits with the lowest bit on the left (which is pretty unnatural IMO). In mine, you can instead use bitwise operations directly on the variables.

For example, in the python answer for the VP string 050060000a01000002110100, you get

bytesArray= ['10100000', '00000000', '00000110', '00000000', '01010000', '10000000', '00000000', '00000000', '01000000', '10001000', '10000000', '00000000']

In mine, you have

[ncbi_encoding, ncbi_links_1, ncbi_links_2, gene_function1, gene_function2, mapping, frequency, genotype, hapmap, phenotype, var_class, quality]

which is

[5, 0, 96, 0, 10, 1, 0, 0, 2, 17, 1, 0]
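
For anyone who wants the same by-name layout in Python, here is a rough sketch. The field names are simply copied from the JavaScript variables above, `decode_vp_fields` and `set_bits` are made-up helper names, and nothing here says what an individual bit means.

```python
FIELD_NAMES = [
    "ncbi_encoding", "ncbi_links_1", "ncbi_links_2",
    "gene_function1", "gene_function2", "mapping",
    "frequency", "genotype", "hapmap",
    "phenotype", "var_class", "quality",
]


def decode_vp_fields(vp_hex):
    """Map each of the 12 byte names to its integer value."""
    return dict(zip(FIELD_NAMES, bytes.fromhex(vp_hex)))


def set_bits(value):
    """List the positions (0 = lowest bit) of the bits set in one byte."""
    return [bit for bit in range(8) if value & (1 << bit)]


fields = decode_vp_fields("050060000a01000002110100")
print(list(fields.values()))             # [5, 0, 96, 0, 10, 1, 0, 0, 2, 17, 1, 0]
print(set_bits(fields["ncbi_links_2"]))  # [5, 6]
```

The positions returned by `set_bits` are the same indices at which the reversed strings in bytesArray contain a '1'.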

Ram · Answer 4 · 2011-01-20

14.5 years ago

apfejes ▴ 160

Actually, I'm not sure it's worth extracting the information from that field in the first place. All of the information in that field is ALSO written in plain text formatting right next to the binary field, which makes it entirely redundant - and not worth the effort of parsing it.

eg:

1 10327 rs112750067 T C . . dbSNPBuildID=132;VP=050000020005000000000100;WGT=1;VC=SNP;R5;ASP

In the above case, the binary field, when deciphered, should give you back only the WGT, VC, R5 and ASP flags.
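
If you want to pull the VP value and the plain-text entries out of such a line to compare them, a small Python sketch could look like the following. `parse_info` is a hypothetical helper, and the line is split on whitespace only because that is how it is pasted here; a real VCF file is tab-delimited, so `line.split("\t")` would be the safer choice there.

```python
def parse_info(info):
    """Split a VCF INFO string into key=value entries and bare flags."""
    values, flags = {}, []
    for entry in info.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            values[key] = value
        elif entry:
            flags.append(entry)
    return values, flags


line = ("1 10327 rs112750067 T C . . "
        "dbSNPBuildID=132;VP=050000020005000000000100;WGT=1;VC=SNP;R5;ASP")
info = line.split()[7]  # INFO is the 8th column
values, flags = parse_info(info)
print(values["VP"])     # 050000020005000000000100
print(flags)            # ['R5', 'ASP']
```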

So, yes, it can be done, and it's not hard, but for now, there's no point in dealing with the VP field at all.

updated 5.8 years ago by Ram 45k • written 14.5 years ago by apfejes ▴ 160

Thanks for the useful answer.

14.5 years ago by Biomed 5.0k

That's not true, is it? The VP field tells you the function (STOP, frameshift, missense, etc.) and a number of other things that aren't in the text values.

11.9 years ago by brentp 24k

At the point I wrote it, they were using text flags that mirrored what was in the VP field, so it was redundant. That no longer appears to be the case.

9.1 years ago by apfejes ▴ 160