Question: dbSNP132 1000Genomes VCF File INFO Field: Bitfield Structure
5
9.4 years ago by
Biomed4.6k
Bethesda, MD, USA
Biomed4.6k wrote:

The new 1000Genomes+dbSNP132 resource (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/v4.0/ByChromosomeNoGeno/00-All.vcf.gz) carries dbSNP's byte-structured information about each SNP in the INFO column.

One such example is VP=050100000a0105051a000100

Considering the information found here ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf, can someone please help me decipher this byte-coded information about a SNP? The goal is to parse this data (using Python) and use it to evaluate different SNPs in downstream variant annotation. Thanks

vcf genome dbsnp • 5.3k views
ADD COMMENTlink modified 5.7 years ago by cmdcolin1.3k • written 9.4 years ago by Biomed4.6k
4
6.9 years ago by
John St. John1.2k
San Francisco, CA, Cancer Therapeutics Innovation Group
John St. John1.2k wrote:

Ok I deserve some props. I think I got this to work in python without the need for bitshifting.

The trick is that you first need to convert the hex string in the VP=... field into bytes, taking two hex characters at a time. Next I noticed that although the bytes are arranged and read left to right, as the documentation states, the bits within each byte are arranged right to left!! Blaah! That made my life kinda tough for a while. The following Python function does two things:

  1. It converts the pairs of hex digits into bytes
  2. It reverses the order of each byte string, so that it follows the same numbering format as the documentation for easy array based lookup

This also removes the need for bitshifting, and as a result is probably significantly slower than it could be.

def getBitsFromVP2(infoarr):
    # infoarr is the INFO column split on ";". Returns the 96 VP bits as one
    # string, or "" if there is no VP entry.
    for item in infoarr:
        if item.startswith("VP="):
            infoParts = item[3:]
            # 24 hex characters -> 12 two-character bytes, in the order
            # F0, F1_1, F1_2, F2_1, F2_2, F3, F4, F5, F6, F7, F8, F9
            hexBytes = [infoParts[i:i+2] for i in range(0, 24, 2)]
            # Zero-pad each byte to 8 binary digits, then reverse it so that
            # string index 0 is the least significant bit of that byte
            return "".join(bin(int(h, 16))[2:].rjust(8, "0")[::-1]
                           for h in hexBytes)
    return ""

So for example you could do the following:

vpBits = getBitsFromVP2(["VP=050060000a01000002110100"])

And if you want to examine this by byte, you could do that as well:

bytesArray = [vpBits[i:i+8] for i in range(0,96,8)]

Each string in the above array now corresponds to one byte in NCBI's documentation: ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_v5.5.pdf. Because the bits within each string follow the numbering used in that documentation, you can read a particular bit by indexing into the string for its byte rather than bitshifting on that byte.
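As a rough sketch of that kind of lookup (the helper name vp_to_reversed_bit_strings and the chosen byte/bit position are only illustrative assumptions; which flag a given bit encodes has to come from the NCBI PDF and is not asserted here):

```python
def vp_to_reversed_bit_strings(vp_hex):
    """Split a VP hex string into one 8-character bit string per byte,
    each reversed so that string index 0 is the least significant bit."""
    return [format(int(vp_hex[i:i + 2], 16), "08b")[::-1]
            for i in range(0, len(vp_hex), 2)]

bytes_array = vp_to_reversed_bit_strings("050060000a01000002110100")

# Hypothetical lookup: third byte (index 2), bit 5.
# The meaning of this bit must be read from the NCBI specification.
bit_is_set = bytes_array[2][5] == "1"
print(bytes_array[2], bit_is_set)
```

This uses format(..., "08b") in place of bin(...)[2:].rjust(8, "0"); the two produce the same zero-padded string.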

ADD COMMENTlink modified 7 months ago by RamRS26k • written 6.9 years ago by John St. John1.2k
1

I don't understand why you're reversing the bin representation of each field. Did you find that documented somewhere?

ADD REPLYlink written 6.6 years ago by brentp23k
1

I also wonder why the answer's author feels the need to reverse the bits. My best guess is that he is unfamiliar with the convention that, these days, bit ordering is typically most significant bit first. This is the convention followed in Python; for example, try

for i in range(9):
    print(i, bin(i))

This is why, in the PDFs linked here, the bits are indexed in decreasing order from top to bottom (a 90-degree rotation of how they would be indexed from right to left). I can see how this would be confusing, and dbSNP should provide more explicit documentation.
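A quick way to convince yourself that the two views agree (just a sketch, not something taken from the dbSNP documentation): indexing the reversed string at position k gives the same answer as shifting the integer right by k and masking with 1.

```python
value = 0x60  # third byte of the example VP string 050060000a01000002110100

msb_first = format(value, "08b")  # Python writes the most significant bit first
lsb_first = msb_first[::-1]       # reversed: string index k is now bit k

for k in range(8):
    # Reversed-string indexing and shift-and-mask pick out the same bit.
    assert int(lsb_first[k]) == (value >> k) & 1

print(msb_first, lsb_first)
```

So the reversal is not wrong, just an alternative to ordinary bitwise operations on the integer value.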

ADD REPLYlink modified 7 months ago by RamRS26k • written 5.7 years ago by Gotgenes460
3
9.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum127k wrote:

The bitField is handled in the NCBI C++ toolkit sources, in the snp_bitfield* files from http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/gui/objutils

You'll find here all the methods and functions to handle this field.

Update: From http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/gui/objutils/snp_bitfield_factory.cpp#L49, you can see a factory for a bitfield object. It returns an implementation of CSnpBitfield::IEncoding. Which implementation you get is a function of the size and version (= first character) of the bitfield, e.g. it could be a CSnpBitfield2: http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/gui/objutils/snp_bitfield_2.cpp#L80. In the constructor for this class, the string is parsed and an array of bytes is created.

Extracting the data from an array of bytes is a common task with the shift operator: see http://www.tutorialspoint.com/python/python_basic_operators.htm
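In Python the same idea might look roughly like the following. This is only a sketch and not a port of the NCBI C++ code; bit_is_set is a made-up helper name, and the byte/bit positions tested are arbitrary, so look up their meaning in the bitfield specification.

```python
def bit_is_set(data, byte_index, bit_index):
    """Return True if bit `bit_index` (0 = least significant) of
    byte `byte_index` is set, using the shift operator and a mask."""
    return bool((data[byte_index] >> bit_index) & 1)

# VP value from the question; bytes.fromhex turns each hex pair into one byte.
data = bytes.fromhex("050100000a0105051a000100")

print(list(data))
print(bit_is_set(data, 0, 0), bit_is_set(data, 0, 1), bit_is_set(data, 8, 4))
```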

ADD COMMENTlink modified 7 months ago by RamRS26k • written 9.4 years ago by Pierre Lindenbaum127k
1

Thanks for the link Pierre but I need to parse this bitfield data (using python) that is found in the vcf file's VP field and make sense out of it. Frankly this ftp site got me more confused about how to do this. Given the example above can you or someone else please elaborate a little bit more on how to do this practically? Thanks again.

ADD REPLYlink written 9.4 years ago by Biomed4.6k
1

Does anyone have info on decoding this field in Perl or BioPerl?

ADD REPLYlink written 9.4 years ago by Krisr460
1

Here is an up-to-date link for snp_bitfield_2.cpp: http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objtools/snputil/snp_bitfield_2.cpp

ADD REPLYlink written 6.9 years ago by John St. John1.2k
2
5.7 years ago by
cmdcolin1.3k
United States
cmdcolin1.3k wrote:

I was working on the dbSNP bitfield recently in JavaScript for a custom JBrowse implementation, and what I realized is that this isn't a single true bitfield. It's a 24-character string containing 12 separate bitfields, with each 2-character chunk being one byte encoded as hex. So my answer is very similar to the Python answer above, but I thought I'd reiterate it for anyone in the future:

var ncbi_encoding=parseInt(name.substr(0,2),16);
var ncbi_links_1=parseInt(name.substr(2,2),16);
var ncbi_links_2=parseInt(name.substr(4,2),16);
var gene_function1=parseInt(name.substr(6,2),16);
var gene_function2=parseInt(name.substr(8,2),16);
var mapping=parseInt(name.substr(10,2),16);
var frequency=parseInt(name.substr(12,2),16);
var genotype=parseInt(name.substr(14,2),16);
var hapmap=parseInt(name.substr(16,2),16);
var phenotype=parseInt(name.substr(18,2),16);
var var_class=parseInt(name.substr(20,2),16);
var quality=parseInt(name.substr(22,2),16);

Edit 2014-08-19: I did not need to reverse the bits as I originally posted. You can just use regular "numerical" bit operations on the variables above. The Python answer above converts each byte into an 8-character string in bytesArray, so, for example, bytesArray[0] is a string of length 8 whose characters run from the least significant bit on the left to the most significant bit on the right (which is pretty unnatural IMO). With the variables in my answer you can use bitwise operations directly instead.

For example, in the python answer for the VP string 050060000a01000002110100, you get

bytesArray= ['10100000', '00000000', '00000110', '00000000', '01010000', '10000000', '00000000', '00000000', '01000000', '10001000', '10000000', '00000000']

In mine, you have

[ncbi_encoding, ncbi_links_1, ncbi_links_2, gene_function1, gene_function2, mapping, frequency, genotype, hapmap, phenotype, var_class, quality]

which is

[5, 0, 96, 0, 10, 1, 0, 0, 2, 17, 1, 0]
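
For the original poster, who asked about Python, a rough equivalent might look like this. It is a sketch: the field names are simply the variable names used above, and the masks tested at the end are arbitrary examples rather than flags taken from the specification.

```python
FIELD_NAMES = [
    "ncbi_encoding", "ncbi_links_1", "ncbi_links_2",
    "gene_function1", "gene_function2", "mapping", "frequency",
    "genotype", "hapmap", "phenotype", "var_class", "quality",
]

def vp_fields(vp_hex):
    """Map each of the 12 byte names to its integer value."""
    return dict(zip(FIELD_NAMES, bytes.fromhex(vp_hex)))

fields = vp_fields("050060000a01000002110100")
print(list(fields.values()))

# Ordinary bitwise AND with a mask; 0x20 and 0x01 are arbitrary examples.
print(bool(fields["ncbi_links_2"] & 0x20), bool(fields["ncbi_links_2"] & 0x01))
```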
ADD COMMENTlink modified 7 months ago by RamRS26k • written 5.7 years ago by cmdcolin1.3k
1
9.2 years ago by
apfejes160
Vancouver
apfejes160 wrote:

Actually, I'm not sure it's worth extracting the information from that field in the first place. All of the information in that field is ALSO written in plain text formatting right next to the binary field, which makes it entirely redundant - and not worth the effort of parsing it.

eg:

1 10327 rs112750067 T C . . dbSNPBuildID=132;VP=050000020005000000000100;WGT=1;VC=SNP;R5;ASP

In the above case, the binary field, when deciphered should only give you back the WGT, VC, R5 and ASP flags.
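
If you only want the plain-text entries, something like the following sketch would pull them out of the INFO column (parse_info is a made-up helper, not part of any VCF library, and it assumes a simple semicolon-separated INFO string like the one above):

```python
def parse_info(info):
    """Split a VCF INFO string into a dict. Entries of the form key=value
    keep their value as a string; bare flags are stored as True."""
    parsed = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        parsed[key] = value if sep else True
    return parsed

info = "dbSNPBuildID=132;VP=050000020005000000000100;WGT=1;VC=SNP;R5;ASP"
parsed = parse_info(info)

# Bare flags present on this record, and the raw VP string for comparison.
flags = sorted(key for key, value in parsed.items() if value is True)
print(flags, parsed["VC"], parsed["VP"])
```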

So, yes, it can be done, and it's not hard, but for now, there's no point in dealing with the VP field at all.

ADD COMMENTlink modified 7 months ago by RamRS26k • written 9.2 years ago by apfejes160
1

Thanks for the useful answer.

ADD REPLYlink written 9.2 years ago by Biomed4.6k
1

That's not true, is it? The VP field tells you the function (STOP, frameshift, missense, etc.) and a number of other things that aren't in the text values.

ADD REPLYlink written 6.6 years ago by brentp23k
1

At the point I wrote it, they were using text flags that mirrored what was in the VP field, so it was redundant. That no longer appears to be the case.

ADD REPLYlink written 3.9 years ago by apfejes160