Question

Need code to filter VCF files based on rsid using Pysam

0

Entering edit mode

9.3 years ago

devenvyas ▴ 740

Hello, I need to collect a set of ~330,000 SNPs from a set of extended-VCF files (available here and here. The only information I have on these SNPs are there dbSNP rs#s.

I've tried to do this using vcftools, but vcftools f'ed up the extended format aspects of the VCF files, and thus the output was useless.

I have been told by one of the individuals who generated these original files that I absolutely cannot use vcftools or GATK on these files because they break the extended format. I have been told that I need to use a "a simple VCF parser (SAMtools library/tabix implementation, e.g. pysam)", but I am a noob and don't know how to do this or most of what this means. I've asked my Python/computational biology professor for help on this before, and we could not figure this out. There is more filtering beyond just filtering rsids (e.g., modernC-archaicT or modernG-archaicA sites), but I am holding off on that for now, since I want to get the simple task done first.

I need to get some Pysam python code to use as a template to filter these vcf files. Can someone please help. Thanks!

-Deven

vcf python dbSNP • 4.3k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by devenvyas ▴ 740

Ram · Answer 1 · 2015-01-16

0

Entering edit mode

9.3 years ago

Tariq Daouda ▴ 220

Hi Deven,

I never used pysam but I have written a simple VCF parser that you might find useful, it's part of pyGeno, but you won't need to import any genomes if you only want to use the parser.

from pyGeno.tools.parsers.VCFTools import VCFFile

f = VCFFile()
f.parse('hop.vcf.gz', *gziped=True*)

for line in f :
  print line
   #or print line['pos'] for the position of the SNP for example.

The full doc is here.

Could you give us more details about the filtering you need to perform, along with a few lines of the VCFs? And perhaps the finality of all this. pyGeno also natively supports dbSNP SNPs and allows you to create custom filters on these SNPs.

Cheers

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Tariq Daouda ▴ 220

0

Entering edit mode

For any given rsid, I need corresponding the whole line, the code looks like it will just print the whole line for every line it reads through. Do you know how I can get it to only print the line if it contains one of the rs ids?

The two sets of VCF files are here and here and they are broken up by chromosome. First, I need to go through all 22 autosomes for each and pare it down, so it is just covering the lines corresponding the ~330,000 rsids. Then I need to further filter out case where modern humans have a reference C or G and the archaic human has a T or A (respectively). This part gets confusing in how the reference/alternate coding works in the extended literature, so I am holding that filtering off until I've re-read through the methods literature.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by devenvyas ▴ 740

0

Entering edit mode

The files for the first link are quite big, and the second one does not work.

Try print line["rsId"] or whatever this column is named, that should give you the rs id for the current line. I am not sure what you mean by "confusing in how the reference/alternate". Couldn't you just use the ref given to you by dbSNP or the reference human genome?

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Tariq Daouda ▴ 220

0

Entering edit mode

http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/

print line["rsId"] would just get me the rsid for the line, what I need is for it the go through the sets of file and copy all of the line for only the lines containing the ~330,000 rsids.

I would imagine would need an if statement after for line in f: saying if line contains one of my rsid then print line, but my Python is really bad, and I don't know how to format that if statement. That is what I need help with at the present.

In terms of the reference sequence, that I can trust, but the alternative modern human allele data in the Neanderthal file (and in certain cases in the Denisovan file) is not actually stored in the alt column, they are stored in the INFO field under 1000gALT. Just ignore anything about filtering beyond the rsids for now, because I am still trying to figure out what I need to filter in the first place.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by devenvyas ▴ 740

0

Entering edit mode

The parser works the following way:

If the field you are looking for is on the first level, like pos, you do line["pos"], if it's in 'INFO' line["INFO"]["rsId"], if it's in 'FILTER' line["FILTER"]["rsId"]. If there's no value you will get a None. So it's simply

 for line in f :

  if line["INFO"]["rsId"] is not None :
    print line

To get a list of files in a folder, take a look a the glob module of python: glob.glob will get you the list of all files in a directory.

For the human reference, the answer is a little bit longer. I would suggest that you import the human reference genome into pyGeno(quickstart), retrieve the chromosome and then simply do allele = chromosome.sequence[pos]. Where position is the position of your SNP, that should give the allele. Then it's simply a matter of:

if allele == 'C' :
   ....
elif allele == 'G':
 ....

Hope that helps

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Tariq Daouda ▴ 220

0

Entering edit mode

You are understanding anything that I am saying.

I do not need any specific fields. I need entire lines, but I only need entire lines which contain specific rsids. The code you were trying to give me would copy any line that had any rsid irrespective of whether or not I wanted that rsid. The code you have been giving is not what I need.

For example, in every line of the file there is an ID column, where the rsid (if the site has one) is listed. What I need the code to do is to look in that ID column, and if the ID column contains one of the ~330,000 rsids, then it will print that line to an outfile. This is what I need, but I do not have a bloody clue how to write that if statement.

from pyGeno.tools.parsers.VCFTools import VCFFile

f = VCFFile()
f.parse('hop.vcf.gz', *gziped=True*)

for line in f :
  if line["ID"] "in the list of rsids"
   print out line

The if statement within the for is what I need help writing right now. Can you please help me code that?

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    AltaiNea
22    16050037    .    G    .    .    .    DP=1;HaplotypeScore=0.0000;MQ=0.00;MQ0=1    GT:A:C:G:T:IR    ./.:0,0:0,0:1,0:0,0:0
22    16050038    .    A    .    .    .    DP=2;HaplotypeScore=0.0000;MQ=0.00;MQ0=2    GT:A:C:G:T:IR    ./.:2,0:0,0:0,0:0,0:0
22    16050039    .    C    .    .    .    DP=3;HaplotypeScore=0.0000;MQ=0.00;MQ0=3    GT:A:C:G:T:IR    ./.:0,0:3,0:0,0:0,0:0
22    16050040    .    C    .    .    .    DP=3;HaplotypeScore=0.0000;MQ=0.00;MQ0=3    GT:A:C:G:T:IR    ./.:0,0:3,0:0,0:0,0:0
22    16050041    .    T    .    .    .    DP=4;HaplotypeScore=0.0000;MQ=0.00;MQ0=4    GT:A:C:G:T:IR    ./.:0,0:0,0:0,0:3,1:0
22    16050042    .    T    .    .    .    DP=4;HaplotypeScore=0.0000;MQ=0.00;MQ0=4    GT:A:C:G:T:IR    ./.:0,0:0,0:0,0:3,1:0
22    16050043    .    G    .    .    .    DP=4;HaplotypeScore=0.0000;MQ=0.00;MQ0=4    GT:A:C:G:T:IR    ./.:0,0:0,0:3,1:0,0:0
22    16050044    .    G    .    .    .    DP=4;HaplotypeScore=0.0000;MQ=0.00;MQ0=4    GT:A:C:G:T:IR    ./.:0,0:0,0:3,1:0,0:0
22    16050045    .    C    .    .    .    DP=6;HaplotypeScore=0.0000;MQ=0.00;MQ0=6    GT:A:C:G:T:IR    ./.:0,0:4,1:0,0:1,0:0
22    16050046    .    C    .    .    .    DP=6;HaplotypeScore=0.0000;MQ=0.00;MQ0=6    GT:A:C:G:T:IR    ./.:0,0:5,1:0,0:0,0:0
22    16050047    .    A    .    .    .    DP=11;HaplotypeScore=0.0000;MQ=0.00;MQ0=11    GT:A:C:G:T:IR    ./.:5,6:0,0:0,0:0,0:0
22    16050048    .    A    .    .    .    DP=11;HaplotypeScore=0.0000;MQ=0.00;MQ0=11    GT:A:C:G:T:IR    ./.:5,6:0,0:0,0:0,0:0
22    16050049    .    G    .    .    .    DP=11;HaplotypeScore=0.0000;MQ=0.00;MQ0=11    GT:A:C:G:T:IR    ./.:0,0:0,0:5,6:0,0:0
22    16050050    .    T    .    .    .    DP=12;HaplotypeScore=0.0000;MQ=0.00;MQ0=12    GT:A:C:G:T:IR    ./.:0,0:0,0:0,0:6,6:0

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by devenvyas ▴ 740

Ram · Answer 2 · 2015-01-17

0

Entering edit mode

9.3 years ago

devenvyas ▴ 740

Anyone else have any solutions, whether it be how to code that if statement from the post above (none of that code above is of any help if I can't get that if statement to work) or to address this by using PySam. (It is important to realize if I were to use PyGeno, I would need to go through the sometimes long process of getting HPC to install it)

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by devenvyas ▴ 740

Ram · Answer 3 · 2015-01-17

Ok, now it starts making more sense, do:

snpList = ["rs...", "rs...", "rs...", ...]
snpSet = set(snpList)

and then replace the previous loop by this one:

for line in f :
  if line["ID"] in snpSet:
   print line

It is always a good idea to convert a list to a set when you want to perform a lot of "in"s. The research is way faster.

Here's a code that, if I understood you correctly, should help you with the whole thing, including the comparison with the ref.

from pyGeno.tools.parsers.VCFTools import VCFFile
from pyGeno.Genome import *

ref = Genome(name = "GRCh37.75")

f = VCFFile()
f.parse('hop.vcf.gz', gziped=True)

snpList = ["rs...", "rs...", "rs...", ...]
snpSet = set(snpList)

for line in f :
  if line["ID"] in snpSet:
   chromosome = ref.get(number = line["#CHROM"])[0]
   alelle = chromosome.sequence[line["POS"]]

   if allele == 'C' :
     ....
   elif allele == 'G':
     ....

This is a full pyGeno example. If you need to install python packages locally you can do it easily by creating a virtualenv.

If you are unable to install anything locally you can simply download my parser from here and copy paste it in your current folder. No need to install pyGeno for that, the parser is independent from the rest.

Cheers