Question

List Of Dbsnp Variants Where Reference Genome Has Minor Allele

2

Entering edit mode

12.4 years ago

Ana Rodrigues Grant ▴ 30

While working on a couple of exome projects, I've ran into situations where for some of the variants we are calling, the reference genome allele is annotated in dbSNP as the minor allele (MAF<5%), and as such the variants are not so interesting to us.

Here is an example:rs7185 < http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=7185 > ref=C -> alt=T MAF/MinorAlleleCount: C=0.142/311

I would like to filter those out, but cannot find a simple file with that information... I am using the 00-All.vcf.gz from dbSNP to identify known variants, which lists whether some of the alleles are low frequency, but not which one is which...

I would like to be able to flag these variants - does someone know where I can get that data from, ideally somewhere that gets updated with dbSNP versions? Thank you so much!

reference dbsnp variant exome • 5.0k views

ADD COMMENT • link updated 4.1 years ago by Biostar 20 • written 12.4 years ago by Ana Rodrigues Grant ▴ 30

score 3 · Answer 1 · 2011-12-08

Generate some java classes using ${JAVA_HOME}/bin/xjc and compile the java program below:

 xjc "ftp://ftp.ncbi.nlm.nih.gov/snp/specs/docsum_3.3.xsd" 
 javac Biostar15272.java gov/nih/nlm/ncbi/snp/docsum/*.java

Source:

import gov.nih.nlm.ncbi.snp.docsum.*;

import java.util.ArrayList;
import java.io.InputStream;
import java.util.List;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

/**
 xjc "ftp://ftp.ncbi.nlm.nih.gov/snp/specs/docsum_3.3.xsd" 
 javac Biostar15272.java gov/nih/nlm/ncbi/snp/docsum/*.java
*/
public class Biostar15272
    {

    private void run(InputStream in) throws Exception
        {

        XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
        xmlInputFactory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        xmlInputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        xmlInputFactory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
        JAXBContext jaxbCtxt=JAXBContext.newInstance(Rs.class);
        Unmarshaller unmarshaller=jaxbCtxt.createUnmarshaller();

            XMLEventReader reader= xmlInputFactory.createXMLEventReader(in); 
            boolean headerPrinted=false;
            while(reader.hasNext())
                {
                XMLEvent evt=reader.peek();
                if(!evt.isStartElement())
                    {
                    reader.nextEvent();
                    continue;
                    }
                StartElement start=evt.asStartElement();
                String localName=start.getName().getLocalPart();

                if(!localName.equals("Rs"))
                    {
                    reader.nextEvent();
                    continue;
                    }

               Rs rs=unmarshaller.unmarshal(reader, Rs.class).getValue();
               for(Rs.Frequency f:rs.getFrequency())
                    {
                    System.out.println("rs"+rs.getRsId()+
                        "\t"+f.getFreq()+
                        "\t"+f.getAllele()
                        );
                    }
               }
            reader.close();
        }
    public static void main(String[] args)
        {
        try
            {
            new Biostar15272().runSystem.in);
            }
        catch (Exception e)
            {
            e.printStackTrace();
            }
        }
    }

and then invoke this parser for dbsnp to print the frequencies:

curl -s  "ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/XML/ds_ch22.xml.gz" |\
gunzip -c | java -cp . Biostar15272

rs783   0.388   G
rs805   0.3067  C
rs820   0.1901  G
rs1056  0.2984  G
rs1118  0.3528  C
rs1119  0.4817  G
rs1222  0.1161  C
rs1312  0.4685  C
rs1314  0.1229  G
rs1315  0.1499  C
rs1317  0.1499  T
rs1320  0.0297  A
rs1321  0.3441  C
rs1608  0.1024  T
rs1654  0.1559  G
rs1656  0.1878  G
rs1682  0.1522  C
rs1803  0.2399  T
rs1858  0.0347  T
rs1859  0.0119  T
rs1860  0.3259  A
rs1866  0.308   T
rs1878  0.3729  A
rs1900  0.1463  C
rs1916  0.3286  C
rs1947  0.0718  C
rs1953  0.1622  T
rs2387  0.3048  T
rs2393  0.0178  C
rs2481  0.2893  A
rs2564  0.1545  C
rs2731  0.3857  G
rs2782  0.362   T
rs2913  0.3921  G
rs3179  0.4735  C
rs3233  0.2715  C
(...)

score 2 · Answer 2 · 2011-12-09

2

Entering edit mode

12.4 years ago

User 59 13k

You might also want to take a look at this PLoS paper from Dewey et al. where they create synthetic 'major allele' references for various populations.

ADD COMMENT • link 12.4 years ago by User 59 13k

0

Entering edit mode

Key here is "for various populations" because what is a minor allele in one population could be major in another. The reference genome could represent some degree of admixture, for example, and so that "minor" allele may be major from the population contributing to the admixture.

ADD REPLY • link 12.4 years ago by Larry_Parnell 16k

0

Entering edit mode

Interesting paper Daniel - thanks! I don't have population information for the current datasets... for now, I am planning to filter these particular alleles only when they are also annotated as G5A (i.e. reference genome is minor allele, and minor allele is <5% in all populations).

ADD REPLY • link 12.4 years ago by Ana Rodrigues Grant ▴ 30

0

Entering edit mode

Interesting paper Daniel - thanks! I don't have population information for the current datasets... for now, I am planning to tag these particular alleles and use that in combination with the G5A annotation to de-prioritize these variants (i.e. reference genome is minor allele, and minor allele is <5% in all populations, then the sample allele is not that interesting to us).

ADD REPLY • link 12.4 years ago by Ana Rodrigues Grant ▴ 30

Ram · Answer 3 · 2015-12-23

1

Entering edit mode

8.3 years ago

mike ▴ 60

@Pierre, many thanks for your post. I've actually got nothing after running

curl -s "ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/XML/ds_ch22.xml.gz" |\
gunzip -c | java -cp . Biostar15272

Do you know any reason? I've just followed your descriptions above.

Thanks for your comment.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by mike ▴ 60

0

Entering edit mode

It's a problem in the new files from the NCBI, I've just left a message to the helpdesk:

Hi NCBI,

I think there is an error in the XML / XSD for dbsnp.

in "ftp://ftp.ncbi.nlm.nih.gov/snp/specs/docsum_3.4.xsd" the target Namespace is
targetNamespace="https://www.ncbi.nlm.nih.gov/SNP/docsum"
  
while in the XML it 'http', not 'httpS'
   curl -s  "ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/XML/ds_ch22.xml.gz" | gunzip -c
   
   <ExchangeSet xmlns:xsi="&lt;a href=" http:="" www.w3.org="" 2001="" XMLSchema-instance"="" rel="nofollow">http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum"
  

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks a lot. Could you maybe also update when NCBI replied? It's really appreciated.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by mike ▴ 60

0

Entering edit mode

no, and don't expect an answer during those days :-)

ADD REPLY • link 8.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

here is an update where I used sed to convert 'http' to 'https'

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks so much. It did help!

ADD REPLY • link 8.3 years ago by mike ▴ 60

0

Entering edit mode

This looks great!

ADD REPLY • link 8.3 years ago by genefan • 0

0

Entering edit mode

furthermore, the schema have changed since 4 years.

ADD REPLY • link 8.3 years ago by Pierre Lindenbaum 161k