Question: Get Rs Number Based On Position (6 million SNPs)
0
gravatar for joshf
4.0 years ago by
joshf20
Netherlands
joshf20 wrote:

I know this question has sort of been asked before....but I need to know which method would be the most efficient way to get the Rs numbers based on position (hg19)

I've considered looping through two files, the .txt file (with the positions) and a .vcf file with all known variants from Kaviar Genomic Variant Database, locally...but that would take forever...

would installing a partial UCSC genome MySQL database locally be a better idea? 

Any suggestion would be great...be as detailed as possible pls :).

PS: This .txt file is an output from METAL, and unfortunately I need all 6.4M SNPs for my project at this point

snp • 3.2k views
ADD COMMENTlink modified 2.4 years ago by Bioaln310 • written 4.0 years ago by joshf20
1
gravatar for cpad0112
4.0 years ago by
cpad011211k
India
cpad011211k wrote:

I have following questions:

1) Do you have variants (~6 m SNVs) annotated with dbSNP?

2) What are the outputs supported by METAL and Would you please share a link to METAL?
 

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by cpad011211k

well, I only received the METAL output TXT file so I don't actually know how it works, but it has the chromosomal position and p-value which are important to me. Usually I would have to filter by p-value, but for this particular project I can't....and no, nothing's been annotated 

 

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by joshf20

Thanks for the reply. From the link you have provided, it seems rsnumbers are provided in first column of file " 'METAANALYSIS1.TBL" (section 5.5). Could you please post first few lines of the output text here?

1) I guess you are looking for rs positions given rs id as rs ids are already present in output (as mentioned in the link provided above).

2) Your question seems to be other way round. Given a position, you are looking for rsid.

 

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by cpad011211k

in this case instead of the rs numbers I was given chr:pos as the SnpID. I have the positions, but what I'm missing are the rs numbers...sorry for the confusion, and thanks for your help!

first 10 lines:

SNPID Allele1 Allele2 Freq1 FreqSE MinFreq MaxFreq Effect StdErr P Direction HetISq HetChiSq HetDf HetPVal CHR BP

5:29439275 t c 0.3751 0.0198 0.3168 0.4092 0.0100 0.0213 0.6397 +---+--+-+-+- 0.0 8.190 12 0.7701 5 29439275
5:85928892 t c 0.0643 0.0036 0.0626 0.0782 0.0916 0.0548 0.09478 +????+??-?-?? 45.1 5.462 3 0.1409 5 85928892
11:79226604 t c 0.1580 0.0117 0.1103 0.1875 -0.0301 0.0270 0.2644 -+---+-+----? 0.0 7.652 11 0.7441 11 79226604
10:45271908 c g 0.1391 0.0082 0.1056 0.1652 -0.0076 0.0296 0.7965 -+-?-+++----? 0.0 7.870 10 0.6415 10 45271908
15:80233956 t c 0.2009 0.0172 0.1191 0.2224 -0.0030 0.0256 0.9057 ++++----++--? 0.0 6.415 11 0.8443 15 80233956
10:128341232 t c 0.4460 0.0196 0.4298 0.5305 -0.0120 0.0205 0.5564 +------++++-- 15.9 14.269 12 0.2839 10 128341232
8:13397768 a g 0.8479 0.0100 0.8204 0.8662 0.0050 0.0284 0.8615 +-+---++++++? 11.4 12.413 11 0.3334 8 13397768
7:156821965 a g 0.0357 0.0039 0.0310 0.0390 -0.0336 0.0915 0.7138 -?????????-?? 0.0 0.040 1 0.8417 7 156821965
16:82370655 a g 0.1119 0.0062 0.1036 0.1267 -0.0205 0.0351 0.5589 ----?++-+?+-? 0.0 7.649 9 0.5698 16 82370655

 

ADD REPLYlink written 4.0 years ago by joshf20

1) Your data is not sorted by chromosome and coordinates.

2) Here is the way I see a simple, but round about work:

  • Data has around 17 columns. Extract chromosome number, coordinate, bases (ref and alt), in that order into a separate text file. 
  • Now you will have chromosome number, base position, two bases (ref and alt) in new file.
  • Duplicate base position and paste it next to existing base position column. Now the file will have five columns: Chromosome, Base position, Base position, Ref base and alt base
  • Download genome build specific dbSNP file.
  • Install bedtools.
  • Use intersect command. This would not only fetch the dbSNP IDs for your variants, it would fetch other important information as well from dbSNP.

Following is the example code (on linux):

1) I copy/pasted first three lines (variants on chr5, including header) from furnished example (above) and saved it as chr5.txt
2) Using awk i extracted chromosome, position, ref and alt alleles. In the process I duplicated position column twice. (this is to make a bed file)

 command: awk '{print $16,$17,$17, $2,$3}' chr5.txt > chr5.bed

3) Made sure that each column is separated by a tab as the original columns are not tab separated

4) Deleted the header line.

command: tail -n +2 chr5.bed > chr5_1.bed

5) Used bedtools to intersect chr5_1.bed file with chr5.dbsnp141.hg19.vcf. (note that output file is not supplied. User can provide an output file to save the results) 

command: bedtools intersect  -a chr5_1.bed -b chr5.dbsnp141.vcf -wb

6) Output is given below.

5    29439275    29439275    t    c    5    29439275    rs667647    T    C    .    .    RS=667647;RSPOS=29439275;RV;dbSNPBuildID=83;SSR=0;SAO=0;VP=0x050100000005170517000100;WGT=1;VC=SNV;SLO;ASP;VLD;G5A;G5;HD;GNO;KGPhase1;KGPROD;OTHERKG;PH3;CAF=[0.5037,0.4963];COMMON=1
5    85928892    85928892    t    c    5    85928892    rs113534962    C    T    .    .    RS=113534962;RSPOS=85928892;dbSNPBuildID=132;SSR=0;SAO=0;VP=0x050000000005140116000100;WGT=1;VC=SNV;ASP;VLD;GNO;KGPhase1;KGPROD;OTHERKG;CAF=[0.8898,0.1102];COMMON=1

7) Original records I started with:

SNPID Allele1 Allele2 Freq1 FreqSE MinFreq MaxFreq Effect StdErr P Direction HetISq HetChiSq HetDf HetPVal CHR BP
5:29439275 t c 0.3751 0.0198 0.3168 0.4092 0.0100 0.0213 0.6397 +---+--+-+-+- 0.0 8.190 12 0.7701 5 29439275
5:85928892 t c 0.0643 0.0036 0.0626 0.0782 0.0916 0.0548 0.09478 +????+??-?-?? 45.1 5.462 3 0.1409 5 85928892
ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by cpad011211k

yes, thank you so much for your help! this was exactly what I was looking for

ADD REPLYlink written 4.0 years ago by joshf20

dbSNP files are huge and it would take considerable time for intersecting two big files.  If you have time and familiar with R, try data.table. Authors claim to it to be fastest in intersection/overlaps. I would suggest recently implemented foverlaps function in data.table package.

ADD REPLYlink written 4.0 years ago by cpad011211k
0
gravatar for Pierre Lindenbaum
4.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

download the dbsnp VCF and query with tabix.

ADD COMMENTlink written 4.0 years ago by Pierre Lindenbaum122k

could you elaborate a bit? how can I use tabix to get the rs numbers from the VCF file?

ADD REPLYlink written 4.0 years ago by joshf20
0
gravatar for Veera
4.0 years ago by
Veera 90
Denmark
Veera 90 wrote:

Download the coordinates BED file from UCSC according to your build.
Use GenomicRanges package from bioconductor in R to overlap the your position file and the downloaded bed file.
Subset only the SNPs from overlapped region.

But you need to know to use GenomicRanges package for this. Its very fast and easy to do.

 

 

 

ADD COMMENTlink written 4.0 years ago by Veera 90
0
gravatar for ShirleyDai
2.4 years ago by
ShirleyDai40
ShirleyDai40 wrote:

Hello, I just saw your output from METAL has the column "CHR" and "BP". Can you post you metal script as how to add these two variables to the result file. Because I have searched the whole website but cannot find a solution, so I came here to ask if you can help... Thanks Shirley

ADD COMMENTlink written 2.4 years ago by ShirleyDai40
0
gravatar for Bioaln
2.4 years ago by
Bioaln310
France
Bioaln310 wrote:

To avoid the double looping, try and hash parts of the genome for easier access (key region, depends on your RAM, in limit there is a key->acc mapping). This can reduce the computational time significantly.

ADD COMMENTlink written 2.4 years ago by Bioaln310
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1691 users visited in the last hour