Get Rs Number Based On Position (6 million SNPs)
5
0
Entering edit mode
5.5 years ago
joshf ▴ 20

I know this question has sort of been asked before....but I need to know which method would be the most efficient way to get the Rs numbers based on position (hg19)

I've considered looping through two files, the .txt file (with the positions) and a .vcf file with all known variants from Kaviar Genomic Variant Database, locally...but that would take forever...

would installing a partial UCSC genome MySQL database locally be a better idea?

Any suggestion would be great...be as detailed as possible pls :).

PS: This .txt file is an output from METAL, and unfortunately I need all 6.4M SNPs for my project at this point

SNP • 4.1k views
1
Entering edit mode
5.5 years ago

I have following questions:

1) Do you have variants (~6 m SNVs) annotated with dbSNP?

2) What are the outputs supported by METAL and Would you please share a link to METAL?

0
Entering edit mode

well, I only received the METAL output TXT file so I don't actually know how it works, but it has the chromosomal position and p-value which are important to me. Usually I would have to filter by p-value, but for this particular project I can't....and no, nothing's been annotated

0
Entering edit mode

Thanks for the reply. From the link you have provided, it seems rsnumbers are provided in first column of file " 'METAANALYSIS1.TBL" (section 5.5). Could you please post first few lines of the output text here?

1) I guess you are looking for rs positions given rs id as rs ids are already present in output (as mentioned in the link provided above).

2) Your question seems to be other way round. Given a position, you are looking for rsid.

0
Entering edit mode

in this case instead of the rs numbers I was given chr:pos as the SnpID. I have the positions, but what I'm missing are the rs numbers...sorry for the confusion, and thanks for your help!

first 10 lines:

SNPID Allele1 Allele2 Freq1 FreqSE MinFreq MaxFreq Effect StdErr P Direction HetISq HetChiSq HetDf HetPVal CHR BP

5:29439275 t c 0.3751 0.0198 0.3168 0.4092 0.0100 0.0213 0.6397 +---+--+-+-+- 0.0 8.190 12 0.7701 5 29439275
5:85928892 t c 0.0643 0.0036 0.0626 0.0782 0.0916 0.0548 0.09478 +????+??-?-?? 45.1 5.462 3 0.1409 5 85928892
11:79226604 t c 0.1580 0.0117 0.1103 0.1875 -0.0301 0.0270 0.2644 -+---+-+----? 0.0 7.652 11 0.7441 11 79226604
10:45271908 c g 0.1391 0.0082 0.1056 0.1652 -0.0076 0.0296 0.7965 -+-?-+++----? 0.0 7.870 10 0.6415 10 45271908
15:80233956 t c 0.2009 0.0172 0.1191 0.2224 -0.0030 0.0256 0.9057 ++++----++--? 0.0 6.415 11 0.8443 15 80233956
10:128341232 t c 0.4460 0.0196 0.4298 0.5305 -0.0120 0.0205 0.5564 +------++++-- 15.9 14.269 12 0.2839 10 128341232
8:13397768 a g 0.8479 0.0100 0.8204 0.8662 0.0050 0.0284 0.8615 +-+---++++++? 11.4 12.413 11 0.3334 8 13397768
7:156821965 a g 0.0357 0.0039 0.0310 0.0390 -0.0336 0.0915 0.7138 -?????????-?? 0.0 0.040 1 0.8417 7 156821965
16:82370655 a g 0.1119 0.0062 0.1036 0.1267 -0.0205 0.0351 0.5589 ----?++-+?+-? 0.0 7.649 9 0.5698 16 82370655

0
Entering edit mode

1) Your data is not sorted by chromosome and coordinates.

2) Here is the way I see a simple, but round about work:

• Data has around 17 columns. Extract chromosome number, coordinate, bases (ref and alt), in that order into a separate text file.
• Now you will have chromosome number, base position, two bases (ref and alt) in new file.
• Duplicate base position and paste it next to existing base position column. Now the file will have five columns: Chromosome, Base position, Base position, Ref base and alt base
• Install bedtools.
• Use intersect command. This would not only fetch the dbSNP IDs for your variants, it would fetch other important information as well from dbSNP.

Following is the example code (on linux):

1) I copy/pasted first three lines (variants on chr5, including header) from furnished example (above) and saved it as chr5.txt
2) Using awk i extracted chromosome, position, ref and alt alleles. In the process I duplicated position column twice. (this is to make a bed file)

 command: awk '{print $16,$17,$17,$2,\$3}' chr5.txt > chr5.bed

3) Made sure that each column is separated by a tab as the original columns are not tab separated

command: tail -n +2 chr5.bed > chr5_1.bed

5) Used bedtools to intersect chr5_1.bed file with chr5.dbsnp141.hg19.vcf. (note that output file is not supplied. User can provide an output file to save the results)

command: bedtools intersect  -a chr5_1.bed -b chr5.dbsnp141.vcf -wb

6) Output is given below.

5    29439275    29439275    t    c    5    29439275    rs667647    T    C    .    .    RS=667647;RSPOS=29439275;RV;dbSNPBuildID=83;SSR=0;SAO=0;VP=0x050100000005170517000100;WGT=1;VC=SNV;SLO;ASP;VLD;G5A;G5;HD;GNO;KGPhase1;KGPROD;OTHERKG;PH3;CAF=[0.5037,0.4963];COMMON=1
5    85928892    85928892    t    c    5    85928892    rs113534962    C    T    .    .    RS=113534962;RSPOS=85928892;dbSNPBuildID=132;SSR=0;SAO=0;VP=0x050000000005140116000100;WGT=1;VC=SNV;ASP;VLD;GNO;KGPhase1;KGPROD;OTHERKG;CAF=[0.8898,0.1102];COMMON=1

7) Original records I started with:

SNPID Allele1 Allele2 Freq1 FreqSE MinFreq MaxFreq Effect StdErr P Direction HetISq HetChiSq HetDf HetPVal CHR BP
5:29439275 t c 0.3751 0.0198 0.3168 0.4092 0.0100 0.0213 0.6397 +---+--+-+-+- 0.0 8.190 12 0.7701 5 29439275
5:85928892 t c 0.0643 0.0036 0.0626 0.0782 0.0916 0.0548 0.09478 +????+??-?-?? 45.1 5.462 3 0.1409 5 85928892
0
Entering edit mode

yes, thank you so much for your help! this was exactly what I was looking for

0
Entering edit mode

dbSNP files are huge and it would take considerable time for intersecting two big files.  If you have time and familiar with R, try data.table. Authors claim to it to be fastest in intersection/overlaps. I would suggest recently implemented foverlaps function in data.table package.

0
Entering edit mode
5.5 years ago

0
Entering edit mode

could you elaborate a bit? how can I use tabix to get the rs numbers from the VCF file?

0
Entering edit mode
5.5 years ago
Veera ▴ 90

Use GenomicRanges package from bioconductor in R to overlap the your position file and the downloaded bed file.
Subset only the SNPs from overlapped region.

But you need to know to use GenomicRanges package for this. Its very fast and easy to do.

0
Entering edit mode
4.0 years ago
ShirleyDai ▴ 40

Hello, I just saw your output from METAL has the column "CHR" and "BP". Can you post you metal script as how to add these two variables to the result file. Because I have searched the whole website but cannot find a solution, so I came here to ask if you can help... Thanks Shirley

0
Entering edit mode
4.0 years ago
Bioaln ▴ 340

To avoid the double looping, try and hash parts of the genome for easier access (key region, depends on your RAM, in limit there is a key->acc mapping). This can reduce the computational time significantly.