I know this question has sort of been asked before, but I need to know the most efficient way to get rs numbers based on position (hg19).
I've considered looping through two files locally: the .txt file (with the positions) and a .vcf file with all known variants from the Kaviar Genomic Variant Database. But that would take forever.
Would installing a partial UCSC genome MySQL database locally be a better idea?
Any suggestion would be great. Please be as detailed as possible :).
PS: This .txt file is an output from METAL, and unfortunately I need all 6.4M SNPs for my project at this point
Well, I only received the METAL output TXT file, so I don't actually know how it works, but it has the chromosomal position and p-value, which are what matter to me. Usually I would filter by p-value, but for this particular project I can't. And no, nothing's been annotated.
Thanks for the reply. From the link you provided, it seems rs numbers are given in the first column of the file "METAANALYSIS1.TBL" (section 5.5). Could you please post the first few lines of the output text here?
1) I guess you are looking for positions given an rs ID, since rs IDs are already present in the output (as mentioned in the link above).
2) But your question seems to be the other way round: given a position, you are looking for the rs ID.
In this case, instead of the rs numbers, I was given chr:pos as the SnpID. I have the positions, but what I'm missing are the rs numbers. Sorry for the confusion, and thanks for your help!
first 10 lines:
1) Your data is not sorted by chromosome and coordinate.
2) Here is a simple, if roundabout, way to do it:
The following is example code (on Linux):
1) I copied the first three lines (variants on chr5, including the header) from the example posted above and saved them as chr5.txt.
2) Using awk, I extracted the chromosome, position, and ref and alt alleles. In the process I wrote the position column twice, as start and end (this is to make a BED file).
3) Made sure that the columns are tab-separated, since the original columns are not.
4) Deleted the header line.
5) Used bedtools to intersect the chr5_1.bed file with chr5.dbsnp141.hg19.vcf (note that no output file is specified here; you can redirect the output to a file to save the results).
6) Output is given below.
7) Original records I started with:
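For anyone without awk at hand, steps 2 to 4 above can be sketched in Python. This is only a rough equivalent, not the commands used in this answer. It assumes the first column holds the marker as chr:pos and that the next two columns hold the alleles; the function names are made up for illustration. Unlike step 2, it writes start = pos - 1 and end = pos, which is the usual 0-based, half-open BED convention.

```python
def marker_to_bed(line):
    """Convert one whitespace-separated row into a tab-separated BED row.

    Assumed layout (not confirmed by the thread): field 0 is 'chr:pos',
    fields 1 and 2 are the two alleles.
    """
    fields = line.split()
    chrom, pos = fields[0].split(":")[:2]
    pos = int(pos)
    # BED is 0-based and half-open, so a single base at 1-based
    # position `pos` becomes the interval [pos - 1, pos).
    # The chromosome name is kept as-is; it must match the naming
    # used in the VCF (e.g. '5' versus 'chr5').
    return "\t".join([chrom, str(pos - 1), str(pos), fields[1], fields[2]])


def metal_to_bed(lines):
    """Skip the header line and convert every remaining non-empty row."""
    rows = iter(lines)
    next(rows, None)  # step 4: drop the header
    return [marker_to_bed(row) for row in rows if row.strip()]
```

To use it on a file, pass the open file handle to `metal_to_bed` and write the returned rows out one per line; the result can then be given to bedtools as in step 5.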
yes, thank you so much for your help! this was exactly what I was looking for
dbSNP files are huge, and intersecting two big files would take considerable time. If you have time and are familiar with R, try data.table. Its authors claim it is among the fastest at intersections/overlaps. I would suggest the recently implemented foverlaps function in the data.table package.
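If R is not an option, the same concern about speed can be addressed with a plain hash lookup. This is a different technique from foverlaps, and only a minimal sketch: it loads the chr:pos markers into a set, then reads the large VCF once and keeps the ID of every record whose position is in that set. It assumes the marker is the first column of the METAL file, that the VCF carries rs IDs in its third (ID) column, and that the set of positions fits in memory; the function names are hypothetical.

```python
def strip_chr(chrom):
    """Normalise 'chr5' and '5' to the same key."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def load_positions(metal_lines):
    """Collect (chromosome, position) pairs from 'chr:pos' markers."""
    rows = iter(metal_lines)
    next(rows, None)  # skip the header line
    wanted = set()
    for row in rows:
        if row.strip():
            chrom, pos = row.split()[0].split(":")[:2]
            wanted.add((strip_chr(chrom), int(pos)))
    return wanted


def find_rsids(wanted, vcf_lines):
    """Scan the VCF once; return {(chrom, pos): [IDs found there]}."""
    hits = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # VCF meta and header lines
        # VCF columns are tab-separated: CHROM, POS, ID, REF, ALT, ...
        chrom, pos, rsid = line.rstrip("\n").split("\t", 3)[:3]
        key = (strip_chr(chrom), int(pos))
        if key in wanted:
            # Several records can share one position, so keep a list.
            hits.setdefault(key, []).append(rsid)
    return hits
```

Both functions accept any iterable of lines, so open file handles work directly. Matching here is by position only; if more than one ID comes back for a position, the alleles would still need to be compared to pick the right one.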