Hi Brian,
Thanks for your response. I think I've solved my problem using a few lines of bash. Previously I made a vcf from the dbSNP file, then used this to GATK-annotate the target vcf. With the current method, I just select a few columns. Lines from the input files, and my output are below the code. There is no automated validation, such as checking alleles, length of the output columns, or omitted variants. If you can see any easy improvements, that would be useful.
## From dbSNP information file:
## 1. Remove header and blank rows.
## 2. Select columns containing only old and new variant IDs (2+5).
## 3. Sort by original variant ID (column 1).
grep -v '^#' ${dbsnp_dat} | grep -v -e '^$' | awk '{OFS="\t";print $2, $5}' | sort -k1 > body_${dbsnp_dat}
## Join the input vcf with the dbsnp information, which matches the variant IDs:
## New variant IDs end up being the last column (232), which isn't terrible but it's not ideal.
join -t $'\t' -1 3 -2 1 body_${vcf}.txt body_${dbsnp_dat} > joined.txt
## Create first three columns of new vcf file (chromosome, position, variant ID):
## Note here I've made a variant ID column which is a conjugation of the old and new IDs, separated by a semi-colon.
awk '{OFS="\t";print $2,$3,$1";"$232}' joined.txt > first3cols.txt
## Remove unecessary columns from body of temp vcf (Takes a few minutes with 1.7m variants):
cut -f4-231 < joined.txt > other_cols.txt
## Combine the first columns with the remainder, and sort by chromosome and BP. Specification of tab seperator with -t 't' dependent on operating system.
paste first3cols.txt other_cols.txt | sort -k2 -k1 -t 't' > final_body.txt
## Now put header rows on the vcf.
Top rows of my output:
chr3R   15205653        hc_abcdi;rs881319980    G       GT      64059.90        PASS    AC=135;AF=0.311;AN=434;BaseQRankSum=-2.600e-01;ClippingRankSum=-2.800e-02;DP=5447;EFF=UPSTREAM(MODIFIER||3550||632|CG4203|protein_coding|CODING|FBtr0083048||GT),DOWNSTREAM(MODIFIER||2677||503|CG42542|protein_coding|CODING|FBtr03
chrX    19680803        hc_abcdk;rs206315230    G       T       9516.41 PASS    AC=17;AF=0.038;AN=444;BaseQRankSum=0.115;ClippingRankSum=0.181;DP=5421;EFF=UPSTREAM(MODIFIER||4912||645|et|protein_coding|CODING|FBtr0308571||T),UPSTREAM(MODIFIER||3005||547|Ubqn|protein_coding|CODING|FBtr0074750||T),INTRON(MODIFIER||||
Lines from my in-house vcf:
chr2L   21539   hc_nwpsm        T       TATA    10409.49        PASS    AC=49;AF=0.113;AN=434;BaseQRankSum=0.039;ClippingRankSum=0.00;DP=3241;EFF=UPSTREAM(MODIFIER||2970||1161|l_2_gl|protein_coding|CODING|FBtr0078170||TATA),DOWNSTREAM(MODIFIER||283||842|Ir21a|protein_coding|CODING|FBtr0113008||TATA),DOWNSTREAM(MODI
chr2L   22077   hc_nodcb        C       CCAT    123.81  PASS    AC=1;AF=2.252e-03;AN=444;BaseQRankSum=-5.360e-01;ClippingRankSum=-7.840e-01;DP=6341;EFF=CODON_INSERTION(MODERATE||atc/atcATG|I790IM|842|Ir21a|protein_coding|CODING|FBtr0113008|5|CCAT),UPSTREAM(MODIFIER||3508||1161|l_2_gl|protein_coding|CODING|FBtr00781
chr2L   22081   hc_munaj        G       GAAATAACA       65.81   PASS    AC=1;AF=2.252e-03;AN=444;BaseQRankSum=1.58;ClippingRankSum=-2.840e-01;DP=6430;EFF=FRAME_SHIFT(HIGH||gac/gaTGTTATTTc|D789DVI?|842|Ir21a|protein_coding|CODING|FBtr0113008|5|GAAATAACA),UPSTREAM(MODIFIER||3512||1161|l_2_gl|protein_coding|CODING|FBt
chr2L   25378   hc_fueqn        C       T       41734.96        PASS    AC=121;AF=0.275;AN=440;BaseQRankSum=0.157;ClippingRankSum=-2.440e-01;DP=4427;EFF=UPSTREAM(MODIFIER||223||842|Ir21a|protein_coding|CODING|FBtr0113008||T),DOWNSTREAM(MODIFIER||24||1998|Cda5|protein_coding|CODING|FBtr0078164||T),INTERGENIC(MODIFIE
chr2L   25727   hc_jamie        T       G       1342.13 PASS    AC=10;AF=0.023;AN=442;BaseQRankSum=-1.560e-01;ClippingRankSum=0.383;DP=4819;EFF=UTR_3_PRIME(MODIFIER||794||1998|Cda5|protein_coding|CODING|FBtr0078164|14|G),UPSTREAM(MODIFIER||572||842|Ir21a|protein_coding|CODING|FBtr0113008||G);FS=4.030;InbreedingCoef
The same lines from the file provided by dbSNP:
ss1985661068    hc_nwpsm        -/ATA   444     rs881281674     0       2L      21539   NT_033779.5     21539   0       Release_6_plus_ISO1_MT  1
ss1985661069    hc_nodcb        -/CAT   444     rs881341942     0       2L      22077   NT_033779.5     22077   0       Release_6_plus_ISO1_MT  1
ss1985661070    hc_munaj        -/AAATAACA      444     rs881205429     0       2L      22081   NT_033779.5     22081   0       Release_6_plus_ISO1_MT  1
ss1985661071    hc_fueqn        C/T     444     rs881216460     0       2L      25378   NT_033779.5     25378   0       Release_6_plus_ISO1_MT  1
ss1985661072    hc_jamie        G/T     444     rs881333198     0       2L      25727   NT_033779.5     25727   0       Release_6_plus_ISO1_MT  1
                    
                
                 
Hi Ashutosh and Alons, and apologies for the late response,
I used GATK AnnotateVariants the and it seemed to work. I'm using a different vcf now, and it fails, really for these reasons:
I thought about trying to import the two files into R and match the IDs there, but they're too big. My current alternative is to match the IDs using bash. I've used SNPeff for other things, and shall try with this task.
The two more frustrating things are that:
Anyway, thanks again for getting back to me, and I hope to post a solution soon.
I'm not sure whether I can help you, but would you mind posting examples of the problematic data? Specifically, a couple lines from your custom-made VCF, the dbSNP indel lines, etc. Basically, a couple lines from all of the relevant files, corresponding to the same mutations.
VCF has the incredibly irritating requirement of padding indels with an unrelated nearby base, which is why it differs from dbSNP, but it's not overly difficult to programatically convert from VCF->dbSNP (the other way is harder).