Question

Feasible Annotation of 4 million unique SNPs

0

6.8 years ago

serpalma.v ▴ 80

I ran the GATK tool ASEReadCounter to measure allele-specific expression (ASE) on 100 BAM input files. The output (ASE file) is one table per input file with read counts for the reference allele and the alternative allele (SNP). However, these files contain no annotations, which are required for later statistical analysis (e.g. calculating ASE per gene).

My approach was to extract the SNP ids from all 100 ASE files (filtered to a minimum depth of 10, which reduced file size significantly), concatenate them, and deduplicate them; I ended up with 4 million distinct SNP ids.
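For reference, the extraction step described above can be sketched in Python roughly as follows. This is only an illustration: it assumes the ASE tables are tab-separated with the `variantID` and `totalCount` columns that ASEReadCounter writes, and that the depth filter is applied to `totalCount`. The function name `collect_snp_ids` is made up for this example.

```python
import csv


def collect_snp_ids(ase_paths, min_depth=10):
    """Return the set of variant ids seen with at least min_depth reads.

    Assumes each file is a tab-separated ASEReadCounter table with a header
    row containing 'variantID' and 'totalCount' (an assumption about which
    column the depth filter uses).
    """
    ids = set()
    for path in ase_paths:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            for row in reader:
                if int(row["totalCount"]) >= min_depth:
                    # A set deduplicates ids across all input files.
                    ids.add(row["variantID"])
    return ids
```

Using a set means the concatenate-and-deduplicate step happens in one pass, without writing an intermediate file.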

Using a small list of SNPs, I worked out how to retrieve, for a given SNP id, the annotations contained in the reference VCF file, which has 88 million lines. For 10 distinct SNP ids the process took 15 minutes; scaled up to 4 million ids, that comes to about 11 years!
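A timing like this usually means the VCF is being rescanned once per id (that is a guess, since the lookup method is not shown here). One common alternative is to load all the ids into a set and stream through the VCF a single time, so the cost depends on the 88 million lines rather than on lines × ids. Below is a minimal sketch of that idea, not the method used in this thread; `filter_vcf_by_ids` and `open_text` are hypothetical helper names, and it assumes the SNP ids match the VCF `ID` column.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)


def filter_vcf_by_ids(vcf_path, wanted_ids, out_path):
    """Copy header lines and records whose ID column is in wanted_ids.

    Returns the number of records written. The VCF is read exactly once.
    """
    wanted = set(wanted_ids)  # set membership tests are constant time on average
    kept = 0
    with open_text(vcf_path) as vcf, open(out_path, "w") as out:
        for line in vcf:
            if line.startswith("#"):
                out.write(line)  # keep meta-information and column header
                continue
            # VCF columns start with CHROM, POS, ID; only the first three are needed.
            fields = line.split("\t", 3)
            if len(fields) < 3:
                continue
            # The ID column may hold several ids separated by ';'.
            if any(vid in wanted for vid in fields[2].split(";")):
                out.write(line)
                kept += 1
    return kept
```

The filtered VCF can then be joined back to the ASE tables by SNP id, or used as a much smaller input for an annotation tool.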

I also split the file of unique SNP ids into 4 chunks (1 million ids per chunk) and submitted each chunk to Ensembl's Variant Effect Predictor (VEP). The job has been running for 10 hours. I do not have high expectations for this one.

There must be a better way to do this that I am not aware of, so I would greatly appreciate your input on how to proceed.

I am also considering re-running the "count reads per allele" analysis with another tool that includes annotation (ASEReadCounter does not); I would also appreciate your suggestions on how to do this.

SNP ASE GATK • 2.0k views

score 2 · Answer 1 · 2017-07-14

6.8 years ago

Emily 23k

The VEP should work for files of that size; however, we don't recommend using the online tool for that number of variants. If you've got 4 million variants, I would recommend running the VEP script with an offline cache.

0

Thank you Emily. Besides the long time needed to complete the job, is there any other downside to running so many variants through the online tool?

Reply, 6.8 years ago by serpalma.v ▴ 80

1

There is minimal security on data submitted into the VEP, so if your data is of a sensitive nature you may also prefer to keep it in-house. Other than that, not really.

Reply, 6.8 years ago by Emily 23k

score 1 · Answer 2 · 2017-07-13

6.8 years ago

Maxime Lamontagne ★ 2.3k

I think ANNOVAR (http://annovar.openbioinformatics.org/en/latest/) could be useful.

0

Thank you very much Maxime. The annotations finished overnight with the Ensembl VEP tool. I will nevertheless keep ANNOVAR in mind for the future.

Reply, 6.8 years ago by serpalma.v ▴ 80