Sema4, a health information company, is seeking talented, self-motivated individuals to join its Bioinformatics R&D department and take part in leading-edge work in big data analysis, clinical diagnostics, and translational bioinformatics. Successful applicants will be part of an interdisciplinary team that develops computational databases and methods to annotate and interpret large-scale human genome and exome sequencing data, with the goal of better understanding cancer mutations and the genetics of Mendelian and complex diseases. They will also play a role in developing systems that integrate novel informatics and genomic tools and methodologies into clinical practice.
Responsibilities:
- Build and maintain comprehensive variant databases from a wide variety of public repositories.
- Identify new data sources and databases from the literature.
- Build and maintain a comprehensive variant store from over 100,000 exomes.
- Assist bioinformatics scientists in integrating different types of genetic, functional, and clinical data to discover causal variants and genes for cardiovascular diseases, Alzheimer's disease, cancer, and other genetic diseases.
Requirements:
- Must have a strong genomic research background.
- Extensive experience with RDBMS, SQL programming (especially schema design), and ETL processes.
- Strong coding proficiency in Python, R, and Perl programming languages in a Linux environment.
- Hands-on experience building biomedical databases from public repositories, such as UniProt, dbSNP, MEDLINE, GTEx, 1000 Genomes, UK10K, ClinVar, and COSMIC.
- Domain knowledge in genetics and genomics, especially data representation and conventions for exchanging information about genetic variants.
- Hands-on experience working with NGS and genotyping tools and data/file formats, especially VCF.
- 2 years of post-graduate experience in the above categories.
Desirable experience:
- Experience with Hadoop (Impala/Parquet, Spark/Shark) and programming in Java/Scala.
- Experience with clinical genetic testing is a plus.
- Developing codebases using distributed version control tools (especially Git or Mercurial) and software issue tracking systems (especially JIRA).
- Deploying jobs/pipelines on a high-performance Linux computing cluster.
Location: Stamford, CT
Contact: Christine.fulton@sema4genomics.com