Generate a file with SNPs given a WGS dataset
1
0
29 days ago
iibrams07 • 0

Given whole-genome sequencing (WGS) or whole-exome sequencing datasets of the human genome, I need to generate a list of all the SNPs present in each dataset separately. Ideally the SNPs should be well annotated with the genes they belong to, their coordinates, and possibly additional attributes. The file should be in .csv or a similar format, since I need to load it and manipulate the data in it.

How should I proceed?

Many thanks.

annotation snp ngs • 538 views
1
29 days ago
Dave Carlson ▴ 620

There are a lot of ways to accomplish this general workflow, but my suggestion would be something along the following lines:

1. Map reads to the human reference genome with BWA MEM
2. Produce a BAM file for each sample after sorting by position with samtools
3. Mark duplicates with Picard
4. Jointly call variants for all samples with GATK, following their best practices pipeline.
5. Annotate the variants using GATK's Funcotator, SnpEff, or Annovar
6. Depending on what specifically you want to do next, process the annotated VCF using vcftools, bcftools, or maybe PLINK to get additional information
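
As a rough sketch of steps 1 to 4, the Python snippet below only assembles the command lines and prints them; it does not run anything. The file names, the read-group values, and the helper names (build_sample_commands, build_joint_commands) are made up for illustration. It assumes the reference has already been indexed (bwa index, samtools faidx, gatk CreateSequenceDictionary) and follows the germline route through per-sample GVCFs, so treat it as a starting point rather than the official pipeline.

```python
import shlex


def build_sample_commands(sample, ref, fq1, fq2, threads=4):
    """Return the commands for steps 1-4 for one sample; nothing is executed."""
    # bwa expects the literal two characters "\t" inside the read-group string.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    metrics = f"{sample}.dup_metrics.txt"
    gvcf = f"{sample}.g.vcf.gz"
    return [
        # 1. Map paired-end reads to the reference.
        ["bwa", "mem", "-t", str(threads), "-R", read_group, "-o", sam, ref, fq1, fq2],
        # 2. Sort by coordinate and write a BAM file.
        ["samtools", "sort", "-o", sorted_bam, sam],
        # 3. Mark duplicates (Picard tool bundled with GATK4) and index the output.
        ["gatk", "MarkDuplicates", "-I", sorted_bam, "-O", dedup_bam,
         "-M", metrics, "--CREATE_INDEX", "true"],
        # 4a. Call variants per sample in GVCF mode.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", dedup_bam, "-O", gvcf, "-ERC", "GVCF"],
    ]


def build_joint_commands(ref, gvcfs, cohort="cohort"):
    """Return the commands for step 4b: combine the GVCFs and genotype them jointly."""
    combined = f"{cohort}.g.vcf.gz"
    combine = ["gatk", "CombineGVCFs", "-R", ref]
    for gvcf in gvcfs:
        combine += ["-V", gvcf]
    combine += ["-O", combined]
    genotype = ["gatk", "GenotypeGVCFs", "-R", ref, "-V", combined,
                "-O", f"{cohort}.vcf.gz"]
    return [combine, genotype]


if __name__ == "__main__":
    # Hypothetical sample and file names.
    commands = build_sample_commands("S1", "ref.fa", "S1_R1.fq.gz", "S1_R2.fq.gz")
    commands += build_joint_commands("ref.fa", ["S1.g.vcf.gz", "S2.g.vcf.gz"])
    for cmd in commands:
        print(shlex.join(cmd))
```

The printed lines can be pasted into a shell script or passed one at a time to subprocess.run once the paths match your data.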
0

@Dave Carlson, many thanks. Once at step 6, how do I keep only the SNPs and discard the other variants? I need to write this collection to a file, perhaps a text file. The best practices pipeline you are referring to covers germline SNPs; there is no SNP-based pipeline for the somatic case. Do you know of a resource that provides the command-line code to proceed step by step toward this goal?

1

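To exclude all but SNPs from your VCF file you can use GATK's SelectVariants tool with --select-type-to-include SNP, or bcftools view with --types snps (the --skip-variants indels option belongs to bcftools call, so it applies at calling time rather than to an existing VCF).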
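
If you would rather go straight from an annotated VCF to a .csv file, the same filter can be sketched in plain Python. This is a minimal illustration, not a replacement for the tools above: it assumes an uncompressed, tab-separated VCF, treats a record as a SNP only when REF and every ALT allele is a single A/C/G/T base, and copies the INFO column (where annotators such as SnpEff store their output) unparsed. The function names and output columns are my own choices.

```python
import csv
import io

BASES = {"A", "C", "G", "T"}


def is_snp(ref, alt):
    """True when REF and every comma-separated ALT allele is one A/C/G/T base."""
    return ref.upper() in BASES and all(a.upper() in BASES for a in alt.split(","))


def vcf_snps_to_csv(vcf_lines, out_handle):
    """Write SNP records from an iterable of VCF lines as CSV; return the SNP count."""
    writer = csv.writer(out_handle)
    writer.writerow(["chrom", "pos", "id", "ref", "alt", "info"])
    count = 0
    for line in vcf_lines:
        # Skip header lines ("##..." and "#CHROM...") and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, var_id, ref, alt = fields[:5]
        info = fields[7] if len(fields) > 7 else ""
        if is_snp(ref, alt):
            writer.writerow([chrom, pos, var_id, ref, alt, info])
            count += 1
    return count


if __name__ == "__main__":
    # Tiny made-up example: one SNP, one deletion, one multi-allelic SNP.
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t100\trs1\tA\tG\t50\tPASS\tANN=x\n",
        "chr1\t200\t.\tAT\tA\t50\tPASS\t.\n",
        "chr2\t300\t.\tC\tT,G\t50\tPASS\t.\n",
    ]
    buffer = io.StringIO()
    kept = vcf_snps_to_csv(example, buffer)
    print(kept)
    print(buffer.getvalue())
```

For a real file, pass an open file handle as vcf_lines and another opened with newline="" as out_handle. Sites that mix a SNP allele with an indel allele are dropped by this rule.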

From the rest of your comment, it sounds like you're doing somatic variant calling. Is that right? If so, GATK has a separate best practices workflow for this, though I don't have any personal experience with it.

0

The DNA was extracted from somatic cells. It is surprising that GATK provides more pipelines for germ cells than for somatic ones. The best practices workflow you are referring to does not clearly state whether it covers SNPs. An SNV is not the same as a SNP; they are totally different, so I am confused. Can you comment on this? I have another question. When I open the code extracted from GitHub, I find a .json and a .wdl file. What should I do with these files? What I need is the code. Is the code inside one of these files? How can I read them? Thanks.

0

SNP = single nucleotide polymorphism

SNV = single nucleotide variant

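These two things are subtly different, but they are not totally different. A SNP usually means a single-base variant that is common in a population, whereas SNV covers any single-base change, including somatic mutations, which is why somatic workflows tend to say SNV.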

I'm not sure which GitHub page you're looking at, but if you're referring to the code for GATK, it's written in Java, and the easiest way to run it is to download the latest release and call the wrapper script gatk.
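
On the .json and .wdl files: as far as I know, a .wdl file is a Workflow Description Language script that lists the pipeline's tasks and their command lines, and the accompanying .json file supplies its inputs (reference, BAM paths, and so on) under keys of the form WorkflowName.input_name. Both are plain text. Such workflows are normally executed with a workflow engine such as Cromwell rather than run directly. The small Python sketch below just groups the entries of an inputs JSON by workflow name so you can see what has to be filled in; the workflow and input names in the example are hypothetical.

```python
import json


def group_inputs_by_workflow(json_text):
    """Group 'Workflow.input' keys of a WDL inputs JSON by workflow name."""
    grouped = {}
    for key, value in json.loads(json_text).items():
        # Split only at the first dot: "MyWorkflow.ref_fasta" -> ("MyWorkflow", "ref_fasta").
        workflow, _, name = key.partition(".")
        grouped.setdefault(workflow, {})[name] = value
    return grouped


if __name__ == "__main__":
    # Hypothetical inputs file content; real files ship with each workflow.
    example = '{"MyWorkflow.ref_fasta": "ref.fa", "MyWorkflow.tumor_bam": "tumor.bam"}'
    for workflow, inputs in group_inputs_by_workflow(example).items():
        print(workflow)
        for name, value in sorted(inputs.items()):
            print(f"  {name} = {value}")
    # With Cromwell, a typical invocation looks like:
    #   java -jar cromwell.jar run workflow.wdl --inputs inputs.json
    # (the jar and file names above are placeholders).
```

The command lines you are after sit inside the command sections of the tasks in the .wdl file, so opening it in a text editor is the quickest way to read them.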

0

By code I mean the command-line code needed to run the pipeline, not the download of GATK. Clicking on a best practices workflow leads to a page of pipelines, each linked to a GitHub page where a code file is located.