Hi,
for various reasons I decided to try to understand better the variant calling process. I am interested in human data, however I decided to use some C Elegans data from this paper : http://genome.cshlp.org/content/23/10/1749.full -- it is a diploid organism so it should behave similarly to human data, the genome is much smaller so I expect the software to run faster...
My goal is to understand the process and to write a clear and useful tutorial. I will try to update this post with all comments and information I'll get from you. When it will be complete I may publish it elsewhere on the internet, with a CC-BY-SA license and of course I'll link to biostars.
Here is what I have done so far. I still need to add some text about the files contents etc, but I would like to know first if the pipeline is not missing something too important -- I know it can't be perfect, I have seen many things done elsewhere but I failed to understand their roles. I hope that this first attempt is not too bad, and if important stuff is missing, to get explanations about what's missing and why.
I'll be grateful for all kind of comments, correction, suggestions. Please remember that I am a complete newbie. I gathered information from many places, in particular biostars (I’ll add a list of my sources here in the next few days).
At the moment I don’t cover hereafter the "installing the software process"; I use SRA toolkit, bwa, Picard, GATK, that are installed in the folders referenced in the code chunks below -- this should be readable to anyone familiar with linux. I may add a preliminary section on obtaining and installing the software at some point.
I. Get some of the SRP018046 data
I used this this question/answer from biostar to get a csv file (I named it SRP018046.csv
) describing all data files, as follows.
mkdir -p ~/NGS/fastq/
cd ~/NGS/fastq/
wget 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&rettype=runinfo&db=sra&term=SRP018046' -O - > SRP018046.csv
Here are the first 10 lines with a few interesting columns:
Run Experiment LibraryName LibraryLayout Sample BioSample SampleName
1 SRR957315 SRX339695 VC40080 PAIRED SRS473465 SAMN02334950 MMP_VC40080
2 SRR957316 SRX339696 VC40080 PAIRED SRS473465 SAMN02334950 MMP_VC40080
3 SRR957317 SRX339696 VC40080 PAIRED SRS473465 SAMN02334950 MMP_VC40080
4 SRR957318 SRX339697 VC41013 PAIRED SRS473466 SAMN02334951 MMP_VC41013
5 SRR957319 SRX339697 VC41013 PAIRED SRS473466 SAMN02334951 MMP_VC41013
6 SRR957320 SRX339698 VC40531 PAIRED SRS473467 SAMN02334952 MMP_VC40531
7 SRR957321 SRX339698 VC40531 PAIRED SRS473467 SAMN02334952 MMP_VC40531
8 SRR957322 SRX339698 VC40531 PAIRED SRS473467 SAMN02334952 MMP_VC40531
9 SRR957323 SRX339698 VC40531 PAIRED SRS473467 SAMN02334952 MMP_VC40531
10 SRR957324 SRX339699 VC40531 SINGLE SRS473467 SAMN02334952 MMP_VC40531
So these 10 files are for three different samples, all but one are paired-end reads. This should be a good starting point. It download these files using fastq-dump
from the SRA toolkit, like this.
cd ~/NGS/fastq/
cut -d, -f1 ../SRP018046.csv |tail -n +2 |head -10|xargs --max-procs 4 ~/NGS/sratoolkit/bin/fastq-dump --split-files --gzip
The --split-files
option is needed to get two files for the paired-end data. If omitted, the data are in a non-standard format (both ends are on the same line).
II. Get the reference genome
I got a ftp link from http://www.ensembl.org/Caenorhabditis_elegans/Info/Index My sysadmin never figured out why ftp and lftp did not work on our cluster, but wget was ok:
mkdir ~/NGS/fasta/
cd ~/NGS/fasta/
wget -r -np ftp://ftp.ensembl.org/pub/release-90/fasta/caenorhabditis_elegans/dna/
I unzipped the file Caenorhabditis_elegans.WBcel235.dna.toplevel.fa
(some of the tools later used don't accept it gzipped) and built an index with bwa
~/NGS/bwa/bwa index Caenorhabditis_elegans.WBcel235.dna.toplevel.fa
And a sequence dictionnary (here enters Picard):
PICARD="java -jar ~/NGS/Picard/picard.jar"
$PICARD CreateSequenceDictionary R=fasta/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa
III. Align data with bwa (or: from fasta to sam)
I did it as follows. The -R
options adds a read group to each file, allowing to identify which sample it comes from (eg SM:VC40080
); if we don't do that at some point, there's now way the variant calling step will know about the samples.
mkdir ~/NGS/sam/
cd ~/NGS/
REFERENCE="~/NGS/fasta/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa"
~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957315\tSM:VC40080" fastq/SRR957315_1.fastq.gz fastq/SRR957315_2.fastq.gz > sam/SRR957315.sam
~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957316\tSM:VC40080" fastq/SRR957316_1.fastq.gz fastq/SRR957316_2.fastq.gz > sam/SRR957316.sam
~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957317\tSM:VC40080" fastq/SRR957317_1.fastq.gz fastq/SRR957317_2.fastq.gz > sam/SRR957317.sam
~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957318\tSM:VC41013" fastq/SRR957318_1.fastq.gz fastq/SRR957318_2.fastq.gz > sam/SRR957318.sam
~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957319\tSM:VC41013" fastq/SRR957319_1.fastq.gz fastq/SRR957319_2.fastq.gz > sam/SRR957319.sam
~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957320\tSM:VC40531" fastq/SRR957320_1.fastq.gz fastq/SRR957320_2.fastq.gz > sam/SRR957320.sam
~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957321\tSM:VC40531" fastq/SRR957321_1.fastq.gz fastq/SRR957321_2.fastq.gz > sam/SRR957321.sam
~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957322\tSM:VC40531" fastq/SRR957322_1.fastq.gz fastq/SRR957322_2.fastq.gz > sam/SRR957322.sam
~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957323\tSM:VC40531" fastq/SRR957323_1.fastq.gz fastq/SRR957323_2.fastq.gz > sam/SRR957323.sam
~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957324\tSM:VC40531" fastq/SRR957324_1.fastq.gz > sam/SRR957324.sam
IV. There is more to do before variant calling (creating bam files)
First thing, create a unique bam file.
mkdir ~/NGS/bam
cd ~/NGS/
$PICARD MergeSamFiles \
I=sam/SRR957315.sam \
I=sam/SRR957316.sam \
I=sam/SRR957317.sam \
I=sam/SRR957318.sam \
I=sam/SRR957319.sam \
I=sam/SRR957320.sam \
I=sam/SRR957321.sam \
I=sam/SRR957322.sam \
I=sam/SRR957323.sam \
O=bam/all_samples.bam \
SO=unsorted \
USE_THREADING=true;
We need to sort this file.
$PICARD SortSam I=bam/all_samples.bam O=bam/all_samples_sorted.bam SO=coordinate ;
And to mark PCR Duplicates
$PICARD MarkDuplicates \
I=bam/all_samples_sorted.bam \
O=bam/all_samples_dedup.bam \
ASSUME_SORTED=yes \
METRICS_FILE=bam/metric_file.metrics ;
Last step before variant calling is to index it.
$PICARD BuildBamIndex I=bam/all_samples_dedup.bam ;
V. Last but to not least, call variants with GATK (creating vcf)
mkdir ~/NGS/vcf/
cd ~/NGS/
GATK="java -jar ~/NGS/GATK/GenomeAnalysisTK.jar"
$GATK HaplotypeCaller -I bam/all_samples_dedup.bam -o vcf/essai.vcf -R $REFERENCE
I am eagerly waiting for your feedback!
You can combine a the alignment, sorting and conversion to bam in one step:
That pipeline will detect SNVs and small indels. You may also be interested in detecting copy number changes, larger structural variants, or other variants in regions (such as microsatellites and telomeres) requiring specialised variant callers.