Help in writing NGS pipeline (or: the road from fasta to vcf)
1
3
Entering edit mode
3.5 years ago
Hervé ▴ 30

Hi,

for various reasons I decided to try to understand better the variant calling process. I am interested in human data, however I decided to use some C Elegans data from this paper : http://genome.cshlp.org/content/23/10/1749.full -- it is a diploid organism so it should behave similarly to human data, the genome is much smaller so I expect the software to run faster...

My goal is to understand the process and to write a clear and useful tutorial. I will try to update this post with all comments and information I'll get from you. When it will be complete I may publish it elsewhere on the internet, with a CC-BY-SA license and of course I'll link to biostars.

Here is what I have done so far. I still need to add some text about the files contents etc, but I would like to know first if the pipeline is not missing something too important -- I know it can't be perfect, I have seen many things done elsewhere but I failed to understand their roles. I hope that this first attempt is not too bad, and if important stuff is missing, to get explanations about what's missing and why.

I'll be grateful for all kind of comments, correction, suggestions. Please remember that I am a complete newbie. I gathered information from many places, in particular biostars (I’ll add a list of my sources here in the next few days).

At the moment I don’t cover hereafter the "installing the software process"; I use SRA toolkit, bwa, Picard, GATK, that are installed in the folders referenced in the code chunks below -- this should be readable to anyone familiar with linux. I may add a preliminary section on obtaining and installing the software at some point.

I. Get some of the SRP018046 data

I used this this question/answer from biostar to get a csv file (I named it SRP018046.csv) describing all data files, as follows.

mkdir -p ~/NGS/fastq/
cd ~/NGS/fastq/

wget 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&rettype=runinfo&db=sra&term=SRP018046' -O - > SRP018046.csv


Here are the first 10 lines with a few interesting columns:

         Run Experiment LibraryName LibraryLayout    Sample    BioSample  SampleName
1  SRR957315  SRX339695     VC40080        PAIRED SRS473465 SAMN02334950 MMP_VC40080
2  SRR957316  SRX339696     VC40080        PAIRED SRS473465 SAMN02334950 MMP_VC40080
3  SRR957317  SRX339696     VC40080        PAIRED SRS473465 SAMN02334950 MMP_VC40080
4  SRR957318  SRX339697     VC41013        PAIRED SRS473466 SAMN02334951 MMP_VC41013
5  SRR957319  SRX339697     VC41013        PAIRED SRS473466 SAMN02334951 MMP_VC41013
6  SRR957320  SRX339698     VC40531        PAIRED SRS473467 SAMN02334952 MMP_VC40531
7  SRR957321  SRX339698     VC40531        PAIRED SRS473467 SAMN02334952 MMP_VC40531
8  SRR957322  SRX339698     VC40531        PAIRED SRS473467 SAMN02334952 MMP_VC40531
9  SRR957323  SRX339698     VC40531        PAIRED SRS473467 SAMN02334952 MMP_VC40531
10 SRR957324  SRX339699     VC40531        SINGLE SRS473467 SAMN02334952 MMP_VC40531


So these 10 files are for three different samples, all but one are paired-end reads. This should be a good starting point. It download these files using fastq-dump from the SRA toolkit, like this.

cd ~/NGS/fastq/
cut -d, -f1 ../SRP018046.csv |tail -n +2 |head -10|xargs  --max-procs 4 ~/NGS/sratoolkit/bin/fastq-dump --split-files --gzip


The --split-files option is needed to get two files for the paired-end data. If omitted, the data are in a non-standard format (both ends are on the same line).

II. Get the reference genome

I got a ftp link from http://www.ensembl.org/Caenorhabditis_elegans/Info/Index My sysadmin never figured out why ftp and lftp did not work on our cluster, but wget was ok:

mkdir ~/NGS/fasta/
cd ~/NGS/fasta/
wget -r -np ftp://ftp.ensembl.org/pub/release-90/fasta/caenorhabditis_elegans/dna/


I unzipped the file Caenorhabditis_elegans.WBcel235.dna.toplevel.fa (some of the tools later used don't accept it gzipped) and built an index with bwa

~/NGS/bwa/bwa index Caenorhabditis_elegans.WBcel235.dna.toplevel.fa


And a sequence dictionnary (here enters Picard):

PICARD="java -jar ~/NGS/Picard/picard.jar"
$PICARD CreateSequenceDictionary R=fasta/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa  III. Align data with bwa (or: from fasta to sam) I did it as follows. The -R options adds a read group to each file, allowing to identify which sample it comes from (eg SM:VC40080); if we don't do that at some point, there's now way the variant calling step will know about the samples. mkdir ~/NGS/sam/ cd ~/NGS/ REFERENCE="~/NGS/fasta/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa" ~/NGS/bwa/bwa mem -M$REFERENCE -R "@RG\tID:SRR957315\tSM:VC40080" fastq/SRR957315_1.fastq.gz fastq/SRR957315_2.fastq.gz > sam/SRR957315.sam
~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957316\tSM:VC40080" fastq/SRR957316_1.fastq.gz fastq/SRR957316_2.fastq.gz > sam/SRR957316.sam ~/NGS/bwa/bwa mem -M$REFERENCE -R "@RG\tID:SRR957317\tSM:VC40080" fastq/SRR957317_1.fastq.gz fastq/SRR957317_2.fastq.gz > sam/SRR957317.sam

~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957318\tSM:VC41013" fastq/SRR957318_1.fastq.gz fastq/SRR957318_2.fastq.gz > sam/SRR957318.sam ~/NGS/bwa/bwa mem -M$REFERENCE -R "@RG\tID:SRR957319\tSM:VC41013" fastq/SRR957319_1.fastq.gz fastq/SRR957319_2.fastq.gz > sam/SRR957319.sam

~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957320\tSM:VC40531" fastq/SRR957320_1.fastq.gz fastq/SRR957320_2.fastq.gz > sam/SRR957320.sam ~/NGS/bwa/bwa mem -M$REFERENCE -R "@RG\tID:SRR957321\tSM:VC40531" fastq/SRR957321_1.fastq.gz fastq/SRR957321_2.fastq.gz > sam/SRR957321.sam
~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957322\tSM:VC40531" fastq/SRR957322_1.fastq.gz fastq/SRR957322_2.fastq.gz > sam/SRR957322.sam ~/NGS/bwa/bwa mem -M$REFERENCE -R "@RG\tID:SRR957323\tSM:VC40531" fastq/SRR957323_1.fastq.gz fastq/SRR957323_2.fastq.gz > sam/SRR957323.sam
~/NGS/bwa/bwa mem -M $REFERENCE -R "@RG\tID:SRR957324\tSM:VC40531" fastq/SRR957324_1.fastq.gz > sam/SRR957324.sam  IV. There is more to do before variant calling (creating bam files) First thing, create a unique bam file. mkdir ~/NGS/bam cd ~/NGS/$PICARD MergeSamFiles \
I=sam/SRR957315.sam \
I=sam/SRR957316.sam \
I=sam/SRR957317.sam \
I=sam/SRR957318.sam \
I=sam/SRR957319.sam \
I=sam/SRR957320.sam \
I=sam/SRR957321.sam \
I=sam/SRR957322.sam \
I=sam/SRR957323.sam \
O=bam/all_samples.bam \
SO=unsorted \


We need to sort this file.

$PICARD SortSam I=bam/all_samples.bam O=bam/all_samples_sorted.bam SO=coordinate ;  And to mark PCR Duplicates $PICARD MarkDuplicates \
I=bam/all_samples_sorted.bam \
O=bam/all_samples_dedup.bam \
ASSUME_SORTED=yes \
METRICS_FILE=bam/metric_file.metrics ;


Last step before variant calling is to index it.

$PICARD BuildBamIndex I=bam/all_samples_dedup.bam ;  V. Last but to not least, call variants with GATK (creating vcf) mkdir ~/NGS/vcf/ cd ~/NGS/ GATK="java -jar ~/NGS/GATK/GenomeAnalysisTK.jar"$GATK HaplotypeCaller  -I bam/all_samples_dedup.bam  -o vcf/essai.vcf -R \$REFERENCE


I am eagerly waiting for your feedback!

next-gen alignment • 2.5k views
1
Entering edit mode

You can combine a the alignment, sorting and conversion to bam in one step:

bwa mem reference.fa -R "blabla" reads_1.fastq.gz reads_2.fastq.gz | samtools sort -o alignment.bam

0
Entering edit mode

That pipeline will detect SNVs and small indels. You may also be interested in detecting copy number changes, larger structural variants, or other variants in regions (such as microsatellites and telomeres) requiring specialised variant callers.

4
Entering edit mode
3.5 years ago
JJ ▴ 560

You should follow the GATK best practices - there are a couple of steps missing, especially recalibration and filtering Have a look at GATK best practices

0
Entering edit mode

OK, thanks for the pointer. I find the reference hard to read for the newcomer. Any explanation/advice about these steps?

0
Entering edit mode

You need to do a base recalibration before SNP calling. Look here for a tutorial how to do this. If you do not have known SNPs you might do two rounds of SNP calling and pass the best/filtered SNPs to recalibration instead of the known SNPs. This is to correct for sequencing errors.

I would also recommend calling SNPs in the gvcf mode. This is particularly important if you have more than just a single sample and you can apply joint genotyping, which would be the next step after SNP calling.

After the SNP calling (or joint genotyping) you need to do some filtering. Here you can use variant quality score recalibration (VQSR) if you have truth/training variant sets or hard filtering if you don't.

0
Entering edit mode

You need to do a base recalibration before SNP calling.

It depends... Have a look in the discussion we had here some days ago.