Reproduce Genome in a Bottle analysis from NA12878 fastq?
3
3
Entering edit mode
6 months ago
yussab ▴ 30

Hi Everyone,

I recently heard about the genome in a bottle project.

I found raw fastq data for NA12878 samples from their GitHub repository. Now I'm trying to reproduce their analysis.

https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/NA12878/sequence.index.NA12878_Illumina_HiSeq_Exome_Garvan_fastq_09252015

I would like to know if it's available a reference code to compare my pipeline (not only the final result but the code).

Thank you in advance,

AY

genome in a bottle NA12878 • 388 views
ADD COMMENT
0
Entering edit mode

Publications, bench-marking best practices is all linked from main GIAB project page.

ADD REPLY
5
Entering edit mode
6 months ago

Hi Yussab, as you did not replay, so i do not know what are you planning to do exactly. it seems that you do have your own pipeline which mean you have the final VCF where it's generated from NA12878. i know these 2 tools to compare you result with GIAB Data set. (I do not know if there is something else available)

1- Hap.py (https://github.com/Illumina/hap.py):

hap.py truth.vcf query.vcf -f confident.bed -o output_prefix -r reference.fa

2-RTG tools (https://github.com/RealTimeGenomics/rtg-tools):

generate SDF file from Fasta file (Your reference):

rtg format "Your_Fasta_File".fasta -o "File_Name_ToOut_Put_Result"

Compare the tow VCF file:

rtg vcfeval -b  truth.vcf -c query.vcf -o Directary_To_OUTPUT_THE_RESULT -t PATH_TO_SDF_FILE -e GIAB_BED_FILE.bed  --region= Your_Bed_File.bed

I highly recommend you to read the tools manual for better understanding.

ADD COMMENT
0
Entering edit mode

Thank you very much for the answer and sorry for the delay. Actually I'm still writing my pipeline, but this will be very useful for my next step. I was wondering if there is a reference pipeline by GIAB, to compare with mine.

ADD REPLY
3
Entering edit mode
6 months ago

as far as i know, GIAB provides High_Conf.bed + High_Conf.vcf (Available here ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/), No Fastq files are provided. but Garvan data is available here (ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/) with the bed file where you need it for variants calling and benchmark process.

1- download the reference genome (1-22, X, Y and MT) from any database (i used USCS) join all fasta files into one fasta by 'cat' command in lunix.

2- index your reference

bwa index -a bwtsw  'filesName'.fasta

3- align your read by bwa

bwa mem 'filesName'.fasta 'FileName'_1.fastq 'FileName'_2.fastq > 'FileName'.sam

4-Convert Sam file to Bam file

samtools view -@ '#of Thread you want to use as int.' -bS  'FileName'.sam -o  'FileName'.bam

5-Add read group to bam files as needed by GATK

picard-tools AddOrReplaceReadGroups I='InFileName'.bam O='OutFileName'.bam RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM='SampleName' SORT_ORDER=coordinate CREATE_INDEX=TRUE VALIDATION_STRINGENCY=LENIENT

6-Remove duplicated read

picard-tools MarkDuplicates I='InFileName'.bam O='OutFileName'.bam M=marked_dup_metrics.txt REMOVE_DUPLICATES=true

7-Index last Bam for Varients calling

samtools index -@  '#of Thread you want to use as int.' 'FileName'.bam

8-Varients Calling

gatk HaplotypeCaller --native-pair-hmm-threads ''#of Thread you want to use as int.'' -I='InFileName'.bam -O='OutFileName'.vcf -R='Ref_SeqName'.fasta -L='BedFileName'.bed

9-Now you are ready to compare you VCF (Quary.vcf) with GIAB VCF (True.vcf)

rtg vcfeval  -b True.vcf regions='GIAB_BedFile'.bed -c Quary.vcf --evaluation-regions='YourBedFile' -o 'Path to output the reult' -t 'Path to SDF file'

Note:

  1. Do not forget to create SDF file from you reference sequence by rtg tools format
  2. Please see 'https://cdn.rawgit.com/RealTimeGenomics/rtg-tools/master/installer/resources/tools/RTGOperationsManual/index.html' for better understanding
  3. Use Fastqc 'or similar' to check the data quality and trim them if needed

after you achieve the result you want, try to make scrip from the above code in order to process sample in batches, but it the result still not reach you needs read in each tool documents and change the parameter accordingly.

I hope it was clear! Best wishes!

ADD COMMENT
0
Entering edit mode

Hi a.alnawfal.1992, thank you for your kind answer... So it's not necessary to process the Variant Filtration/Refine Genotypes/Variant Annotation steps??? In this way you use raw vcf snps + indels??

ADD REPLY
0
Entering edit mode

Hi, it's completely depends on what you want to achieve. For example if you Decided to filter variants that have DP less than 5 , so your result is only for variants that have DP more than 5 in terms of false positive, false negative,..etc.. regarding annotation it's very important for detecting disease causes mutation, but you experiment for benchmark the variants not for detecting the disease causing mutation.

ADD REPLY
1
Entering edit mode
6 months ago

Hi yussab, few months ago i were working on the same project, but the NA12878 fastq files is generated in house from our sequencer facility. i followed GATK best practice (bwa mem, picard tools and GATK haplotypecaller) and i use our bed file and VCF file with GIAB VCF and High_Conf.bed. to benchmark the variants by both rtg tools and Hap.py Hap.py use vcfeval from rtg). and the result from both is same. lastly i made a bash script to process samples in batch by applying for loop into fastq files. Regarding Garvan samples it's ok to used them but i did not. happy to share my experience with you !, but which reference genome are going to use ? are going to include the alt, unplaced and unlocalized count ? are going to use bwa for alignments or other aligner ? gatk haplotypecaller or samtools mpileup ? what is you filtering createria ? and the most important question why you want to do this ?

ADD COMMENT
0
Entering edit mode

Thank you for your kind answer. For the moment I'm not gonna include alt, unplaced and unlocalized count I'm using bwa and gatk haplotypecaller following the gatk best practices. I'm doing this analysis for learning purpose, I wanted to start from GIAB fastq down to their VCF.

ADD REPLY

Login before adding your answer.

Traffic: 1916 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6