I am validating an in-house pipeline for calling SNPs and INDELs in small genomes. For this purpose I am using the GIAB NA12878 HiSeq 2500 300X coverage dataset: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/
I have downloaded all FASTQ files from this folder and merged the forward reads from the two lanes into a single FASTQ file, and did the same for the reverse reads.
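For reference, the lane merge described above can be sketched in Python as follows. This is only a minimal sketch: the function name `merge_fastq_gz` and the file names in the usage comment are hypothetical placeholders, not the actual GIAB file names. It relies on the fact that gzip files concatenated byte-for-byte still form a valid gzip stream.

```python
import shutil


def merge_fastq_gz(inputs, output):
    """Concatenate gzipped FASTQ files byte-for-byte into one gzip file.

    The inputs are sorted so that the forward and reverse merges use the
    same lane order, which keeps read pairs in sync between the two files.
    """
    with open(output, "wb") as out:
        for path in sorted(inputs):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Hypothetical usage (placeholder file names):
# merge_fastq_gz(["lane1_R1.fastq.gz", "lane2_R1.fastq.gz"], "merged_R1.fastq.gz")
# merge_fastq_gz(["lane1_R2.fastq.gz", "lane2_R2.fastq.gz"], "merged_R2.fastq.gz")
```

The important detail is that the lanes are merged in the same order for both read directions, otherwise the pairs no longer line up.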
- Do I need to download and merge reads from the other folders as well, for example Sample_U0b, Sample_U0c and so on, together with Sample_U0a? Will the files from Sample_U0a alone give 300X coverage? I cannot find any explanation of whether to merge the files from all samples or to use just one sample.
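One way to check the coverage question empirically would be to count the sequenced bases in the downloaded files and divide by the length of the target. The sketch below assumes standard four-line FASTQ records; the function names are hypothetical, and the human genome length in the usage comment (about 3.1e9 bp) is an approximation I am assuming, not a value taken from the GIAB documentation.

```python
import gzip


def count_bases(fastq_gz_paths):
    """Sum the lengths of all sequence lines in gzipped FASTQ files.

    Assumes four lines per record, with the sequence on the second line.
    """
    total = 0
    for path in fastq_gz_paths:
        with gzip.open(path, "rt") as handle:
            for i, line in enumerate(handle):
                if i % 4 == 1:
                    total += len(line.rstrip("\r\n"))
    return total


def mean_coverage(total_bases, target_length):
    """Expected mean depth: sequenced bases divided by target length."""
    return total_bases / target_length


# Hypothetical usage (placeholder file names, approximate genome length):
# bases = count_bases(["merged_R1.fastq.gz", "merged_R2.fastq.gz"])
# print(mean_coverage(bases, 3.1e9))
```

This only gives the raw sequenced depth; the depth after mapping and filtering will be somewhat lower.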
The samples I usually deal with are ssDNA viruses or dsDNA plasmids, 3000 - 12000 bp long.
- Can I map the NA12878 reads to a specific gene (of the same length as above) and extract the subset of reads mapped to that region, which I can then use for further SNP and INDEL analysis? Will this approach work for validation?
Previously I was using the PhiX dataset from Illumina to validate the pipeline. The problem is that it only has validated SNPs, not INDELs.
- Is there any plasmid/viral dataset other than PhiX that I can use for validation? It should contain both SNPs and INDELs (10 bp or longer) at different variant frequencies.
Thanks in advance!!