We are a group of about 20 rising sophomores and juniors who are interested in learning new concepts in genetics and genomics together. We would appreciate your help with the following questions about analyzing human genome sequences.
A little context: One student in our group has a Senegalese father and a Japanese mother. They had their genomes sequenced by shotgun sequencing at 30X coverage and very generously shared their data with us in the following file types: FASTQ, CRAM, CRAI, VCF, and TBI.
Our questions are:
Is there a detailed tutorial you would recommend that we can use to predict disease states by comparing the VCF file (given to us) against the ClinVar database? Is it possible to do this with locally installed software and database(s)?
Does having parents of different ethnicities complicate the use and/or interpretation of the ClinVar database?
Is there a detailed tutorial on how to convert a CRAM file back to genome sequence? This would require us to know which reference the reads were aligned to, in order to convert the alignments back to sequences, right?
For human Illumina-based NGS FASTQ sequences, is there a standard pipeline for de novo genome assembly without a reference? If yes, then please share link(s) and tutorials. Thank you.
For any given assembled human genome, is there a standard pipeline for genome annotation? If yes, then please share link(s) and tutorials. Thanks again.
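Regarding the CRAM question above: our understanding is that the SAM/CRAM header lists each reference sequence on an @SQ line, often with an M5 (MD5 checksum) tag and sometimes a UR (URI) or AS (assembly) tag, so the header may already tell us which reference was used. Below is a minimal sketch of how we might inspect it, assuming the header text has first been dumped with `samtools view -H`. The function names and the example header (including its placeholder checksums and file path) are hypothetical and not taken from the real data.

```python
def parse_sq_lines(header_text):
    """Return one dict of tag -> value for each @SQ line in a SAM/CRAM header."""
    records = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = line.split("\t")[1:]
        # Each field looks like TAG:VALUE. Split only on the first colon,
        # because values such as UR URIs contain colons themselves.
        tags = dict(f.split(":", 1) for f in fields if ":" in f)
        records.append(tags)
    return records


def summarize_reference(records):
    """Collect names plus any AS/UR/M5 hints about which reference was used."""
    return {
        "n_sequences": len(records),
        "names": [r.get("SN") for r in records],
        "assemblies": sorted({r["AS"] for r in records if "AS" in r}),
        "uris": sorted({r["UR"] for r in records if "UR" in r}),
        "md5_by_name": {
            r["SN"]: r["M5"] for r in records if "SN" in r and "M5" in r
        },
    }


# Hypothetical header for illustration only; the M5 values are placeholders,
# not real checksums, and the UR path does not exist.
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\tM5:0123456789abcdef0123456789abcdef"
    "\tUR:file:/refs/hypothetical.fa\n"
    "@SQ\tSN:chrM\tLN:16569\tM5:fedcba9876543210fedcba9876543210\n"
)

summary = summarize_reference(parse_sq_lines(example_header))
print(summary["names"])
print(summary["uris"])
```

If the M5 checksums in our real header match those of a known reference build, we assume that is the FASTA we would need to supply (for example with `samtools view -T`) when decoding the CRAM. Please correct us if this is the wrong approach.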
Through some postdocs we know, we have access to some HPCC accounts, so we can run more than 10 CPUs at a time with more than 100 GB of memory.
Thanks in advance for your advice and suggestions, and for sharing relevant links to software and tutorials.