I have whole-genome resequencing Illumina reads from two contrasting genotypes.
I have a few questions regarding GATK analysis.
Objective: I want to identify the homozygous SNPs and indels between these two genotypes by mapping the raw reads against the reference genome.
Which pre-filtering parameters do I need to take care of before starting the GATK pipeline?
I have already removed adapters and low-quality bases from the reads. Do I also need to remove repetitive reads? If so, please suggest how to do it. What other read pre-filtering parameters should I look at?
In the GATK pipeline, why do we create a sequence dictionary, and where is it used? What is the role of assigning a read group? How do I assign a read group: does it need specific fields, or can I use any arbitrary name?
Create sequence dictionary
java -jar ~/bin/picard-tools-1.8.5/CreateSequenceDictionary.jar REFERENCE=reference.fasta OUTPUT=reference.dict
Align reads and assign read group
bwa mem -R "@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01" reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam
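For illustration, here is a minimal sketch of how the read group string in the command above could be assembled rather than typed by hand. The fields are not arbitrary: ID identifies the read group (commonly flowcell plus lane), SM is the sample name, LB the library, and PL the platform. The function name `make_rg` and the example header are hypothetical, and the sketch assumes the FASTQ headers follow the Illumina CASAVA 1.8+ layout (`@instrument:run:flowcell:lane:tile:x:y`); older or renamed headers would need different field positions.

```shell
# Hypothetical helper: build an @RG string from an Illumina read header.
# Assumed header layout (CASAVA 1.8+): @instrument:run:flowcell:lane:tile:x:y
make_rg() {
    header="$1"   # first line of the FASTQ file
    sample="$2"   # sample name, written to SM
    library="$3"  # library name, written to LB
    flowcell=$(printf '%s\n' "$header" | cut -d: -f3)
    lane=$(printf '%s\n' "$header" | cut -d: -f4)
    # '\\t' prints a literal backslash-t, which bwa mem -R expects
    printf '@RG\\tID:%s.%s\\tPL:ILLUMINA\\tLB:%s\\tSM:%s\n' \
        "$flowcell" "$lane" "$library" "$sample"
}

# Example with a made-up header line
make_rg '@M01234:55:FLOWCELL1:1:1101:15589:1332 1:N:0:1' PA01 test
```

The printed string could then be passed to the aligner as `bwa mem -R "$(make_rg "$header" PA01 test)" ...`, where `$header` would be read from the first line of the FASTQ file, for example with `gzip -dc R1.fastq.gz | head -n 1`.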
I have formatted your code correctly. In future, use the icon shown below (after highlighting the text you want to format as code) when editing (screenshot courtesy of @Wouter).
You are certainly boosting my citations, my supervisor will be pleased!
(but OP doesn't learn what we suggest about formatting so it's kinda useless)
If you take the time to search Biostars, you will find answers to all of those questions.