First, I have several Aspergillus flavus (A kind of fungi species) illumina sequencing raw data as pair of fastq.gz file (
sample1_filtered_2.fastq.gz). And I wanted to assemble illumina fragment sequences and make SNP(single nucleotide polymorphism) analysis with the reference genome, Aspergillus flavus NRRL3357 as fasta file. At the end of the analysis, vcf (variant calling file) file had to be generated. So I made the pipeline based on old book(2018) and help of chatGPT.
Here is the command line I consisted.
1. Make index of reference genome.
bwa index -a bwtsw NRRL3357.fa
2. Assemble illumina fragments with reference genome and generate SAM format file.
bwa mem -R "@RG\tID:BP2-1\tSM:BP2-1\tPL:ILLUMINA\tPI:330" NRRL3357.fa BP2-1_filtered_1.fastq.gz BP2-1_filtered_2.fastq.gz > BP2-1.sam
3. Convert SAM format into BAM format.
samtools view -bhS BP2-1.sam > BP2-1.bam
4. Sorting and mark duplicated fragments
PICARD SortSam \ I=BP2-1.bam \ O=sorted_BP2-1.bam \ SO=coordinate PICARD MarkDuplicates \ I=sorted_BP2-1.bam \ O=dedup_sorted_BP2-1.bam \ METRICS_FILE=duplicates \ REMOVE_DUPLICATES=true \ CREATE_INDEX=True
5. Make reference genome dictionary for GATK.
PICARD CreateSequenceDictionary \ R=NRRL3357.fa \ O=NRRL3357.dict
6. Make reference genome index for GATK
samtools faidx NRRL3357.fa
7. Comparing BAM file of sample and reference genome fasta and generate VCF file.
gatk HaplotypeCaller \ -R NRRL3357.fa \ -I dedup_sorted_BP2-1.bam \ -O BP2-1.vcf -ERC GVCF -stand_call_conf 30.0
8. VCF filtering process.
gatk VariantFiltration \ -V 100-8.vcf \ --filter-expression "QD < 2.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || FS > 60.0 || SOR > 3.0" \ --filter-name "my_snp_filter" \ -O 100-8_filtered.vcf
The programs worked with no error, but some warning message. And the running time was too short and final VCF files contents were just like this.
#CHROM, POS, ID, RED, ALT, QUAL, FILTER, INFO, FORMAT, BP2-1 NC 054691.1, 1, -, T, <NON_REF>, -, -, END=4 GT:DP:GO:MIN_DP:PL, 0/0:1:0:1:0,0,0
The VCF file I wanted was like this.
#CHROM, POS, ID, RED, ALT, QUAL, FILTER, INFO, FORMAT, BP2-1 NC_036435.1, 330, -, A, G, 213, -, DP=226;VDB=0.990697;SGB=164.624;RPB=0.42072;MQB=1;MQSB=1;BQB=0.0015177;MQ0F=0;ICB=0.416667;HOB=0.375;AC=2;AN=8;DP4=110,29,66,12;MQ=60, GT:PL:DP4, ./.:0,0,0:0,0,0,0
So what is the major problem with my command-line?
I've formatted your post so it looks better. Please check to make sure nothing is lost in translation. - I removed a few backslashes from the 2 VCF blocks and the CreateSequenceDictionary block.
Also, how well do you know variant calling? "Old book + ChatGPT" gives us no idea of your level of knowledge.
Thanks for your help :)
I read basic Genome data analysis book for beginner and I just followed the command-line to generate VCF file, but there was lot of error, so I got assistance from chatGPT. That's all I can explain.