hello to all the community, I am running variant calling and filtering, using as input, creole pig genomes, together with the reference genome: GCF_000003025.6_Sscrofa11.1_genomic.fna. It is the first time that I work, basically I do it in a self-taught way, I reviewed bibliography about it and I was able to channel the workflow: 1)QC Quality 2)Trimming (TrimGalore) 3)Mapping (BWA) 4)Call of Variant (SAMTools) 5)Filtering (BCFTools/VCFTools)
This is my script that I am running through SLURM on a HPC server:
#!/bin/bash
#SBATCH --job-name=bcf
#SBATCH --nodes=1
#SBATCH -c 120
#SBATCH --mem=1T
#SBATCH -p sm
#SBATCH -o bcf_%j.out
#SBATCH -e bcf_%j.err
# Load required modules
module load bcftools/1.21
module load vcftools/0.1.16
# Variables
REF_GENOME="/home/igbi/investigadores/paul.fernandez/genomas_pig/index/GCF_000003025.6_Sscrofa11.1_genomic.fna"
BAM_DIR="/home/igbi/investigadores/paul.fernandez/genomas_pig/BAM"
BAM_FILE="$BAM_DIR/*.sorted.bam" # Para todo los archivos
VCF_OUTPUT="$BAM_DIR/variants.vcf"
FILTERED_VCF="$BAM_DIR/filtered_variants.vcf.gz"
# Call SNPs and generate VCFs
bcftools mpileup -f $REF_GENOME $BAM_FILE | \
bcftools call -mv -Ov -o $VCF_OUTPUT
# Compress the VCF file
bgzip $VCF_OUTPUT
VCF_OUTPUT_GZ="$VCF_OUTPUT.gz"
# Eliminate INDELs
bcftools filter -e 'TYPE="indel"' $VCF_OUTPUT_GZ -o $BAM_DIR/without_indels.vcf
# Filter variants
bcftools filter -s LowQual -e 'QUAL<20 || DP<10 || AO<3' $BAM_DIR/without_indels.vcf -o $FILTERED_VCF
echo "Process completed"
However, the process takes 3 days in the variant call, I would like to know if it is in the logic of the process and the delay is due to the volume of information. The ultimate goal is to build a phylogeny from the filtered SNPs.
the file being generated is: "variants.vcf" and has a weight of 101,169,152 KB
Thank you for your response.
split by chromosome/region, run in parallel, use a cluster manager like snakemake/nextflow, use an uncompressed BCF format for the intermediate files.
thank you very much for the answer, I have the nextflow module installed on the server, could you guide me on how to work it? Do you mean to channel the process through nextflow? Thank you
something like this, not tested
alternatively you can submit an array job to the SLURM cluster , where each job is a chromosome/region as Pierre Lindenbaum suggests.
This will only require minimal changes to the script you already have.