Question

variant calling and filtering

0

Entering edit mode

6 months ago

Paul • 0

hello to all the community, I am running variant calling and filtering, using as input, creole pig genomes, together with the reference genome: GCF_000003025.6_Sscrofa11.1_genomic.fna. It is the first time that I work, basically I do it in a self-taught way, I reviewed bibliography about it and I was able to channel the workflow: 1)QC Quality 2)Trimming (TrimGalore) 3)Mapping (BWA) 4)Call of Variant (SAMTools) 5)Filtering (BCFTools/VCFTools)

This is my script that I am running through SLURM on a HPC server:

#!/bin/bash
#SBATCH --job-name=bcf
#SBATCH --nodes=1
#SBATCH -c 120
#SBATCH --mem=1T  
#SBATCH -p sm                     
#SBATCH -o bcf_%j.out    
#SBATCH -e bcf_%j.err

# Load required modules
module load bcftools/1.21
module load vcftools/0.1.16

# Variables
REF_GENOME="/home/igbi/investigadores/paul.fernandez/genomas_pig/index/GCF_000003025.6_Sscrofa11.1_genomic.fna"
BAM_DIR="/home/igbi/investigadores/paul.fernandez/genomas_pig/BAM"
BAM_FILE="$BAM_DIR/*.sorted.bam"  # Para todo los archivos
VCF_OUTPUT="$BAM_DIR/variants.vcf"
FILTERED_VCF="$BAM_DIR/filtered_variants.vcf.gz"

# Call SNPs and generate VCFs
bcftools mpileup -f $REF_GENOME $BAM_FILE | \
bcftools call -mv -Ov -o $VCF_OUTPUT

# Compress the VCF file
bgzip $VCF_OUTPUT
VCF_OUTPUT_GZ="$VCF_OUTPUT.gz"

# Eliminate INDELs
bcftools filter -e 'TYPE="indel"' $VCF_OUTPUT_GZ -o $BAM_DIR/without_indels.vcf

# Filter variants
bcftools filter -s LowQual -e 'QUAL<20 || DP<10 || AO<3' $BAM_DIR/without_indels.vcf -o $FILTERED_VCF

echo "Process completed"

However, the process takes 3 days in the variant call, I would like to know if it is in the logic of the process and the delay is due to the volume of information. The ultimate goal is to build a phylogeny from the filtered SNPs.

the file being generated is: "variants.vcf" and has a weight of 101,169,152 KB

Thank you for your response.

VCFTools Samtools BCFtools • 674 views

ADD COMMENT • link updated 6 months ago by lieven.sterck 15k • written 6 months ago by Paul • 0

0

Entering edit mode

However, the process takes 3 days in the variant call

split by chromosome/region, run in parallel, use a cluster manager like snakemake/nextflow, use an uncompressed BCF format for the intermediate files.

ADD REPLY • link 6 months ago by Pierre Lindenbaum 166k

0

Entering edit mode

thank you very much for the answer, I have the nextflow module installed on the server, could you guide me on how to work it? Do you mean to channel the process through nextflow? Thank you

ADD REPLY • link 6 months ago by Paul • 0

0

Entering edit mode

something like this, not tested

workflow {
  ch1 = Channel.of([file(params.vcf),  file(params.vcf_idx)])
 contigs  = CONTIGS(ch1)
 ch2 = FILTER(contigs.output.splitText().map{it.trim()}.combine(ch1))
 MERGE(ch2.collect())
}

process CONTIGS {
input:
     tuple path(vcf),path(vcf_idx)
output:
    path("contigs.txt"),emit:output
script:
"""
bcftools index -s "${vcf}" | cut -f1 > contigs.txt
"""
}

process FILTER {
input:
     tuple val(contig),path(vcf),path(vcf_idx)
output:
    path("'${contig}.out.*"),emit:output
script:
"""
bcftools view -o u  "${vcf}" "${contig}" | .blablalabla | bcftools view --write-index -O b '${contig}.out.bcf"
"""
}

process MERGE {
input:
    path("INPUT/*")
script:
"""
## something with bcftools concat
"""
}

ADD REPLY • link 6 months ago by Pierre Lindenbaum 166k

0

Entering edit mode

alternatively you can submit an array job to the SLURM cluster , where each job is a chromosome/region as Pierre Lindenbaum suggests.

This will only require minimal changes to the script you already have.

ADD REPLY • link 6 months ago by lieven.sterck 15k