variant calling and filtering
Paul • 6 months ago

Hello to all the community. I am running variant calling and filtering, using Creole pig genomes as input together with the reference genome GCF_000003025.6_Sscrofa11.1_genomic.fna. This is the first time I do this kind of work; I am basically self-taught, I reviewed the literature on the subject and put together the following workflow: 1) Quality control (QC) 2) Trimming (Trim Galore) 3) Mapping (BWA) 4) Variant calling (SAMtools/BCFtools) 5) Filtering (BCFtools/VCFtools)
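
For reference, steps 1-3 look roughly like this per sample (a minimal sketch only: FastQC as the QC tool, the paired-end file names, the read group and the thread counts are placeholders/assumptions, not my exact commands):

#!/bin/bash
# Sketch of steps 1-3 for one sample; file names and options are illustrative
REF_GENOME="/home/igbi/investigadores/paul.fernandez/genomas_pig/index/GCF_000003025.6_Sscrofa11.1_genomic.fna"
SAMPLE="pig01"                      # placeholder sample name
R1="${SAMPLE}_R1.fastq.gz"
R2="${SAMPLE}_R2.fastq.gz"

# 1) Quality control (FastQC assumed here)
fastqc "$R1" "$R2"

# 2) Trimming with Trim Galore (paired-end mode; outputs *_val_1.fq.gz / *_val_2.fq.gz)
trim_galore --paired "$R1" "$R2"

# 3) Mapping with BWA-MEM, then coordinate-sort and index
bwa index "$REF_GENOME"             # only once, for the reference
bwa mem -t 8 -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}" "$REF_GENOME" \
    "${SAMPLE}_R1_val_1.fq.gz" "${SAMPLE}_R2_val_2.fq.gz" \
  | samtools sort -@ 8 -o "${SAMPLE}.sorted.bam" -
samtools index "${SAMPLE}.sorted.bam"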

This is the script that I am running through SLURM on an HPC server:

#!/bin/bash
#SBATCH --job-name=bcf
#SBATCH --nodes=1
#SBATCH -c 120
#SBATCH --mem=1T  
#SBATCH -p sm                     
#SBATCH -o bcf_%j.out    
#SBATCH -e bcf_%j.err

# Load required modules
module load bcftools/1.21
module load vcftools/0.1.16

# Variables
REF_GENOME="/home/igbi/investigadores/paul.fernandez/genomas_pig/index/GCF_000003025.6_Sscrofa11.1_genomic.fna"
BAM_DIR="/home/igbi/investigadores/paul.fernandez/genomas_pig/BAM"
BAM_FILE="$BAM_DIR/*.sorted.bam"  # all the sorted BAM files (the glob expands when used unquoted below)
VCF_OUTPUT="$BAM_DIR/variants.vcf"
FILTERED_VCF="$BAM_DIR/filtered_variants.vcf.gz"

# Call SNPs and generate VCFs
bcftools mpileup -f $REF_GENOME $BAM_FILE | \
bcftools call -mv -Ov -o $VCF_OUTPUT

# Compress the VCF file
bgzip $VCF_OUTPUT
VCF_OUTPUT_GZ="$VCF_OUTPUT.gz"

# Eliminate INDELs
bcftools filter -e 'TYPE="indel"' $VCF_OUTPUT_GZ -o $BAM_DIR/without_indels.vcf

# Filter variants
bcftools filter -s LowQual -e 'QUAL<20 || DP<10 || AO<3' $BAM_DIR/without_indels.vcf -o $FILTERED_VCF

echo "Process completed"

However, the variant calling alone takes 3 days. I would like to know whether the logic of the workflow is sound and whether the delay is simply due to the volume of data. The ultimate goal is to build a phylogeny from the filtered SNPs.

The file being generated is "variants.vcf" and its size is 101,169,152 KB (roughly 96 GB).

Thank you for your response.

Tags: VCFTools, Samtools, BCFtools
"the variant calling alone takes 3 days"

Split by chromosome/region, run the per-region jobs in parallel, use a workflow manager like Snakemake/Nextflow, and use uncompressed BCF format for the intermediate files.
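
For example, an untested sketch (the contigs.txt / jobs.txt / calls.*.bcf names are made up; REF_GENOME and BAM_DIR are the variables from your script, and the .fai comes from samtools faidx on the reference):

# one calling command per contig; -Ou keeps the pipe between mpileup and call uncompressed
cut -f1 "${REF_GENOME}.fai" > contigs.txt
while read CHR; do
    echo "bcftools mpileup -Ou -f $REF_GENOME -r $CHR $BAM_DIR/*.sorted.bam | bcftools call -mv -Ob -o $BAM_DIR/calls.$CHR.bcf"
done < contigs.txt > jobs.txt

# run the jobs in parallel, e.g. with GNU parallel: parallel -j 20 < jobs.txt
# (or a SLURM array, or a workflow manager), then concatenate in reference (fai) order:
bcftools concat -Ob -o "$BAM_DIR/variants.bcf" $(awk -v d="$BAM_DIR" '{print d"/calls."$1".bcf"}' contigs.txt)
bcftools index "$BAM_DIR/variants.bcf"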

Thank you very much for the answer. I have the Nextflow module installed on the server; could you guide me on how to use it? Do you mean running the whole process through Nextflow? Thank you.

Something like this (not tested):

// untested sketch: take an existing bgzipped + indexed VCF (params.vcf / params.vcf_idx),
// filter each contig in parallel, then merge the per-contig results
workflow {
    ch1 = Channel.of([file(params.vcf), file(params.vcf_idx)])
    contigs = CONTIGS(ch1)
    ch2 = FILTER(contigs.output.splitText().map{ it.trim() }.combine(ch1))
    MERGE(ch2.output.collect())
}

// list the contigs present in the VCF from the index statistics
process CONTIGS {
input:
    tuple path(vcf), path(vcf_idx)
output:
    path("contigs.txt"), emit: output
script:
"""
bcftools index -s "${vcf}" | cut -f1 > contigs.txt
"""
}

// filter one contig; swap the middle command for your own filtering steps
// (here the indel-removal step from your script is plugged in as an example)
process FILTER {
input:
    tuple val(contig), path(vcf), path(vcf_idx)
output:
    path("${contig}.out.*"), emit: output
script:
"""
bcftools view -Ou "${vcf}" "${contig}" |
    bcftools filter -e 'TYPE="indel"' -Ou - |
    bcftools view --write-index -Ob -o "${contig}.out.bcf" -
"""
}

// concatenate the per-contig BCFs back into one file
process MERGE {
input:
    path("INPUT/*")
output:
    path("merged.bcf")
script:
"""
bcftools concat --allow-overlaps -Ob -o merged.bcf INPUT/*.bcf
"""
}
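
To run it, the input VCF must be bgzipped and indexed first; main.nf and the --vcf/--vcf_idx parameter names are just the ones assumed in the sketch above:

bcftools index -t variants.vcf.gz          # creates variants.vcf.gz.tbi
nextflow run main.nf --vcf variants.vcf.gz --vcf_idx variants.vcf.gz.tbi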
Alternatively, you can submit an array job to the SLURM cluster, where each array task handles one chromosome/region, as Pierre Lindenbaum suggests.

This will only require minimal changes to the script you already have; see the sketch below.
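
An untested sketch reusing the reference and BAM paths from your script — the contigs.txt file (one contig name per line, e.g. from cut -f1 on the reference .fai), the 1-18 array range and the per-task resources are assumptions you would adjust:

#!/bin/bash
#SBATCH --job-name=bcf_array
#SBATCH --array=1-18%6           # one task per line of contigs.txt; %6 caps concurrent tasks
#SBATCH -c 4
#SBATCH --mem=16G
#SBATCH -p sm
#SBATCH -o bcf_%A_%a.out
#SBATCH -e bcf_%A_%a.err

module load bcftools/1.21

REF_GENOME="/home/igbi/investigadores/paul.fernandez/genomas_pig/index/GCF_000003025.6_Sscrofa11.1_genomic.fna"
BAM_DIR="/home/igbi/investigadores/paul.fernandez/genomas_pig/BAM"

# pick the contig for this array task (contigs.txt: one contig name per line)
CHR=$(sed -n "${SLURM_ARRAY_TASK_ID}p" contigs.txt)

bcftools mpileup -Ou -f "$REF_GENOME" -r "$CHR" "$BAM_DIR"/*.sorted.bam \
  | bcftools call -mv -Ob -o "$BAM_DIR/calls.${CHR}.bcf"

# once all tasks have finished: bcftools concat the calls.*.bcf files, then filter as before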
