Tutorial: Running fastq_screen on your data
8 months ago

FastQ Screen is a wonderful QC tool for FASTQ files that you can use to identify the source of contamination in your data. Lately, though, configuring the tool has turned into a nightmare, because adding the databases has been failing. Here I write down the steps I took to successfully configure the fastq_screen.conf file.

  • Install fastq_screen using conda/mamba
conda create -n fastq_screen
conda activate fastq_screen
conda install -c bioconda fastq-screen
which fastq_screen

My fastq_screen lives at miniconda3/envs/fastqscreen/bin/fastq_screen (my environment is named fastqscreen; with the commands above yours would be fastq_screen). Looking in the bin folder shows that this file is a symlink.

The example configuration file, fastq_screen.conf.example, is in miniconda3/envs/fastqscreen/share/fastq-screen-0.14.0-1/. I make a copy of this file, name it fastq_screen.conf, and edit the copy.
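
The copy step can be sketched as a small shell function. This is only a sketch: the function name copy_example_conf is made up for illustration, and it assumes the example file sits somewhere under the share/ directory of the environment, as it does in my install. The versioned directory name (fastq-screen-0.14.0-1 here) may differ between releases, so the function searches for the file instead of hard-coding it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy the example config shipped with the conda
# package to a working file that can be edited.
copy_example_conf() {
    local prefix="$1"                      # conda environment prefix
    local dest="${2:-fastq_screen.conf}"   # where to write the working copy
    local example
    # The versioned directory name varies by release, so search for the file.
    example=$(find "$prefix/share" -name 'fastq_screen.conf.example' 2>/dev/null | head -n 1)
    if [ -z "$example" ]; then
        echo "no fastq_screen.conf.example under $prefix/share" >&2
        return 1
    fi
    cp "$example" "$dest"
    echo "copied $example -> $dest"
}

# Usage (inside the activated environment, where conda sets CONDA_PREFIX):
#   copy_example_conf "$CONDA_PREFIX" ./fastq_screen.conf
```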

  • When I install fastq_screen, bowtie and bowtie2 get installed automatically. You can set the paths of these tools by uncommenting the corresponding lines in fastq_screen.conf as follows:
BOWTIE  /miniconda3/envs/fastqscreen/bin/bowtie
BOWTIE2 /miniconda3/envs/fastqscreen/bin/bowtie2
BWA /sw/csi/bwa/0.7.17/el7_gnu6.4.0/bin/bwa

Since I am working on a cluster that already has bwa installed, I didn't download it separately. I load the module (module load bwa) each time I run fastq_screen.
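
A wrong aligner path is an easy mistake to make here, so it may help to check the paths before running anything. The function below is a rough sketch (check_aligners is a made-up name, not part of fastq_screen): it reads the uncommented BOWTIE, BOWTIE2 and BWA lines from the config and reports whether each path is an executable file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether each aligner path set in the config
# points at an executable. Returns 1 if any path is missing.
check_aligners() {
    local conf="$1" key path status=0
    # Commented lines start with '#', so their first field never matches.
    while read -r key path _ || [ -n "$key" ]; do
        case "$key" in
            BOWTIE|BOWTIE2|BWA)
                if [ -x "$path" ]; then
                    echo "ok      $key $path"
                else
                    echo "missing $key $path"
                    status=1
                fi
                ;;
        esac
    done < "$conf"
    return "$status"
}

# Usage:
#   check_aligners fastq_screen.conf
```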

  • The database configuration is the laborious part. I had to download the genome of each organism separately and index it. I keep all my bwa indexes in a separate directory.
## Human - sequences available from
## ftp://ftp.ensembl.org/pub/current/fasta/homo_sapiens/dna/
DATABASE        Human   /path_to_indexes/GRCh38.primary_assembly.genome.fa
##
## Mouse - sequence available from
## ftp://ftp.ensembl.org/pub/current/fasta/mus_musculus/dna/
DATABASE        Mouse   /path_to_indexes_directory/GRCm39/GRCm39.primary_assembly.genome.fa
##
## Ecoli- sequence available from EMBL accession U00096.2
DATABASE        Ecoli   /path_to_indexes_directory/Ecoli/Ecoli.ASM160652v1.fasta
##
## PhiX - sequence available from Refseq accession NC_001422.1
DATABASE        PhiX    /path_to_indexes_directory/PhiX/PhiX.fasta
##
## Adapters - sequence derived from the FastQC contaminants file found at: www.bioinformatics.babraham.ac.uk/projects/fastqc
DATABASE        Adapters        /path_to_indexes_directory/Adapters/adapters.fasta
##
## Vector - Sequence taken from the UniVec database
## http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html
DATABASE        Vectors         /path_to_indexes_directory/Vectors/UniVec.fasta
## Pvivax - Sequence taken from PlasmoDB
##  https://plasmodb.org/common/downloads/release-56/PvivaxP01/fasta/data/PlasmoDB-56_PvivaxP01_Genome.fasta
DATABASE        Pvivax  /path_to_indexes_directory/Pvivax/PlasmoDB-56_PvivaxP01_Genome.fasta
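
Before launching a long run, it can be worth confirming that every DATABASE entry really has an index behind it. The sketch below assumes the indexes were built with bwa index, which writes .amb, .ann, .bwt, .pac and .sa files next to the FASTA. The function name check_bwa_indexes is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for every DATABASE line in the config, check that
# the five files written by `bwa index` exist for the given basename.
check_bwa_indexes() {
    local conf="$1" key name base ext status=0
    while read -r key name base _ || [ -n "$key" ]; do
        [ "$key" = "DATABASE" ] || continue   # skip comments and other settings
        for ext in amb ann bwt pac sa; do
            if [ ! -f "$base.$ext" ]; then
                echo "$name: missing $base.$ext"
                status=1
            fi
        done
    done < "$conf"
    return "$status"
}

# Usage:
#   check_bwa_indexes fastq_screen.conf && echo "all indexes present"
```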

The best place to download GRCh38 and GRCm39 is GENCODE. Some of the links from which I got the FASTA files are:

wget https://plasmodb.org/common/downloads/release-56/PvivaxP01/fasta/data/PlasmoDB-56_PvivaxP01_Genome.fasta
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M28/GRCm39.primary_assembly.genome.fa.gz
wget http://ftp.ensemblgenomes.org/pub/release-52/bacteria//fasta/bacteria_12_collection/escherichia_coli_gca_001606525/dna/Escherichia_coli_gca_001606525.ASM160652v1.dna.toplevel.fa.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec
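
The downloads still have to be decompressed, placed in the index directory, and indexed. A minimal sketch of that step follows, assuming bwa is on the PATH (for example after module load bwa). The function name index_genome and its argument order are made up for illustration. Note that UniVec downloads without an extension, while the config above refers to it as UniVec.fasta, so the target filename is passed explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: put a downloaded FASTA (gzipped or not) into a
# per-organism directory under a chosen name, then build a bwa index.
index_genome() {
    local src="$1" outdir="$2" name="$3"   # name: target FASTA filename
    mkdir -p "$outdir"
    case "$src" in
        *.gz) gunzip -c "$src" > "$outdir/$name" ;;
        *)    cp "$src" "$outdir/$name" ;;
    esac
    # bwa index writes its .amb/.ann/.bwt/.pac/.sa files beside the FASTA
    bwa index "$outdir/$name"
}

# Usage (paths are the placeholders from the config above):
#   index_genome UniVec /path_to_indexes_directory/Vectors UniVec.fasta
#   index_genome GRCm39.primary_assembly.genome.fa.gz \
#       /path_to_indexes_directory/GRCm39 GRCm39.primary_assembly.genome.fa
```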

  • Running the analysis in parallel on a cluster using
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --mem=500GB
#SBATCH --partition=batch
#SBATCH --cpus-per-task 28
#SBATCH -J fastq_screen
#SBATCH -o fastq_screen.out
#SBATCH -e fastq_screen.err
#SBATCH --time=20:00:00
#SBATCH --mail-user=rohit.XXXXXX@gmai.com
#SBATCH --mail-type=ALL

module load bwa/0.7.17/gnu-6.4.0

## "file" lists the FASTQ filenames, one per line
cat file | parallel -j 8 "fastq_screen --aligner bwa {}"
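
The run above relies on fastq_screen finding its configuration on its own. A variant that passes the config, thread count and output directory explicitly could look like the sketch below. The function build_screen_cmds is made up for illustration and assumes paths without whitespace; --conf, --threads and --outdir are regular fastq_screen options. It prints one command per input file, which GNU parallel can then execute line by line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one fastq_screen command per FASTQ file
# listed in a text file (one filename per line, no whitespace in paths).
build_screen_cmds() {
    local list="$1" conf="$2" outdir="$3" threads="$4" fq
    while IFS= read -r fq || [ -n "$fq" ]; do
        [ -n "$fq" ] || continue   # skip blank lines
        printf 'fastq_screen --aligner bwa --conf %s --threads %s --outdir %s %s\n' \
            "$conf" "$threads" "$outdir" "$fq"
    done < "$list"
}

# Usage: 8 concurrent jobs x 3 threads stays within the 28 CPUs requested above
#   build_screen_cmds file fastq_screen.conf screen_out 3 | parallel -j 8
```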

A general piece of advice for posts like this is to remove absolute paths and everything specific to your local infrastructure, such as the SLURM submission. New users are easily confused by these, and everyone has to adapt things to their local setup anyway.

Sorry!! Improved as suggested.