Question

Read Classification from ONT

0

Entering edit mode

27 days ago

maevalefeuvre • 0

Good morning,

I have multiple fastq.gz file generated using ONT and belonging to one sample. Each fastq.gz file is composed of many reads, and each read could belong to gene 1 or gene 2, or gene 3.

I have the reference sequence for each gene.

I want to classify each read and determining if it belongs to gene 1, gene 2, or gene 3.

The goal is to generate three fastq files: one fastq.gz file for all the reads that have gene 1, a second fastq file for all the reads that have gene 2, and a third fastq file for all the reads that have gene 3.

I am considering using:

- 1/ minimap2 to align the reads from each FASTQ file to the reference sequences for gene 1, gene 2, and gene 3 separately.

 - 2/ Converting the resulting SAM to BAM files to FASTQ format using samtools, filtering the reads based on their alignment to each gene.
 - 3/ Compress the resulting FASTQ files to FASTQ.gz format.

This is the script that I'm managing to use :

#!/bin/bash

#### Define input and output folders
input_dir="/path/to/input/fastq_files"
output_dir="/path/to/output"

#### path to reference sequences for gene 1, gene 2, and gene 3
gene1_ref="/path/to/reference_gene1.fasta"
gene2_ref="/path/to/reference_gene2.fasta"
gene3_ref="/path/to/reference_gene3.fasta"

#### Function to align reads to references and extract reads for each gene seperately
classify_reads() {
    local input_fastq="$1"
    local output_prefix="$2"

    ### align reads to reference sequences
    minimap2 -ax map-ont "$gene1_ref" "$input_fastq" > "$output_prefix"_gene1.sam
    minimap2 -ax map-ont "$gene2_ref" "$input_fastq" > "$output_prefix"_gene2.sam
    minimap2 -ax map-ont "$gene3_ref" "$input_fastq" > "$output_prefix"_gene3.sam

    ### Convert SAM to BAM and sort
    samtools view -Sb "$output_prefix"_gene1.sam | samtools sort -o "$output_prefix"_gene1.sorted.bam
    samtools view -Sb "$output_prefix"_gene2.sam | samtools sort -o "$output_prefix"_gene2.sorted.bam
    samtools view -Sb "$output_prefix"_gene3.sam | samtools sort -o "$output_prefix"_gene3.sorted.bam

    ### Converting BAM to FASTQ
    samtools fastq "$output_prefix"_gene1.sorted.bam | gzip > "$output_prefix"_gene1.fastq.gz
    samtools fastq "$output_prefix"_gene2.sorted.bam | gzip > "$output_prefix"_gene2.fastq.gz
    samtools fastq "$output_prefix"_gene3.sorted.bam | gzip > "$output_prefix"_gene3.fastq.gz

    ### delete intermediate files
    rm "$output_prefix"_gene*.sam
    rm "$output_prefix"_gene*.sorted.bam
}

### Process each input FASTQ file
for fastq_file in "$input_dir"/*.fastq.gz; do
    filename=$(basename "$fastq_file" .fastq.gz)
    output_prefix="$output_dir"/"$filename"

    classify_reads "$fastq_file" "$output_prefix"
done

what do you thing about my strategy ? does it correct ?

classification reads nanopore • 345 views

ADD COMMENT • link updated 27 days ago by dthorbur ★ 1.9k • written 27 days ago by maevalefeuvre • 0

score 1 · Answer 1 · 2024-04-05

1

Entering edit mode

27 days ago

dthorbur ★ 1.9k

Are these DNA or RNA samples? If they are the latter, you could use a standard RNAseq pipeline and toolset. If DNA, then your proposal could work, but you could streamline easily. The difficulty could arise depending on how similar genes are.

Firstly, and especially if they are similar reference sequences, you should add all reference sequences to the same file and map altogether, splitting by sample/replicate instead. This stops multiple alignment matches when you parse the output, especially for smaller fragments and/or if the the reference sequences are similar.

There are several easier ways. I don't understand why you go back to fastq if all you want is counts. You could parse the .sam alignment to find where each read is mapping. But tools like blast+ or use mmseqs easy-rbh, docs here, which will identify the best reciprocal best hit for each of your nanopore reads based on your reference sequences.

ADD COMMENT • link 27 days ago by dthorbur ★ 1.9k

0

Entering edit mode

thank you very much for all these informations,

but I need to generate 3 fastq.gz files , one file per gene.

based on this constat : my first strategy is complete wrong or I can use it in order to generates these 3 files?

ADD REPLY • link 27 days ago by maevalefeuvre • 0

1

Entering edit mode

You can use any of the methods indicated, and then split the original fastq based on what sequences hit each of the references. An adaptation of your initial method would work, but will be relatively slow and clunky compared to the mmseqs idea. But if you're not processing a lot of data it wouldn't make much difference overall. If you are going to adapt your original idea, merge all reference sequences into a single file to avoid the only major problem I see.

ADD REPLY • link 27 days ago by dthorbur ★ 1.9k