Question

File size increase after host read removal?

0

Entering edit mode

7 months ago

akazen • 0

I'm trying to filter out host reads from my shotgun metagenomics sequencing data using the Bowtie2/Samtools approach. After trimming my sequences with Trimmomatic, I ran this code using my institution's cluster, and found that the resulting fastq.gz files are LARGER than the starting ones? Shouldn't they be smaller since you're removing reads? Is there something wrong with my code?

The code in question:

module load bowtie/2.5.0
module load samtools/1.11

mkdir mouse_removal
mkdir mouse_removal/mapped_and_unmapped
mkdir mouse_removal/bothReadsUnmapped
mkdir mouse_removal/bothReadsUnmapped_sorted

NUM_THREADS=8

for sample in `awk '{print $1}' /scratch/g/jkirby/Matt_WGS_analysis_LK/Matt_samples.txt`
do
    if [ "$sample" == "sample" ]; then continue; fi

    R1s=`ls /scratch/g/jkirby/Matt_WGS_analysis_LK/01_QC_IU/$sample*QUALITY_PASSED_R1* | python -c 'import sys; print(",".join([x.strip() for x in sys.stdin.readlines()]))'`
    R2s=`ls /scratch/g/jkirby/Matt_WGS_analysis_LK/01_QC_IU/$sample*QUALITY_PASSED_R2* | python -c 'import sys; print(",".join([x.strip() for x in sys.stdin.readlines()]))'`

    echo $R1s
    echo $R2s
    echo $sample

    bowtie2 \
      --threads $NUM_THREADS \
      -x /scratch/g/jkirby/Landon_WGS_reanalysis/Mouse_C57BL_6NJ/mouse_C57BL_6NJ \
      -1 $R1s -2 $R2s \
      -S mouse_removal/mapped_and_unmapped/$sample.sam

    samtools view -bS mouse_removal/mapped_and_unmapped/$sample.sam > mouse_removal/mapped_and_unmapped/$sample.bam

    samtools view -b \
      -f 12 \
      -F 256 mouse_removal/mapped_and_unmapped/$sample.bam > mouse_removal/bothReadsUnmapped/$sample.bam

    samtools sort -n \
      -m 5G \
      -@ 2 mouse_removal/bothReadsUnmapped/$sample.bam -o mouse_removal/bothReadsUnmapped_sorted/$sample.bam

    samtools fastq \
      -@ 8 \
      mouse_removal/bothReadsUnmapped_sorted/$sample.bam  \
      -1 01_QC_IU_nomouse/$sample_host_removed_R1.fastq.gz  \
      -2 01_QC_IU_nomouse/$sample_host_removed_R2.fastq.gz  \
      -0 /dev/null \
      -s /dev/null \
      -n
done

metagenomics quality-control WGS • 1.2k views

ADD COMMENT • link updated 7 months ago by Ram 45k • written 7 months ago by akazen • 0

1

Entering edit mode

A few comments.

ls /scratch/g/jkirby/Matt_WGS_analysis_LK/01_QC_IU/$sample*QUALITY_PASSED_R1* | python -c 'import sys; print(",".join([x.strip() for x in sys.stdin.readlines()]))'`

safer:

find  /scratch/g/jkirby/Matt_WGS_analysis_LK/01_QC_IU/$sample*QUALITY_PASSED_R1 -maxdepth 1 -type f -name "*q.gz" | sort| paste -sd ','

you can reduce I/O by piping a tool into another

samtools view | samtools view | samtools sort | samtools fastq

samtools collate is faster than samtools sort if you want to use samtools fastq.
you should use a workflow manager like nextflow or snakemake

ADD REPLY • link 7 months ago by Pierre Lindenbaum 166k

score 2 · Answer 1 · 2025-01-22

2

Entering edit mode

7 months ago

Pierre Lindenbaum 166k

Shouldn't they be smaller since you're removing reads? Is there something wrong with my code?

could be a difference of compression. Check the number of reads rather than the size of the file

ADD COMMENT • link 7 months ago by Pierre Lindenbaum 166k

0

Entering edit mode

I checked the number of reads using grep -c ">" 01_QC_IU/fasta.fastq.gz, and there were more reads after the host filtering so now I'm just as confused as ever

ADD REPLY • link 7 months ago by akazen • 0

0

Entering edit mode

To count reads which was a question before:

Using seqkit (LINK) :

seqkit stats fastq.gz

Using bbmap (LINK) :

reformat.sh -Xmx2g in=file.fastq.gz

ADD REPLY • link 7 months ago by GenoMax 153k

0

Entering edit mode

samtools fastq -@ 8 mouse_removal/bothReadsUnmapped_sorted/$sample.bam

You are using samtools fastq so that will generate reads in fastq format. If you need fasta formatted reads then you need to use samtools fasta.

ADD REPLY • link 7 months ago by GenoMax 153k

0

Entering edit mode

fastq is what I want. I was just using fasta.fastq.gz as an example file name. Is the grep method I used incorrect?

ADD REPLY • link 7 months ago by akazen • 0

0

Entering edit mode

Yes it is incorrect. You are looking for > which is the start of fasta header.

ADD REPLY • link 7 months ago by GenoMax 153k

0

Entering edit mode

count number of lines in a compressed file:

gunzip -c 01_QC_IU/fasta.fastq.gz | wc -l
`

ADD REPLY • link 7 months ago by Pierre Lindenbaum 166k

0

Entering edit mode

I think you're right about it being a difference of compression - both

gunzip -c 01_QC_IU/fasta.fastq.gz | wc -l `

and seqkit showed there being less reads in the host filtered fastqs. Thank you so much for your help, I'm very relieved!

ADD REPLY • link 7 months ago by akazen • 0