File size increase after host read removal?
1
0
Entering edit mode
7 months ago
akazen • 0

I'm trying to filter out host reads from my shotgun metagenomics sequencing data using the Bowtie2/Samtools approach. After trimming my sequences with Trimmomatic, I ran this code using my institution's cluster, and found that the resulting fastq.gz files are LARGER than the starting ones? Shouldn't they be smaller since you're removing reads? Is there something wrong with my code?

The code in question:

module load bowtie/2.5.0
module load samtools/1.11

mkdir mouse_removal
mkdir mouse_removal/mapped_and_unmapped
mkdir mouse_removal/bothReadsUnmapped
mkdir mouse_removal/bothReadsUnmapped_sorted

NUM_THREADS=8

for sample in `awk '{print $1}' /scratch/g/jkirby/Matt_WGS_analysis_LK/Matt_samples.txt`
do
    if [ "$sample" == "sample" ]; then continue; fi

    R1s=`ls /scratch/g/jkirby/Matt_WGS_analysis_LK/01_QC_IU/$sample*QUALITY_PASSED_R1* | python -c 'import sys; print(",".join([x.strip() for x in sys.stdin.readlines()]))'`
    R2s=`ls /scratch/g/jkirby/Matt_WGS_analysis_LK/01_QC_IU/$sample*QUALITY_PASSED_R2* | python -c 'import sys; print(",".join([x.strip() for x in sys.stdin.readlines()]))'`

    echo $R1s
    echo $R2s
    echo $sample

    bowtie2 \
      --threads $NUM_THREADS \
      -x /scratch/g/jkirby/Landon_WGS_reanalysis/Mouse_C57BL_6NJ/mouse_C57BL_6NJ \
      -1 $R1s -2 $R2s \
      -S mouse_removal/mapped_and_unmapped/$sample.sam

    samtools view -bS mouse_removal/mapped_and_unmapped/$sample.sam > mouse_removal/mapped_and_unmapped/$sample.bam

    samtools view -b \
      -f 12 \
      -F 256 mouse_removal/mapped_and_unmapped/$sample.bam > mouse_removal/bothReadsUnmapped/$sample.bam

    samtools sort -n \
      -m 5G \
      -@ 2 mouse_removal/bothReadsUnmapped/$sample.bam -o mouse_removal/bothReadsUnmapped_sorted/$sample.bam

    samtools fastq \
      -@ 8 \
      mouse_removal/bothReadsUnmapped_sorted/$sample.bam  \
      -1 01_QC_IU_nomouse/$sample_host_removed_R1.fastq.gz  \
      -2 01_QC_IU_nomouse/$sample_host_removed_R2.fastq.gz  \
      -0 /dev/null \
      -s /dev/null \
      -n
done
metagenomics quality-control WGS • 1.2k views
ADD COMMENT
1
Entering edit mode
  • A few comments.
    ls /scratch/g/jkirby/Matt_WGS_analysis_LK/01_QC_IU/$sample*QUALITY_PASSED_R1* | python -c 'import sys; print(",".join([x.strip() for x in sys.stdin.readlines()]))'`
    

safer:

find  /scratch/g/jkirby/Matt_WGS_analysis_LK/01_QC_IU/$sample*QUALITY_PASSED_R1 -maxdepth 1 -type f -name "*q.gz" | sort| paste -sd ','
  • you can reduce I/O by piping a tool into another

samtools view | samtools view | samtools sort | samtools fastq

  • samtools collate is faster than samtools sort if you want to use samtools fastq.
  • you should use a workflow manager like nextflow or snakemake
ADD REPLY
2
Entering edit mode
7 months ago

Shouldn't they be smaller since you're removing reads? Is there something wrong with my code?

could be a difference of compression. Check the number of reads rather than the size of the file

ADD COMMENT
0
Entering edit mode

I checked the number of reads using grep -c ">" 01_QC_IU/fasta.fastq.gz, and there were more reads after the host filtering so now I'm just as confused as ever

ADD REPLY
0
Entering edit mode

To count reads which was a question before:

Using seqkit (LINK) :

seqkit stats fastq.gz

Using bbmap (LINK) :

reformat.sh -Xmx2g in=file.fastq.gz
ADD REPLY
0
Entering edit mode

samtools fastq -@ 8 mouse_removal/bothReadsUnmapped_sorted/$sample.bam

You are using samtools fastq so that will generate reads in fastq format. If you need fasta formatted reads then you need to use samtools fasta.

ADD REPLY
0
Entering edit mode

fastq is what I want. I was just using fasta.fastq.gz as an example file name. Is the grep method I used incorrect?

ADD REPLY
0
Entering edit mode

Yes it is incorrect. You are looking for > which is the start of fasta header.

ADD REPLY
0
Entering edit mode

count number of lines in a compressed file:

gunzip -c 01_QC_IU/fasta.fastq.gz | wc -l
`
ADD REPLY
0
Entering edit mode

I think you're right about it being a difference of compression - both

gunzip -c 01_QC_IU/fasta.fastq.gz | wc -l `

and seqkit showed there being less reads in the host filtered fastqs. Thank you so much for your help, I'm very relieved!

ADD REPLY

Login before adding your answer.

Traffic: 4451 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6