I'm trying to filter out host reads from my shotgun metagenomics sequencing data using the Bowtie2/Samtools approach. After trimming my sequences with Trimmomatic, I ran this code using my institution's cluster, and found that the resulting fastq.gz files are LARGER than the starting ones? Shouldn't they be smaller since you're removing reads? Is there something wrong with my code?
The code in question:
module load bowtie/2.5.0
module load samtools/1.11
mkdir mouse_removal
mkdir mouse_removal/mapped_and_unmapped
mkdir mouse_removal/bothReadsUnmapped
mkdir mouse_removal/bothReadsUnmapped_sorted
NUM_THREADS=8
for sample in `awk '{print $1}' /scratch/g/jkirby/Matt_WGS_analysis_LK/Matt_samples.txt`
do
if [ "$sample" == "sample" ]; then continue; fi
R1s=`ls /scratch/g/jkirby/Matt_WGS_analysis_LK/01_QC_IU/$sample*QUALITY_PASSED_R1* | python -c 'import sys; print(",".join([x.strip() for x in sys.stdin.readlines()]))'`
R2s=`ls /scratch/g/jkirby/Matt_WGS_analysis_LK/01_QC_IU/$sample*QUALITY_PASSED_R2* | python -c 'import sys; print(",".join([x.strip() for x in sys.stdin.readlines()]))'`
echo $R1s
echo $R2s
echo $sample
bowtie2 \
--threads $NUM_THREADS \
-x /scratch/g/jkirby/Landon_WGS_reanalysis/Mouse_C57BL_6NJ/mouse_C57BL_6NJ \
-1 $R1s -2 $R2s \
-S mouse_removal/mapped_and_unmapped/$sample.sam
samtools view -bS mouse_removal/mapped_and_unmapped/$sample.sam > mouse_removal/mapped_and_unmapped/$sample.bam
samtools view -b \
-f 12 \
-F 256 mouse_removal/mapped_and_unmapped/$sample.bam > mouse_removal/bothReadsUnmapped/$sample.bam
samtools sort -n \
-m 5G \
-@ 2 mouse_removal/bothReadsUnmapped/$sample.bam -o mouse_removal/bothReadsUnmapped_sorted/$sample.bam
samtools fastq \
-@ 8 \
mouse_removal/bothReadsUnmapped_sorted/$sample.bam \
-1 01_QC_IU_nomouse/$sample_host_removed_R1.fastq.gz \
-2 01_QC_IU_nomouse/$sample_host_removed_R2.fastq.gz \
-0 /dev/null \
-s /dev/null \
-n
done
safer:
samtools view | samtools view | samtools sort | samtools fastq