more reads in metagenomic samples after 'removing host reads'
1
0
Entering edit mode
7 months ago
sapuizait ▴ 10

the title says it all - this is driving me nuts

I have Illumina shotgun data from cow manure and I am trying to remove the cow reads.

I have downloaded the latest cow genome and used it as reference to map any cow reads and keep everything that doesn't map.

I tried 2 methods:

#1 directly with Bowtie2
$bowtie2 -p 16 -x /cow_genome_db -1 *_R1_001.trim.fq.gz -2 *_R2_001.trim.fq.gz --un-conc reads_not_host
$ mv reads_not_host.1 reads_not_host.1.fq
$ mv reads_not_host.2 reads_not_host.2.fq

#2 using Bowtie2 and samtools
$bowtie2 -p 16 -x cow_genome_db -1 *_R1_001.trim.fq.gz -2 *_R2_001.trim.fq.gz -S ALL.sam
$samtools view -bS ALL.sam > ALL.bam
$samtools view -b -f 12 -F 256 ALL.bam > both_unmapped.bam
$samtools sort -n -m 5G -@ 2 both_unmapped.bam -o both_unmapped_sorted.bam
$samtools fastq -@ 8 both_unmapped_sorted.bam -1 SAMPLE_host_removed_R1.fastq.gz -2 SAMPLE_host_removed_R2.fastq.gz

According to the log

41565230 reads; of these:
  41565230 (100.00%) were paired; of these:
    41346602 (99.47%) aligned concordantly 0 times
    128863 (0.31%) aligned concordantly exactly 1 time
    89765 (0.22%) aligned concordantly >1 times
    ----
    41346602 pairs aligned concordantly 0 times; of these:
      3035 (0.01%) aligned discordantly 1 time
    ----
    41343567 pairs aligned 0 times concordantly or discordantly; of these:
      82687134 mates make up the pairs; of these:
        81906525 (99.06%) aligned 0 times
        7953 (0.01%) aligned exactly 1 time
        772656 (0.93%) aligned >1 times
1.47% overall alignment rate

But when I check the number of reads in each file before and after (wc -l /4) - there is more reads after removing the host!!!

2379013 before.fq.1.gz
2370650 before.fq.2.gz

using the first method

2410958 after.fq.1.gz
2407276 after.fq.2.gz

using the second method

2415432 after.fq.1.gz
2529372 after.fq.1.gz

So what is going on?

bowtie2 samtools • 1.1k views
ADD COMMENT
0
Entering edit mode

more reads in metagenomic samples

It is not logical to get more reads as you realize.

What you are likely looking at is presence of duplicate reads because of the secondary/supplementary alignments present in your BAM files. You can filter these reads using -F 2048 SAM flag to remove those duplicates (see Remove supplementary alignments from a bam file ) .

ADD REPLY
0
Entering edit mode

thanks, will give it a try and write back

ADD REPLY
0
Entering edit mode

OK,

Right after getting the unmapped.bam and before sorting the bam I added an extra step to remove duplicate reads

samtools view -b -F 2048 both_unmapped.bam > both_unmapped_new.bam

but I got the exact same number of reads in the end...

So either there weren't any duplicate alignments, or I didn't remove them?

ADD REPLY
0
Entering edit mode

Can you check

samtools view your.bam | awk -F "\t" '{print $1}' | uniq -c | sort -k1,1nr | less 

Most reads should have 2 counts. Those that have more than 2 are ones you need to investigate.

ADD REPLY
0
Entering edit mode

yeap, as far I can see all of them have exactly 2 counts (sorted.bam) - there is sth weird going on here but not sure what

ADD REPLY
0
Entering edit mode

If you are willing to try an alternative then I suggest you try using bbsplit.sh ( BBSplit syntax for generating builds for the reference genome and how to call different builds. ) or seal.sh (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/seal-guide/ ) from BBMap suite to bin the reads.

ADD REPLY
0
Entering edit mode

thanks for the suggestions - will try all this! At this point this is purely curiosity, I know the host reads are very very few that will only make a negligible difference but I d like to find out whats going on...

ADD REPLY
1
Entering edit mode
7 months ago
GenoMax 141k

Only thing I can think of is your (wc -l) estimate is probably incorrect.

Can you run bbduk.sh -Xmx2g in=before.fq.1.gz and also for after.fq.1.gz to check read numbers output in stats that should be written to screen? This operation is simply reading the file and not doing anything else.

ADD COMMENT
0
Entering edit mode

thats it - 'wc -l' does not offer a good estimate! according to 'bbduk' there are:

before

41565230 reads

6108475655 bases

after

40582315 reads

5978684848 bases

ADD REPLY

Login before adding your answer.

Traffic: 2612 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6