I have assembled paired-end reads (illumina strand-specific) using trinity with some different settings, and then merged these assemblies and done some filtering. I now want to assemble the un-mapped reads (pairs that do not map concordantly), and I therefore mapped the reads to the assembly using bowtie2/2.2.9.
This was the command:
(bowtie2 -p 16 --un-conc-gz all_unmapped_R%.fq.gz --no-unal -k 20 -x merged3_may2018_corset.fasta -q -1 all_R1.fq -2 all_R2.fq | samtools view -@16 -Sb -o merged3_may2018_corset_bowtie2.bam) 3>&1 1>&2 2>&3 | tee merged3_may2018_corset.align_stats
Which gave the following statistics:
954299926 reads; of these: 954299926 (100.00%) were paired; of these: **211582978** (22.17%) aligned concordantly 0 times 185228876 (19.41%) aligned concordantly exactly 1 time 557488072 (58.42%) aligned concordantly >1 times ---- 211582978 pairs aligned concordantly 0 times; of these: 11851194 (5.60%) aligned discordantly 1 time ---- 199731784 pairs aligned 0 times concordantly or discordantly; of these: 399463568 mates make up the pairs; of these: 196662163 (49.23%) aligned 0 times 48139061 (12.05%) aligned exactly 1 time 154662344 (38.72%) aligned >1 times 89.70% overall alignment rate
Based on this, I thought there were 211,582,978 read pairs that did not align concordantly, and that these pairs would be separated into an R1 and an R2 file, thus with 211,582,978 reads in each.
However, when counting the lines in the files and dividing by four, I get that there are 620,381,610 reads in R1 and 620,027,677 in R2...
So not only are there three times more un-mapped reads than I thought, and they have not been output as proper pairs :-/
Obviously, it makes no sense to assemble these un-mapped reads when they constitute 2/3 of all the reads...
Does anyone have any idea why there is this discrepancy? And how do I solve it?