Question: bowtie2 unmapped reads
0
gravatar for Sjannie Lefevre
16 months ago by
Department of Biosciences, University of Oslo, Norway
Sjannie Lefevre30 wrote:

Hi :)

I have assembled paired-end reads (illumina strand-specific) using trinity with some different settings, and then merged these assemblies and done some filtering. I now want to assemble the un-mapped reads (pairs that do not map concordantly), and I therefore mapped the reads to the assembly using bowtie2/2.2.9.

This was the command:

(bowtie2 -p 16 --un-conc-gz all_unmapped_R%.fq.gz --no-unal -k 20 -x merged3_may2018_corset.fasta -q -1 all_R1.fq -2 all_R2.fq | samtools view -@16 -Sb -o merged3_may2018_corset_bowtie2.bam) 3>&1 1>&2 2>&3 | tee merged3_may2018_corset.align_stats

Which gave the following statistics:

954299926 reads; of these:
  954299926 (100.00%) were paired; of these:
    **211582978** (22.17%) aligned concordantly 0 times
    185228876 (19.41%) aligned concordantly exactly 1 time
    557488072 (58.42%) aligned concordantly >1 times
    ----
    211582978 pairs aligned concordantly 0 times; of these:
      11851194 (5.60%) aligned discordantly 1 time
    ----
    199731784 pairs aligned 0 times concordantly or discordantly; of these:
      399463568 mates make up the pairs; of these:
        196662163 (49.23%) aligned 0 times
        48139061 (12.05%) aligned exactly 1 time
        154662344 (38.72%) aligned >1 times
89.70% overall alignment rate

Based on this, I thought there were 211,582,978 read pairs that did not align concordantly, and that these pairs would be separated into an R1 and an R2 file, thus with 211,582,978 reads in each.

However, when counting the lines in the files and dividing by four, I get that there are 620,381,610 reads in R1 and 620,027,677 in R2...

So not only are there three times more un-mapped reads than I thought, and they have not been output as proper pairs :-/

Obviously, it makes no sense to assemble these un-mapped reads when they constitute 2/3 of all the reads...

Does anyone have any idea why there is this discrepancy? And how do I solve it?

Cheers Sjannie

assembly rna-seq bowtie2 • 1.3k views
ADD COMMENTlink modified 16 months ago by Devon Ryan92k • written 16 months ago by Sjannie Lefevre30
3

Apparantly, it is an issue with the -k > 1 and the slightly old version (2.2.9) of bowtie2 that I am using: https://github.com/BenLangmead/bowtie2/issues/37

ADD REPLYlink modified 16 months ago • written 16 months ago by Sjannie Lefevre30

However, when counting the lines in the files and dividing by four, I get that there are 620,381,610 reads in R1 and 620,027,677 in R2...

Are you counting the lines of the gzipped file ? If it is compressed, then the number of lines is not directly related to the number of reads. To make sure you are counting right, you can also use fastqc on your .fq.gz files.

ADD REPLYlink written 16 months ago by Carlo Yague4.7k

I counted the lines of the unzipped file. When I do the same for the original file, I get the correct number of reads, 954,299,926, for both R1 and R2. So I do not think that the counting method is the problem. Also, the files with all reads are 273Gb, while the files with the un-mapped reads are 177-178Gb, which is what alerted be to the problem in the first place.

But thanks anyway :)

ADD REPLYlink modified 16 months ago • written 16 months ago by Sjannie Lefevre30

Here is the stats reported by samtools flagstat:

8,099,514,442 + 0 in total (QC-passed reads + QC-failed reads)
6,387,576,753 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
8,099,514,442 + 0 mapped (100.00% : N/A)
1,711,937,689 + 0 paired in sequencing
857,336,664 + 0 read1
854,601,025 + 0 read2
1,485,433,896 + 0 properly paired (86.77% : N/A)
1,660,538,790 + 0 with itself and mate mapped
51,398,899 + 0 singletons (3.00% : N/A)
132,778,294 + 0 with mate mapped to a different chr
48,923,368 + 0 with mate mapped to a different chr (mapQ>=5)

954,299,926 minus 1,485,433,896 / 2 = 211,582,978, which is the number given by the bowtie stats also (it would have been surprising of course if they differede...).

I will try and re-run the bowtie2 command without --no-unal, and then capture unmapped reads from the bam file, using -F 2 to exclude only properly paired alignments. Based on the samtools stats, that really should give the correct number of unmapped reads - right?

Though I still do not understand why it is not the correct number from bowtie2 in the first place...

ADD REPLYlink modified 16 months ago • written 16 months ago by Sjannie Lefevre30

Based on the samtools stats, that really should give the correct number of unmapped reads - right?

It would be interesting to compare the number of unmapped reads (-F 4) and the number of not properly aligned pairs (-F 2).

ADD REPLYlink written 16 months ago by Carlo Yague4.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2251 users visited in the last hour