Question: One library, two sequencing runs, two different alignment efficiencies?
1
gravatar for twrzes
3.7 years ago by
twrzes20
United Kingdom
twrzes20 wrote:

Dear BioStars Community,

I have a problem with an alignment to the transcriptome.

I have 8 RNA-Seq libraries sequenced on Illumina HiScanSQ system in one lane (2x100bp, paired-end) per sequencing run. These 8 libraries (1 pool) were put into two sequencing runs to obtain a decent number of reads. After demultiplexing (using bcl2fastq-1.8.4) the reads were trimmed using TrimGalore and aligned to the previously assembled transcriptome (because there is no reference genome for the organism - Pinus sylvestris - I am trying to analyze...) by Bowtie2-2.2.6. In the case of 7 libraries there were almost no difference in the alignment efficiency (~85-95%, with ~60-85% of uniquely mapped reads), but in case of one library something strange happened:

Run1:

11109859 reads; of these:
  11109859 (100.00%) were paired; of these:
    1701658 (15.32%) aligned concordantly 0 times
    6666961 (60.01%) aligned concordantly exactly 1 time
    2741240 (24.67%) aligned concordantly >1 times
    ----
    1701658 pairs aligned concordantly 0 times; of these:
      11078 (0.65%) aligned discordantly 1 time
    ----
    1690580 pairs aligned 0 times concordantly or discordantly; of these:
      3381160 mates make up the pairs; of these:
        3218192 (95.18%) aligned 0 times
        86430 (2.56%) aligned exactly 1 time
        76538 (2.26%) aligned >1 times
85.52% overall alignment rate

Run2:

14719563 reads; of these:
  14719563 (100.00%) were paired; of these:
    7641835 (51.92%) aligned concordantly 0 times
    4991995 (33.91%) aligned concordantly exactly 1 time
    2085733 (14.17%) aligned concordantly >1 times
    ----
    7641835 pairs aligned concordantly 0 times; of these:
      7874 (0.10%) aligned discordantly 1 time
    ----
    7633961 pairs aligned 0 times concordantly or discordantly; of these:
      15267922 mates make up the pairs; of these:
        15039673 (98.51%) aligned 0 times
        94443 (0.62%) aligned exactly 1 time
        133806 (0.88%) aligned >1 times
48.91% overall alignment rate

So my question is: what should I do to find out what went wrong? I excluded (maybe too soon...) the human error because this was the same pool used for two runs (from one Eppendorf tube).

I also did FastQC on demultiplexed and trimmed reads - links for this library with low alignment efficiency are provided below:

Run1 (the good one), demultiplexed: http://twrzes.wtvk.pl/run1_R1_fastqc.html and http://twrzes.wtvk.pl/run1_R2_fastqc.html

Run1, after trimming: http://twrzes.wtvk.pl/run1_R1_trimmed_fastqc.html and http://twrzes.wtvk.pl/run1_R2_trimmed_fastqc.html

Run2 (the bad one), demultiplexed: http://twrzes.wtvk.pl/run2_R1_fastqc.html and http://twrzes.wtvk.pl/run2_R2_fastqc.html

Run2, after trimming: http://twrzes.wtvk.pl/run2_R1_trimmed_fastqc.html and http://twrzes.wtvk.pl/run2_R2_trimmed_fastqc.html

Command-line commands I used for:

1) Demultiplexing:

/path/to/configureBclToFastq.pl --input-dir /path/to/folder/with/BCLs/Data/Intensities/BaseCalls --output-dir /path/to/folder/with/BCLs/Unaligned --sample-sheet /path/to/folder/with/BCLs/sample-sheet.csv --fastq-cluster-count 0 --mismatches 1 --with-failed-reads

2) Trimming (TrimGalore-0.4.0, a wrapper for cutadapt-1.8.3):

trimgalore --paired --quality 20 --illumina --stringency 1 -e 0.2 --length 40 -o /path/to/trimmed/fastq --trim1 run1_R1.fastq run1_R2.fastq

3) Alignment (Bowtie2-2.2.6)

bowtie2 -p 12 -I 0 -X 2000 --dovetail --very-sensitive-local -N 1 -x /path/to/index/index -1 run1_R1_trimmed.fastq -2 run1_R2_trimmed.fastq -S /path/to/aligned/sam/run1.sam

If you need any additional info, I would be more than happy to provide it.

Thank you very much for your efforts on solving this problem.

Kind regards,
Tomasz Wrzesinski

--
Tomasz Wrzesinski, MSc
PhD Student
Laboratory of High Throughput Technologies
Institute of Molecular Biology and Biotechnology
Faculty of Biology
Adam Mickiewicz University in Poznan
Umultowska 89/1.117
61-614 Poznan, Poland
tel. +48 61 829 5833
e-mail: twrzes@amu.edu.pl

rna-seq next-gen alignment • 1.5k views
ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by twrzes20

You mixed your links: "run1, after trimming" actually points to raw run2, and "run2, demultiplexed" points to run1 after trimming.

ADD REPLYlink written 3.7 years ago by h.mon25k

I am very sorry for that, I edited my post so now everything should be OK.

Thank you for pointing out my mistake.

Kind regards,

Tomasz Wrzesinski

ADD REPLYlink written 3.7 years ago by twrzes20

Did you run FastQC before and after read cleaning of both runs?

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Irsan6.9k

Yes, I did, links are provided in my post (below Bowtie2 reports).

Kind regards,

Tomasz Wrzesinski

ADD REPLYlink written 3.7 years ago by twrzes20
2
gravatar for h.mon
3.7 years ago by
h.mon25k
Brazil
h.mon25k wrote:

My first approach would be assemble the run2 unmapped reads and blast the contigs, looking for contaminants. Alternatively, you could blast a sample of the unmapped raw reads. It may be interesting to assemble and map run1 unmapped reads as well, as a control.

Are you using a stranded or unstranded library preparation protocol?

Looking at the FastQC reports, both runs seems just fine. The only suspicious thing I noticed is GC content seems to be slightly different (1%-2%) between runs, and on run2 %A is consistently higher than %T.

ADD COMMENTlink written 3.7 years ago by h.mon25k

Dear h.mon,

Thank you very much for your answer. I will perform the assembly on unmapped reads to search for contaminants. However, if this was the contamination of a library, shouldn't it be visible in these two runs?

This is an unstranded protocol.

I checked other samples for this phenomenon and the difference you wrote about is visible only in the case of this library - remaining 7 libraries look the same when looking at the base %. This is strange...

Thank you very much for once again for your time.

Kind regards,
Tomasz Wrzesinski

ADD REPLYlink written 3.7 years ago by twrzes20
0
gravatar for twrzes
3.7 years ago by
twrzes20
United Kingdom
twrzes20 wrote:

Dear h.mon,

I haven't done the assembly yet, but I ran few reads through BLAST and these reads came from Homo sapiens.

After a consultation with a person who actually is responsible for running our sequencer it turned out that she put additional sample (human DNA), but she didn't see that indices for the sample I am analyzing and this human DNA library are identical. Therefore, apparently, the base composition of each read was different.

Thank you once again for your tremendous help - I really appreciate your input.

Kind regards,

Tomasz

ADD COMMENTlink written 3.7 years ago by twrzes20
1

You are welcome. So now you have an easy solution, map run2 against the human genome, the unmapped reads should be mostly from your species.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by h.mon25k

I will definitely follow your advice - thank you once again for your input.

Kind regards,

Tomasz

ADD REPLYlink written 3.7 years ago by twrzes20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1888 users visited in the last hour