Question

Too low mapping percentage using HISAT2 on human reference genome.

3

6.6 years ago

lakhujanivijay 5.8k

Hi all,

I am working with RNA-seq data from human cell lines treated with different drugs, sequenced on Illumina (2x150 bp chemistry). The data generated is 11-14 Gb.

I am using HISAT2 with default parameters to align against the human genome (hg38 build, downloaded from the Ensembl database). To my surprise, the alignment percentage is very low (~2%) for all the samples.

I have the following observations regarding the data quality :

Data quality (Phred score) is good, i.e. in the range of 32-40.
Illumina adapters were already removed using Trimmomatic. No other contaminants.
The FastQC metric 'Per base sequence content' fails for all samples. More specifically, the A, T, G, C percentages deviate considerably at the first 10 base positions.

Questions

What could possibly be wrong? Is there any other QC I need to reconsider?
More importantly, this is cell-line data, so should I map to a cell-line-specific reference, or is the standard hg38 human reference sufficient?

hisat2 genome cell line • 6.7k views

ADD COMMENT • link 6.6 years ago by lakhujanivijay 5.8k

3

Hi Vijay,

Can you BLAST a few reads manually and confirm that there was no issue with de-multiplexing or sample labelling? Not using a cell-line-specific reference should not be an issue. You can also try soft-trimming the first 10 bases during alignment. It's not unusual.

Best

ADD REPLY • link 6.6 years ago by Satyajeet Khare ★ 1.6k

0

Thanks Satya! I am shuffling the reads and will BLAST a few thousand to check for this.

ADD REPLY • link 6.6 years ago by lakhujanivijay 5.8k

2

With 2% alignment you don't need a few thousand. Take 10 and that should be plenty to diagnose the issue with BLAST.
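For anyone wanting to script the "take 10 reads and BLAST them" check, here is a minimal Python sketch. It assumes an uncompressed, 4-line-per-record FASTQ; the function names `read_fastq` and `sample_reads_as_fasta` are made up for illustration. It reservoir-samples a handful of reads and returns FASTA text that can be pasted into a BLAST search.

```python
import random
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (read_name, sequence) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            break  # end of file (or a trailing blank line)
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        handle.readline()  # quality line, ignored
        # Drop the leading '@' and anything after the first whitespace.
        yield header.strip()[1:].split()[0], seq


def sample_reads_as_fasta(handle: TextIO, n: int = 10, seed: int = 0) -> str:
    """Reservoir-sample up to n reads and return them as FASTA text."""
    rng = random.Random(seed)
    reservoir: List[Tuple[str, str]] = []
    for i, record in enumerate(read_fastq(handle)):
        if i < n:
            reservoir.append(record)
        else:
            # Keep each later read with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = record
    return "".join(f">{name}\n{seq}\n" for name, seq in reservoir)
```

Sampling across the whole file, rather than taking the first 10 records, avoids relying only on reads from the edge of the flow cell, which are sometimes lower quality.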

ADD REPLY • link 6.6 years ago by GenoMax 141k

1

It appears to be a clear case of contamination rather confusion.

Evidence 1: BLASTing the reads against the NCBI nt database suggests it's hamster, maybe Chinese or golden hamster.

Evidence 2: Mapping against hamster genome gave ~80 % mapping for all samples with HISAT2.

The last thing under investigation is BLASTing the scaffolds of an assembly from one of the samples. If that also hits hamster, which I am 90% sure it will, I will have to trace back the history! :)

ADD REPLY • link 6.6 years ago by lakhujanivijay 5.8k

1

Ouch! Maybe time to start running FastQ Screen routinely.. :)

ADD REPLY • link 6.6 years ago by Phil Ewels ★ 1.4k

1

Or BBSketch, which only takes a couple seconds :)

ADD REPLY • link 6.6 years ago by Brian Bushnell 20k

0

Mapping against hamster genome gave ~80 % mapping for all samples with HISAT2.

Yikes! So the cell line is actually hamster then?

ADD REPLY • link 6.6 years ago by GenoMax 141k

1

Looks like CHO cell line contamination.

ADD REPLY • link 6.6 years ago by Satyajeet Khare ★ 1.6k

0

Another thing I wanted to add is the following information which I found in the STAR aligner documentation:

It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes.

I have in fact kept only the sequences of chr1-22, X, Y and mitochondrial DNA in my reference genome file. Could that possibly be the reason? In the meantime, I am trying to align the reads to the original reference file to see how it goes!

ADD REPLY • link 6.6 years ago by lakhujanivijay 5.8k

1

I would be very surprised if this has much effect on your alignments. These extra scaffolds are recommended to "soak up" reads which would otherwise misalign, but they shouldn't account for more than a few percent of your library. That can be a problem for downstream analysis, but won't result in a global 2% alignment rate. Even aligning human data against the mouse reference genome can give more alignments than that ;)

ADD REPLY • link 6.6 years ago by Phil Ewels ★ 1.4k

0

HAHAHAHA :D

Adding another twist here. What I have used here is the toplevel file and NOT the primary assembly file. I am looking at this blog and what I found is :

"Primary assembly" does not contain the haplotype information. The top level file contains additional sequences that are relatively common variants to the reference. Most mappers available now don't specifically handle these haplo sequences and as such they will appear as simply another contig, therefore complicating the alignment. Perhaps in future, mappers might be better able to handle these hypervariable regions.

So, I am trying the mapping with the primary assembly file as well. Let's see!

ADD REPLY • link 6.6 years ago by lakhujanivijay 5.8k

1

I don't think the toplevel file is an issue either. I have been using the toplevel file for one model organism and have never suffered from low alignment.

ADD REPLY • link 6.6 years ago by Satyajeet Khare ★ 1.6k

1

It's probably worth trying the 5' trimming approach from my answer below first.. ;) I'm pretty sure that'll fix your problems and it's not to do with your reference.

ADD REPLY • link 6.6 years ago by Phil Ewels ★ 1.4k

0

+1 Phil. Indeed, I just confirmed that it does not make much difference. Prima facie, it looks like there is indeed some issue with the data itself: probable contamination, which I am going to confirm further.

ADD REPLY • link 6.6 years ago by lakhujanivijay 5.8k

0

A short answer to my second question is that yes, I can use the human reference itself. I found a good paper based on a similar study here.

ADD REPLY • link 6.6 years ago by lakhujanivijay 5.8k

0

While I know that Bowtie2 is not a suitable aligner for RNA-seq data, I am getting ~25% mapping to the human reference in this case. How can we explain this?

ADD REPLY • link 6.6 years ago by lakhujanivijay 5.8k

1

RNA-seq data has reads that span exon-exon boundaries. BowTie will not align these reads, so you'll only get back alignments that lie entirely within an exon. Tophat / STAR etc are "splice aware" aligners, so work with RNA-seq data.
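To see how much of a library actually spans exon-exon boundaries, one rough check is to count alignments whose CIGAR string contains an `N` (skipped reference region), which is how splice-aware aligners report introns in SAM output. The sketch below assumes plain-text SAM lines as input; `count_spliced` is a name made up for this example, and it counts primary alignment records, so the two mates of a pair are counted separately.

```python
import re
from typing import Iterable, Tuple

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_spliced(sam_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (mapped_records, spliced_records) for primary alignments."""
    mapped = spliced = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag, cigar = int(fields[1]), fields[5]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800).
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        mapped += 1
        if any(op == "N" for _, op in CIGAR_OP.findall(cigar)):
            spliced += 1
    return mapped, spliced
```

A non-splice-aware aligner will report essentially no `N` operations, which is one way to see what fraction of reads it is leaving unaligned or soft-clipped.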

ADD REPLY • link 6.6 years ago by Phil Ewels ★ 1.4k

1

bowtie v.1 does ungapped alignments as opposed to bowtie v.2 which does gapped alignments.

With that 25%, it is possible that you are only seeing alignments to regions that are homologous between human and hamster. You could use bbsplit.sh from BBMap with human/hamster genomes to bin your reads (discarding those that map to both genomes). This may be more of a diagnostic exercise.
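The binning idea can be illustrated with plain set logic. This is not a reimplementation of bbsplit.sh, which does its own mapping and assignment; it is only a sketch that assumes the same reads have already been aligned separately to each genome, producing two SAM files. The names `mapped_read_names` and `bin_reads` are hypothetical.

```python
from typing import Dict, Iterable, Set


def mapped_read_names(sam_lines: Iterable[str]) -> Set[str]:
    """Collect names of reads with at least one mapped alignment record."""
    names: Set[str] = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        if not int(fields[1]) & 0x4:  # 0x4 = read unmapped
            names.add(fields[0])
    return names


def bin_reads(human_sam: Iterable[str], hamster_sam: Iterable[str]) -> Dict[str, Set[str]]:
    """Split read names into human-only, hamster-only and ambiguous bins."""
    human = mapped_read_names(human_sam)
    hamster = mapped_read_names(hamster_sam)
    return {
        "human_only": human - hamster,
        "hamster_only": hamster - human,
        "ambiguous": human & hamster,  # map to both genomes
    }
```

If most of the reads that map to human land in the `ambiguous` bin, that would be consistent with the 25% coming from regions conserved between the two species.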

If this data was supposed to be from a human cell line then this experiment is a lost cause. You (or someone else) are going to have to start backtracking (sequencing --> library prep --> sample submission --> RNA prep --> cell lines) to identify where the problem may lie. It could be at any one of these steps, unless the original cell line can be quickly identified as the root cause.

ADD REPLY • link 6.6 years ago by GenoMax 141k

score 2 · Accepted Answer · 2017-09-05

Hi Vijay,

Do you know how the library was prepared? Strong deviation in the nucleotide composition for the first 10 bases sounds very much like an adapter sequence. Some library preparations (e.g. the Clontech SMARTer Pico kit, amongst others) require trimming from the 5' end of sequences before alignment. These adapters are different from the usual Illumina sequencing adapters and will not be removed by default by most trimming tools.

If you know the library preparation kit, you can look up the kit's documentation and you'll usually find instructions. For example, from the Pico docs:

IMPORTANT: The first three nucleotides of the first sequencing read (Read 1) are derived from the template-switching oligo. These three nucleotides must be trimmed prior to mapping.

But, as you already have the nucleotide content plot, you can do without this and just remove the first 10 bases from the 5' end of each read. Then your alignment rates should hopefully improve.
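As a rough illustration of the 5' trimming step, here is a small Python sketch that removes the first `n_bases` from both the sequence and quality lines of each FASTQ record. It assumes uncompressed 4-line records and reads longer than `n_bases` (true for 150 bp reads); `trim_five_prime` is a name invented for this example. In practice a dedicated trimmer, or HISAT2's own `-5/--trim5` option, would normally be used instead.

```python
from typing import Iterator, TextIO


def trim_five_prime(handle: TextIO, n_bases: int = 10) -> Iterator[str]:
    """Yield FASTQ records with the first n_bases removed from the 5' end."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        # Sequence and quality must be trimmed by the same amount
        # so the record stays valid.
        yield f"{header}\n{seq[n_bases:]}\n{plus}\n{qual[n_bases:]}\n"
```

For paired-end data, both files need the same treatment, and every read is kept so that the pairing between the two files is preserved.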

I hope this helps!

Phil

score 2 · Accepted Answer · 2017-09-06

2

6.6 years ago

h.mon 35k

I've seen cell lines being contaminated or swapped by other lines - in the "good" cases by cell lines from other species (good because when this happens it is really easy to detect). Your low mapping rate suggests you are mapping to the wrong species. Centrifuge is really handy for detecting this kind of contamination; the problem is downloading or building an nt index.

ADD COMMENT • link 6.6 years ago by h.mon 35k

1

Whilst I agree that contamination with another species is a concern, I can't see how this would affect the base composition for the first 10 bp of each read?

ADD REPLY • link 6.6 years ago by Phil Ewels ★ 1.4k

1

Yes, this is a different issue. I would guess less-than-random hexamer priming.

ADD REPLY • link 6.6 years ago by h.mon 35k

1

Ah yes, absolutely. I guess it depends on how extreme the base composition wobbling is.. I was imagining 100% for each base but on re-reading the OP it could well be hexamer priming as you say.
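To tell the two cases apart without relying on the plot alone, the per-position base fractions can be computed directly. The sketch below takes read sequences (for example from a sample of the FASTQ) and returns the fraction of A, C, G and T at each of the first few positions; `per_base_composition` is a name made up for this example. A position sitting near 100% for one base points towards a fixed oligo or adapter, while a milder skew over the first several positions is the pattern usually attributed to hexamer priming bias.

```python
from collections import Counter
from typing import Dict, Iterable, List


def per_base_composition(seqs: Iterable[str], n_positions: int = 15) -> List[Dict[str, float]]:
    """Return, for each of the first n_positions, the fraction of A/C/G/T."""
    counts = [Counter() for _ in range(n_positions)]
    for seq in seqs:
        for pos, base in enumerate(seq[:n_positions].upper()):
            counts[pos][base] += 1
    fractions = []
    for counter in counts:
        # The total includes any N calls, so fractions may sum to < 1.
        total = sum(counter.values())
        fractions.append(
            {b: (counter[b] / total if total else 0.0) for b in "ACGT"}
        )
    return fractions
```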

ADD REPLY • link 6.6 years ago by Phil Ewels ★ 1.4k

0

I've seen cell lines being contaminated or swapped by other lines

Dear h.mon,

How common is this observation? What is the most suitable explanation of this observation? This will help me trace back the issue.

ADD REPLY • link 6.6 years ago by lakhujanivijay 5.8k

1

I don't know how common this is. I help several people troubleshoot problematic samples from diverse sources, and I have seen this happen exactly twice, but I don't know twice among how many cell lines without problems.

One case was contamination; the other, I believe, was mislabeling. They were not my samples though, and in general I don't get much feedback once I detect the problem.

ADD REPLY • link 6.6 years ago by h.mon 35k