Question: Too low mapping percentage using HISAT2 on human reference genome.
0
3 months ago by
Vijay Lakhujani1.4k
India
Vijay Lakhujani1.4k wrote:

Hi all,

I am working with RNA-seq data from human cell lines treated with different drugs, sequenced on Illumina (2x150 bp chemistry). The data generated is 11-14 Gb.

I am using HISAT2 with default parameters to align against the human genome (hg38 build, downloaded from the Ensembl database). To my surprise, the alignment percentage is very low (~2%) for all the samples.

I have the following observations regarding the data quality :

  1. Data quality (Phred score) is good, i.e. in the range of 32-40.
  2. Illumina adapters were already removed using Trimmomatic. No other contaminants.
  3. The FastQC metric 'Per base sequence content' fails for all samples. More specifically, the A, T, G, C percentages deviate considerably at the first 10 base positions.
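The third observation can be checked by hand on a small sample of reads. The following is a minimal sketch, not FastQC's own implementation; the function name `base_composition` and the toy reads are made up for illustration.

```python
from collections import Counter

def base_composition(reads, n_positions=10):
    """For each of the first n_positions, return the percentage of A, C, G, T.

    Percentages are relative to all bases seen at that position
    (so an N counts towards the total but is not reported).
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions].upper()):
            counts[i][base] += 1
    table = []
    for c in counts:
        total = sum(c.values())
        table.append({b: (100.0 * c[b] / total if total else 0.0) for b in "ACGT"})
    return table

# Toy reads, invented for this example only.
reads = ["ACGTACGTAC", "ACGTTTTTTT", "AGGTACGAAA", "ATGTACGCCC"]
for pos, row in enumerate(base_composition(reads, n_positions=4), start=1):
    print(pos, row)
```

A position where one base sits near 100% points towards a fixed adapter-like sequence, while a milder, repeatable wobble over the first several positions is more typical of priming bias.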

Questions

  1. What could possibly be wrong? Is there any other QC I need to reconsider?
  2. More importantly, this is cell-line data, so should I map to a cell-line-specific reference, or is the standard hg38 human reference sufficient?
hisat2 cell line genome • 451 views
ADD COMMENTlink modified 3 months ago • written 3 months ago by Vijay Lakhujani1.4k
2

Hi Vijay,

Can you BLAST a few reads manually and confirm that there was no issue with de-multiplexing or sample labelling? A cell-line-specific reference should not be an issue. You can also try soft-trimming the first 10 bases during alignment. It's not unusual.

Best

ADD REPLYlink modified 3 months ago • written 3 months ago by Satyajeet Khare1.1k

Thanks Satya! I am shuffling the reads and will BLAST a few thousand to check for the same.

ADD REPLYlink written 3 months ago by Vijay Lakhujani1.4k
2

With 2% alignment you don't need a few thousand. Take 10 and that should be plenty to diagnose the issue with BLAST.
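Pulling a handful of random reads out of a large FASTQ file and writing them as FASTA for BLAST could look like the sketch below. It assumes plain 4-line FASTQ records; the helper names (`read_fastq`, `sample_reads`, `to_fasta`) are hypothetical, and the in-memory record stands in for a real file.

```python
import io
import random

def read_fastq(handle):
    """Yield (name, sequence) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line (not needed for BLAST)
        yield header.strip()[1:], seq

def sample_reads(records, k=10, seed=0):
    """Reservoir-sample k records so the whole file is never held in memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec
    return reservoir

def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)

# Toy input, invented for this example; in practice open the real FASTQ file.
fastq_text = "@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n"
picked = sample_reads(read_fastq(io.StringIO(fastq_text)), k=10)
print(to_fasta(picked))
```

Sampling from the whole file rather than taking the first ten records avoids reads from the flow-cell edge, which tend to be of lower quality.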

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax39k

It appears to be a clear case of contamination rather than confusion.

Evidence 1: BLASTing the reads against the NCBI nt database suggests they are hamster, maybe Chinese or golden hamster.

Evidence 2: Mapping against the hamster genome gave ~80% mapping for all samples with HISAT2.

The last thing under investigation is BLASTing the scaffolds of an assembly from one of the samples. If that also hits hamster, which I am 90% sure it will, I will have to trace back the history! :)

ADD REPLYlink modified 3 months ago • written 3 months ago by Vijay Lakhujani1.4k
1

Ouch! Maybe time to start running FastQ Screen routinely.. :)

ADD REPLYlink written 3 months ago by Phil Ewels70
1

Or BBSketch, which only takes a couple seconds :)

ADD REPLYlink written 3 months ago by Brian Bushnell15k

Mapping against hamster genome gave ~80 % mapping for all samples with HISAT2.

Yikes! So the cell line is actually hamster then?

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax39k
1

Looks like CHO cell line contamination.

ADD REPLYlink written 3 months ago by Satyajeet Khare1.1k

Another thing I wanted to add is the following information which I found in the STAR aligner documentation:

It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes.

I have in fact kept only the sequences of chr1-22, X, Y and mitochondrial DNA in my reference genome file. Could that possibly be the reason? In the meantime, I am trying to align the reads to the original reference file to see how it goes!

ADD REPLYlink written 3 months ago by Vijay Lakhujani1.4k
1

I would be very surprised if this has much effect on your alignments. These extra scaffolds are recommended to "soak up" reads which would otherwise misalign, but they shouldn't account for more than a few percent of your library. That can be a problem for downstream analysis, but won't result in a global 2% alignment rate. Even aligning human data against the mouse reference genome can give more alignments than that ;)

ADD REPLYlink written 3 months ago by Phil Ewels70

HAHAHAHA :D

Adding another twist here. What I have used here is the toplevel file and NOT the primary assembly file. I am looking at this blog and what I found is :

"Primary assembly" does not contain the haplotype information. The top level file contains additional sequences that are relatively common variants to the reference. Most mappers available now don't specifically handle these haplo sequences and as such they will appear as simply another contig, therefore complicating the alignment. Perhaps in future, mappers might be better able to handle these hypervariable regions.

So, I am trying the mapping with the primary assembly file as well. Let's see!

ADD REPLYlink written 3 months ago by Vijay Lakhujani1.4k
1

I don't think the toplevel file is an issue either. I have been using the toplevel file for one model organism and have never suffered from low alignment rates.

ADD REPLYlink written 3 months ago by Satyajeet Khare1.1k

It's probably worth trying the 5' trimming approach from my answer below first.. ;) I'm pretty sure that'll fix your problems and it's not to do with your reference.

ADD REPLYlink written 3 months ago by Phil Ewels70

+1 Phil. Indeed, I just confirmed that it does not make much difference. Prima facie, it looks like there is indeed some issue with the data itself, probably contamination, which I am going to confirm further.

ADD REPLYlink written 3 months ago by Vijay Lakhujani1.4k

A short answer to my 2nd question is that yes, I can use the human reference itself. I found a good paper based on a similar study here.

ADD REPLYlink written 3 months ago by Vijay Lakhujani1.4k

While I know that Bowtie2 is not a suitable aligner for RNA-seq data, I am getting ~25% mapping against the human reference in this case. How can we explain this?

ADD REPLYlink written 3 months ago by Vijay Lakhujani1.4k
1

bowtie v.1 does ungapped alignments as opposed to bowtie v.2 which does gapped alignments.

With that 25%, it is possible that you are only seeing alignments to regions that are homologous between human and hamster. You could use bbsplit.sh from BBMap with human/hamster genomes to bin your reads (discarding those that map to both genomes). This may be more of a diagnostic exercise.
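The binning idea can be illustrated without any aligner. The sketch below is not bbsplit.sh and does not reproduce its algorithm; it only shows the bookkeeping, assuming you already have the sets of read names that aligned to each genome (the function and variable names are hypothetical).

```python
def bin_reads(all_reads, human_hits, hamster_hits):
    """Sort read names into bins by which genome(s) they aligned to.

    Reads in the 'both' bin are the ambiguous ones (e.g. conserved regions)
    that the suggestion above would discard.
    """
    human_hits, hamster_hits = set(human_hits), set(hamster_hits)
    bins = {"human_only": [], "hamster_only": [], "both": [], "neither": []}
    for name in all_reads:
        in_human = name in human_hits
        in_hamster = name in hamster_hits
        if in_human and in_hamster:
            bins["both"].append(name)
        elif in_human:
            bins["human_only"].append(name)
        elif in_hamster:
            bins["hamster_only"].append(name)
        else:
            bins["neither"].append(name)
    return bins

# Toy read names, invented for this example.
bins = bin_reads(
    all_reads=["r1", "r2", "r3", "r4"],
    human_hits=["r1", "r2"],
    hamster_hits=["r2", "r3"],
)
for label, names in bins.items():
    print(label, names)
```

If nearly all human-mapping reads also land in the "both" bin, that supports the explanation that the 25% comes from regions conserved between the two species.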

If this data was supposed to be from a human cell line then this experiment is a lost cause. You (or someone else) are going to have to start backtracking (sequencing --> library prep --> sample submission --> RNA prep --> cell lines) to identify where the problem may lie. It could be at any one of these steps unless the original cell line can be quickly identified as the root cause.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax39k

RNA-seq data has reads that span exon-exon boundaries. Bowtie will not align these reads, so you'll only get back alignments that lie entirely within an exon. TopHat, STAR etc. are "splice-aware" aligners, so they work with RNA-seq data.
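A toy example makes the point concrete. The sequences below are invented, and the search is a naive exact-match sketch, not how Bowtie, TopHat or STAR work internally; it only shows why a junction-spanning read has no contiguous match in the genome.

```python
def ungapped_hit(read, genome):
    """Position of an exact, contiguous match of the read, or -1 if none."""
    return genome.find(read)

def spliced_hit(read, genome, min_anchor=4):
    """Try to place the read as two exact pieces separated by a gap.

    Returns (left_start, left_length, right_start) for the first split
    that works, or None. Each piece must be at least min_anchor long.
    """
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:cut], read[cut:]
        left_pos = genome.find(left)
        if left_pos == -1:
            continue
        # The right piece must start after a gap of at least one base.
        right_pos = genome.find(right, left_pos + len(left) + 1)
        if right_pos != -1:
            return left_pos, cut, right_pos
    return None

# Made-up gene: two exons separated by an intron.
exon1 = "ATGGCCAAGT"
intron = "GTAAGTTTTTCCCCAG"
exon2 = "CTGACGGATC"
genome = exon1 + intron + exon2

# A read from the mature transcript, spanning the exon-exon junction.
read = exon1[-6:] + exon2[:6]

print("ungapped:", ungapped_hit(read, genome))
print("spliced: ", spliced_hit(read, genome))
```

The contiguous search fails because the intron is present in the genome but absent from the read, while allowing one long gap recovers both halves.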

ADD REPLYlink written 3 months ago by Phil Ewels70
1
3 months ago by
h.mon9.6k
Brazil
h.mon9.6k wrote:

I've seen cell lines being contaminated or swapped by other lines - in the "good" cases by cell lines from other species (good because when this happens it is really easy to detect). Your low mapping rate suggests you are mapping to the wrong species. Centrifuge is really handy for detecting this kind of contamination - the problem is downloading or building an nt index.

ADD COMMENTlink written 3 months ago by h.mon9.6k

Whilst I agree that contamination with another species is a concern, I can't see how this would affect the base composition for the first 10 bp of each read?

ADD REPLYlink written 3 months ago by Phil Ewels70
1

Yes, this is a different issue. I would guess less-than-random hexamer priming.

ADD REPLYlink written 3 months ago by h.mon9.6k

Ah yes, absolutely. I guess it depends on how extreme the base composition wobbling is.. I was imagining 100% for each base but on re-reading the OP it could well be hexamer priming as you say.

ADD REPLYlink written 3 months ago by Phil Ewels70

I've seen cell lines being contaminated or swapped by other lines

Dear h.mon,

How common is this observation? What is the most suitable explanation of this observation? This will help me trace back the issue.

ADD REPLYlink modified 3 months ago • written 3 months ago by Vijay Lakhujani1.4k
1

I don't know how common this is. I help several people troubleshoot problematic samples from diverse sources. I have seen this happen exactly twice, but I don't know twice among how many cell lines without problems.

Once it was contamination; the other time I believe it was mislabeling. They were not my samples though, and once I detect the problem I generally don't get much feedback.

ADD REPLYlink written 3 months ago by h.mon9.6k
0
3 months ago by
Phil Ewels70
Sweden / Stockholm / SciLifeLab
Phil Ewels70 wrote:

Hi Vijay,

Do you know how the library was prepared? Strong deviation in the nucleotide composition for the first 10 bases sounds very much like an adapter sequence. Some library preparations (e.g. the Clontech SMARTer Pico kit, amongst others) require trimming from the 5' end of sequences before alignment. These adapters are different from the usual Illumina sequencing adapters and will not be removed by default by most trimming tools.

If you know the library preparation kit, you can look up the kit's documentation and you'll usually find instructions. For example, from the Pico docs:

IMPORTANT: The first three nucleotides of the first sequencing read (Read 1) are derived from the template-switching oligo. These three nucleotides must be trimmed prior to mapping.

But, as you already have the nucleotide content plot, you can do without this and just remove the first 10 bases from the 5' end of each read. Then your alignment rate should hopefully improve.
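As a rough sketch of that hard-trimming step, assuming plain 4-line FASTQ records (the function name `trim_fastq_5prime` is hypothetical, and the record below is a toy): HISAT2 also documents a `-5/--trim5` option that trims at alignment time, so check `hisat2 --help` for your version before writing new files.

```python
import io

def trim_fastq_5prime(in_handle, out_handle, n=10):
    """Copy a 4-line-per-record FASTQ stream, dropping the first n bases
    (and their quality values) from every read. Returns the record count.

    Reads shorter than n come out empty; filter those separately if needed.
    """
    count = 0
    while True:
        header = in_handle.readline()
        if not header:
            break
        seq = in_handle.readline().rstrip("\n")
        plus = in_handle.readline()
        qual = in_handle.readline().rstrip("\n")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header.strip()}")
        out_handle.write(header)
        out_handle.write(seq[n:] + "\n")
        out_handle.write(plus)
        out_handle.write(qual[n:] + "\n")
        count += 1
    return count

# Toy record, invented for this example; in practice pass open file handles.
src = io.StringIO("@read1\nACGTACGTACGTACG\n+\nIIIIIIIIIIIIIII\n")
dst = io.StringIO()
trim_fastq_5prime(src, dst, n=10)
print(dst.getvalue())
```

For paired-end data the same trimming would be applied to each mate file, keeping the records in the same order so the pairs stay in sync.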

I hope this helps!

Phil

ADD COMMENTlink written 3 months ago by Phil Ewels70

Unfortunately, I have limited information on the background wet-lab procedure. But yes, let me try the same. Thanks Phil! :)

ADD REPLYlink written 3 months ago by Vijay Lakhujani1.4k