We are testing cDNA sequencing on an ONT MinION, and I am trying to get my head around the different data analysis pipelines.
We did the analysis directly on the MinION (basecalling with Guppy + alignment with minimap2) using the mouse genome ENSEMBL reference found here.
The result is a lot of small fastq + bam files. I joined all the bam files using samtools and got a single bam file which I can display in IGV.
Here is an example of what I see which, at least for this particular gene, seems to match the RefSeq annotation relatively nicely.
As a different strategy, I tried to merge all fastq.gz files from the MinION using
cat *.fastq.gz > merged.fastq.gz
then align them to the same genome using minimap2, after filtering for mean read quality >8 and trimming the first 50 bases of each read with NanoFilt
zcat merged.fastq.gz | NanoFilt -q 8 --headcrop 50 | gzip > filtered-reads.fastq.gz
./minimap2 -ax splice Mus_musculus.GRCm39.dna.primary_assembly.fa filtered-reads.fastq.gz | samtools sort -o merged.bam
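To rule out the merge step itself as a source of trouble, the read count before and after concatenation can be compared. The following is only a minimal sketch on made-up data: the chunk file names, the scratch directories and the `count_reads` helper are hypothetical, and it assumes plain 4-line FASTQ records.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: count FASTQ records (assumes 4 lines per record)
# in one or more gzipped files.
count_reads() {
    gzip -cd "$@" | awk 'END { print NR / 4 }'
}

# Made-up demo data: two tiny gzipped FASTQ chunks in a scratch directory.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$workdir/chunk_0.fastq.gz"
printf '@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' | gzip > "$workdir/chunk_1.fastq.gz"

# Concatenated gzip members form a valid gzip stream, so plain cat is enough.
# The output goes to a different directory so that a merged file left over
# from an earlier run is never matched by the glob and read back in.
outdir=$(mktemp -d)
cat "$workdir"/*.fastq.gz > "$outdir/merged.fastq.gz"

before=$(count_reads "$workdir"/*.fastq.gz)
after=$(count_reads "$outdir/merged.fastq.gz")
echo "reads in chunks: $before, reads in merged file: $after"
```

If the two numbers agree, the merged file contains every read exactly once, and any difference between the two alignments has to come from the filtering or mapping steps rather than from the concatenation.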
If I now look at this in IGV I see this (bottom is the bam from the MinION as in the previous image, top is my mapping)
So, the alignment is completely different, and there are a lot of mismatched bases...
Clearly I am doing something wrong, but I have never used minimap2 before and I find the documentation a bit cryptic at times.
So, here are a few questions I hope you can help me answer:
- Am I correct in using that reference genome, or should I use something else? (It is what I normally use for aligning short reads with STAR.)
- Why are the two approaches giving such different results?
- Even in the analysis from the MinION, I can see a lot of reads aligning outside of exons; how do I interpret that?