Visualizing alignment of kallisto .bam files to hg19 (and custom plasmids)
2
0
Entering edit mode
2.7 years ago

Hello,

I'm trying to visualize kallisto read alignment to a human genome plus an additional sequence using IGV. I can produce .bam and .bai files using kallisto, but when I open them in IGV alongside the built-in "hg19" option, I don't see anything:

# Gathering Files

First, I downloaded my reference human genome from here, along with a gtf:

wget 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/mrna.fa.gz'


Gtf downloaded from the UCSC genome browser using these instructions: hg38 annotation file (gtf) UCSC format

head plasmidFasta/plas_shp_up.fa
>plas_shp
GTCGACATTGATTATTGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATA
GCCCATATATGGAGTTCCGCGTTACATAACTTACGGTAAATGGCCCGCCTGGCTGACCGC
CCAACGACCCCCGCCCATTGACGTCAATAATGACGTATGTTCCCATAGTAACGCCAATAG
GGACTTTCCATTGACGTCAATGGGTGGAGTATTTACGGTAAACTGCCCACTTGGCAGTAC
ATCAAGTGTATCATATGCCAAGTACGCCCCCTATTGACGTCAATGACGGTAAATGGCCCG
CCTGGCATTATGCCCAGTACATGACCTTATGGGACTTTCCTACTTGGCAGTACATCTACG
TATTAGTCATCGCTATTACCATGGTCGAGGTGAGCCCCACGTTCTGCTTCACTCTCCCCA
TCTCCCCCCCCTCCCCACCCCCAATTTTGTATTTATTTATTTTTTAATTATTTTGTGCAG
CGATGGGGGCGGGGGGGGGGGGGGGGCGCGCGCCAGGCGGGGCGGGGCGGGGCGAGGGGC


# Producing Index

I begin by producing my kallisto index. The code looks like this:

#declare Variables
humFa="myPath/hg19_transcriptome/*.fa"
plasFa="myPath/plasmidFasta/*.fa"

# run index gen
kallisto index -i /myPath/humPlas_kallisto_transcripts_plasmids_hg19.idx $humFa$plasFa --make-unique


# Producing .bam file

I modified the gtf file I downloaded above to add my extra sequences, so it looked like this:

tail myGtf.gtf
chr19_gl000209_random   hg19_refGene    stop_codon  145118  145120  0.000000    +   .   gene_id "NM_012312"; transcript_id "NM_012312";
chr19_gl000209_random   hg19_refGene    exon    145079  145743  0.000000    +   .   gene_id "NM_012312"; transcript_id "NM_012312";
plas_shp    AddedGenes  exon    1   12915   .   +   0   gene_id "plas_shp"; gene_name "plas_shp"; transcript_id "plas_shp"; tss_id "plas_shp";


Then I aligned kallisto to the index I produced above using the above:

# run index gen
kallisto quant -i $kallistoIdx -o$outputFileLoc/$sample --genomebam --gtf$myGtf $Read1$Read2 | samtools view -bS > sample_align.bam  When I convert the output .bam file to .sam, it looks like this: NS500127:27:HW5JKBGXY:1:11101:17213:1141 131 * 0 0 76M * 0 0 GGCGTTGCTGCGGGTGGCCTGCGTGTGCCGCTTATGGAGGGAGTGTGTGCGCAGAGTATTGCGGACCCATCGGAGC AAAAAEEEEEAEEEAEEEEE/AEE/AEEEEEEAEEEE/EEEEE/EAE/AAE//E/AEE/EEAEA//E/<A/EEEEA ZW:f:0 NS500127:27:HW5JKBGXY:1:11101:24672:1142 67 * 0 0 76M * 0 0 TTTGATCTTCTTAGTTTTCTTCTTCTTATCCTTACCGCTGTCATCCTCCTCATCTGAACCCACATCTTCGATCTTG AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEAEEEEEEEEEAE/EEEEEEEEEEEE//EEEEA ZW:f:0  # The problem When I try to open the .bam file in IGV, I don't see any alignments. I've tried doing the above with hg19/GRCh37 and 38 files from different locations. I've tried including and excluding the chromSizes file. I'm not sure what else to try. Is there anything immediately obviously problematic about my approach? Is there a viewer that might be more forgiving than IGV when it comes to visualizing kallisto .bam files? Has anyone successfully managed to visualize their kallisto .bam files in IGV before? Thank you for your help! IGV kallisto RNA-Seq • 1.9k views ADD COMMENT 2 Entering edit mode 2.7 years ago pmelsted ▴ 110 Everything up to the GTF file was correct. The only thing missing is that kallisto requires a gene entry for each transcript and the exons of each transcript. Also it's important the the field separators for the GTF are TAB characters, where as the key-value entries are separted by space and ; plas_shp AddedGenes gene 1 12915 . + 0 gene_id "plas_shp"; gene_name "plas_shp"; plas_shp AddedGenes transcript 1 12915 . + 0 gene_id "plas_shp"; gene_name "plas_shp"; transcript_id "plas_shp"; plas_shp AddedGenes exon 1 12915 . + 0 gene_id "plas_shp"; gene_name "plas_shp"; transcript_id "plas_shp"; tss_id "plas_shp";  this line kallisto quant -ikallistoIdx -o $outputFileLoc/$sample --genomebam --gtf $myGtf$Read1 $Read2 | samtools view -bS >$sample_align.bam is unnecessary since kallisto with --genomebam doesn't write to standard output, but produces a sorted indexed BAM file under outputFolder/pseudoalignments.bam.

After this you should be able to view this using IGV. You need to either index your plasmid fasta file using samtools faidx or you can add them as extra chromosomes to the Human reference (but give them a new chromosome name and modify the GTF and run samtools faidx again).

I tested this on a small example

@r1
TGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATAGCCCATATATGGAGT
+
AAAAAEEEEEAEEEAEEEEE/AEE/AEEEEEEAEEEE/EEEEE/EAE/AAE//E/AEE/E
@r2
TGGGCATAATGCCAGGCGGGCCATTTACCGTCATTGACGTCAATA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEA


using the head of the plasmid sequence you provided and got the following results kallisto quant -i ind -o out --genomebam --gtf plas.gtf --single -l 200 -s20 r1.fastq and looking at the output samtools view out/pseudoalignments.bam

r1  0   plas_shp    16  255 60M *   0   0   TGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATAGCCCATATATGGAGT    AAAAAEEEEEAEEEAEEEEE/AEE/AEEEEEEAEEEE/EEEEE/EAE/AAE//E/AEE/E    ZW:f:1
r2  16  plas_shp    272 255 45M *   0   0   TATTGACGTCAATGACGGTAAATGGCCCGCCTGGCATTATGCCCA   AEEEEEEEAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA   ZW:f:1

0
Entering edit mode

Thank you for this extremely helpful answer! I'll give it a try!

0
Entering edit mode

Thank you again for your suggested tips. I tried editing my gtf, removing the samview portion of my kallisto quant line, and re-running the pipeline. This time, I did it both the original way (trying to build index from plasmid+ transcriptome) as well as a simplified vdrsion (trying to build index from transcriptome .fa only). Unfortunately, both versions failed to align properly (e.g. sam output very similar to what I posted above with no chromosome names) and I could not visualize them on IGV. I'll focus on figuring out what's wrong with just the basic transcriptome alignment.

Regarding your other suggestion using samtools faidx - could you help me understand how this would enter into the pipeline? I see that it makes .fa.fai files, but if you aren't feeding .fai files into kallisto index, just .fa files (see section "producing index" for example), why would it matter they exist?

Thank you again for your help!

0
Entering edit mode
2.7 years ago
swbarnes2 9.9k

Your bam has no indication whatsoever what chromosome each read goes to. So of course IGV can't put reads on the human chromosomes.

If you want to look at alignments in IGV, you need to align to the human genome with a real aligner, like STAR, not just count transcripts with kallisto.

0
Entering edit mode

Thanks for pointing out where it's falling down. You actually can use IGV with kallisto now: https://pachterlab.github.io/kallisto/pseudobam.html

0
Entering edit mode

Thanks for pointing out where it's falling down. You actually can use IGV with kallisto now: https://pachterlab.github.io/kallisto/pseudobam.html

0
Entering edit mode

Thanks for pointing out where it's falling down. You actually can use IGV with kallisto now: https://pachterlab.github.io/kallisto/pseudobam.html