Kallisto pseudobam to IGV
3
2
Entering edit mode
5.1 years ago
mac ▴ 70

Hello,

I am analysing some RNA-seq data with kallisto and I have created a series of sam files using --pseudobam in kallisto.

Is there a way of converting these files in a format that can be easily viewed on IGV?

RNA-Seq kallisto pseudobam IGV alignment • 5.8k views
0
Entering edit mode

Is that bam as accurate and useful than those obtained with splicing aware mappers like STAR o HISAT ?

1
Entering edit mode

Not really, no. As the pseudo alignment will only assign transcriptomic reads that it can definitively say belong to a given transcript, and the abundance estimation then uses an EM (expectation maximisation) function to basically fill in the gaps.

2
Entering edit mode
5.1 years ago

Per @Devon's answer here, you can use samtools to sort your SAM and convert it to BAM.

samtools view -u alignment.sam | samtools sort -@ 4 - output_prefix


Edit: The question is referring to transcriptomic vs genomic mappings for IGV

There's a python script available here found in this answer from SeqAnswers.

Usage:

python3 kallisto_sam_convertor.py pseudoalignment.sam annotation.gtf | samtools view -bS - | samtools sort - -o <output.bam>

bam should be loadable to IGV.

0
Entering edit mode

Thank you for your reply but my understanding is that the kallisto psedobam (which is actually sam) have coordinates based on the transcriptome used to make the index.

How can I convert these to genomic coordinates so I can make an indexed bam to view in IGV?

0
Entering edit mode

0
Entering edit mode

Thank you, I will try this :)

2
Entering edit mode
4.5 years ago
mmfansler ▴ 410

## Use --genomebam

Introduced in kallisto v0.44.0 is a --genomebam flag, which has quant generate a genome aligned BAM. You must provide an annotation GTF file via the --gtf flag. It is also recommended to include a chrom.sizes file with the --chromsomes flag.

It might be worth mentioning that I wasn't able to immediately use GENCODE's files for mm10, e.g., gene GTF and transcripts FASTA. The kallisto indexer interpreted the full line from the FASTA read headers as the transcript IDs, but since the headers include much more, the pseduobam procedure couldn't resolve them to a transcript ID in the GTF. It still generated the full BAM, but everything was aligned to "*". Reprocessing the FASTA to only include the transcript ID (everything up to first "|") and reindexing on that got everything working.

Another issue I've come across is a mismatch between the chromosome names in the GTF and those in the chrom.sizes file. Kallisto will throw a warning about not finding chromosomes for transcripts if this is the case. A simple test to see if this is the issue is to run without a --chromosomes flag, in which case, kallisto defaults to simply using the chromosome names in the GTF.

### Troubleshooting Custom Transcriptomes/GTF

If you are generating your own GTF, it might be worth having a look at some compatibility guidelines I posted over on the kallisto repository. Simple sorting by seqname + start is not sufficient, e.g. exons must be rank-ordered.

1
Entering edit mode

The gencode annotation was the issue for me as well.

The following command edits the gencode fasta read headers to only include the first transcript field:

sed -r 's/^(>[^\|]+)\|.*/\1/g' gencode.v28.transcripts.fa > gencode.v28.transcripts.onlyTxLineHeaders.fa


After reindexing on this fasta, the "mapping to genome" step worked, and the output files had chromosome names and positions included.

0
Entering edit mode

Okay, I will try this. I'm just wondering whether the --chromosomes flag is essential or not.

0
Entering edit mode

It is optional, but if you want an accurate header in the BAM you should use the --chromosomes flag. I tried running without it, and this resulted in all chromosomes being assigned LN:536870911 in the header. A chrom.sizes file is usually readily available. E.g., simple searching "mm10 chrom.sizes" gives me the file in the first link.

0
Entering edit mode
5.0 years ago
wajm97 • 0

Is this working properly? pseudobam and the script are not working for me.

0
Entering edit mode

You'll have to give more detail than that, or contact the author of the script.

0
Entering edit mode

I got this message.

0
Entering edit mode

and test another GTF file.

1
Entering edit mode

Hi, the author is here. I wrote this script for one use only (I was not trying to make a very general resource for everybody, but maybe I should).

The problem is that you got differently formatted gtf file and I am using instead of some implemented parsers simply my own. These are lines parsing a gtf file into two dictionaries of transcript pointing to their position in genome (one dictionary for + strand and one for -):

pos_tr = {}
neg_tr = {}
with open(sys.argv[2],'r') as anotationfile:
for entry in anotationfile:
entry = entry.split('\t')
key = entry[8].split('"')[1]
if entry[6] == '+':
pos_tr[key] = entry[3]
else:
neg_tr[key] = entry[3]


The quick solution would be just to find how to read the name of transcript and replace entry[8].split('"')[1] by by an expression what will get a transcript name for your version of gtf.

I also found that there are two bits, that will rename your reference to S5_genome. lines 21 - 23 :

if samline[:3] == '@HD':
print(samline, end='')
print("@SQ  SN:S5_genome    LN:6778833")


and line 30

print('S5_genome', end="\t")


This was not intended as general utility, but a solution for a project during genomics course where we wanted to be able to inspect the job of kallisto (that's why the repo is called after the course : sequence a genome).