Question: Kallisto pseudobam to IGV
gravatar for mac
3.2 years ago by
mac60 wrote:


I am analysing some RNA-seq data with kallisto and I have created a series of sam files using --pseudobam in kallisto.

Is there a way of converting these files in a format that can be easily viewed on IGV?

Thank you in advance! :)

ADD COMMENTlink modified 2.6 years ago by mmfansler350 • written 3.2 years ago by mac60

In addition..

Is that bam as accurate and useful than those obtained with splicing aware mappers like STAR o HISAT ?

ADD REPLYlink written 3.2 years ago by Antonio R. Franco4.5k

Not really, no. As the pseudo alignment will only assign transcriptomic reads that it can definitively say belong to a given transcript, and the abundance estimation then uses an EM (expectation maximisation) function to basically fill in the gaps.

ADD REPLYlink written 3.2 years ago by andrew.j.skelton736.0k
gravatar for andrew.j.skelton73
3.2 years ago by
andrew.j.skelton736.0k wrote:

Per @Devon's answer here, you can use samtools to sort your SAM and convert it to BAM.

samtools view -u alignment.sam | samtools sort -@ 4 - output_prefix

Edit: The question is referring to transcriptomic vs genomic mappings for IGV

There's a python script available here found in this answer from SeqAnswers.


python3 pseudoalignment.sam annotation.gtf | samtools view -bS - | samtools sort - -o <output.bam>

bam should be loadable to IGV.

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by andrew.j.skelton736.0k

Thank you for your reply but my understanding is that the kallisto psedobam (which is actually sam) have coordinates based on the transcriptome used to make the index.

How can I convert these to genomic coordinates so I can make an indexed bam to view in IGV?

ADD REPLYlink written 3.2 years ago by mac60

I've updated my answer

ADD REPLYlink written 3.2 years ago by andrew.j.skelton736.0k

Thank you, I will try this :)

ADD REPLYlink written 3.2 years ago by mac60
gravatar for mmfansler
2.6 years ago by
MSKCC | New York, NY
mmfansler350 wrote:

Use --genomebam

Introduced in kallisto v0.44.0 is a --genomebam flag, which has quant generate a genome aligned BAM. You must provide an annotation GTF file via the --gtf flag. It is also recommended to include a chrom.sizes file with the --chromsomes flag.

No reads showing?

It might be worth mentioning that I wasn't able to immediately use GENCODE's files for mm10, e.g., gene GTF and transcripts FASTA. The kallisto indexer interpreted the full line from the FASTA read headers as the transcript IDs, but since the headers include much more, the pseduobam procedure couldn't resolve them to a transcript ID in the GTF. It still generated the full BAM, but everything was aligned to "*". Reprocessing the FASTA to only include the transcript ID (everything up to first "|") and reindexing on that got everything working.

Another issue I've come across is a mismatch between the chromosome names in the GTF and those in the chrom.sizes file. Kallisto will throw a warning about not finding chromosomes for transcripts if this is the case. A simple test to see if this is the issue is to run without a --chromosomes flag, in which case, kallisto defaults to simply using the chromosome names in the GTF.

Troubleshooting Custom Transcriptomes/GTF

If you are generating your own GTF, it might be worth having a look at some compatibility guidelines I posted over on the kallisto repository. Simple sorting by seqname + start is not sufficient, e.g. exons must be rank-ordered.

ADD COMMENTlink modified 16 months ago • written 2.6 years ago by mmfansler350

The gencode annotation was the issue for me as well.

The following command edits the gencode fasta read headers to only include the first transcript field:

sed -r 's/^(>[^\|]+)\|.*/\1/g' gencode.v28.transcripts.fa > gencode.v28.transcripts.onlyTxLineHeaders.fa

After reindexing on this fasta, the "mapping to genome" step worked, and the output files had chromosome names and positions included.

ADD REPLYlink written 2.1 years ago by johnston.mike.j10

Okay, I will try this. I'm just wondering whether the --chromosomes flag is essential or not.

ADD REPLYlink written 2.5 years ago by wajm970

It is optional, but if you want an accurate header in the BAM you should use the --chromosomes flag. I tried running without it, and this resulted in all chromosomes being assigned LN:536870911 in the header. A chrom.sizes file is usually readily available. E.g., simple searching "mm10 chrom.sizes" gives me the file in the first link.

ADD REPLYlink written 2.5 years ago by mmfansler350
gravatar for wajm97
3.2 years ago by
wajm970 wrote:

Is this working properly? pseudobam and the script are not working for me.

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by wajm970

You'll have to give more detail than that, or contact the author of the script.

ADD REPLYlink written 3.2 years ago by andrew.j.skelton736.0k

I thought could add gene annotation and somemore information to kallisto pseudobam output. enter image description here

I got this message.

ADD REPLYlink written 3.2 years ago by wajm970

and test another GTF file. enter image description here

ADD REPLYlink written 3.2 years ago by wajm970

Hi, the author is here. I wrote this script for one use only (I was not trying to make a very general resource for everybody, but maybe I should).

The problem is that you got differently formatted gtf file and I am using instead of some implemented parsers simply my own. These are lines parsing a gtf file into two dictionaries of transcript pointing to their position in genome (one dictionary for + strand and one for -):

pos_tr = {}
neg_tr = {}
with open(sys.argv[2],'r') as anotationfile:
    for entry in anotationfile:
        entry = entry.split('\t')
        key = entry[8].split('"')[1]
        if entry[6] == '+':
            pos_tr[key] = entry[3]
            neg_tr[key] = entry[3]

The quick solution would be just to find how to read the name of transcript and replace entry[8].split('"')[1] by by an expression what will get a transcript name for your version of gtf.

I also found that there are two bits, that will rename your reference to S5_genome. lines 21 - 23 :

if samline[:3] == '@HD':
        print(samline, end='')
                    print("@SQ  SN:S5_genome    LN:6778833")

and line 30

print('S5_genome', end="\t")

This was not intended as general utility, but a solution for a project during genomics course where we wanted to be able to inspect the job of kallisto (that's why the repo is called after the course : sequence a genome).

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by kamiljaron160
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1437 users visited in the last hour