Question: Kallisto pseudobam to IGV
1
gravatar for mac
11 months ago by
mac60
UK
mac60 wrote:

Hello,

I am analysing some RNA-seq data with kallisto and I have created a series of sam files using --pseudobam in kallisto.

Is there a way of converting these files in a format that can be easily viewed on IGV?

Thank you in advance! :)

ADD COMMENTlink modified 4 months ago by mmfansler160 • written 11 months ago by mac60
2
gravatar for andrew.j.skelton73
11 months ago by
London
andrew.j.skelton735.0k wrote:

Per @Devon's answer here, you can use samtools to sort your SAM and convert it to BAM.

samtools view -u alignment.sam | samtools sort -@ 4 - output_prefix

Edit: The question is referring to transcriptomic vs genomic mappings for IGV

There's a python script available here found in this answer from SeqAnswers.

Usage:

python3 kallisto_sam_convertor.py pseudoalignment.sam annotation.gtf | samtools view -bS - | samtools sort - -o <output.bam>

bam should be loadable to IGV.

ADD COMMENTlink modified 11 months ago • written 11 months ago by andrew.j.skelton735.0k

Thank you for your reply but my understanding is that the kallisto psedobam (which is actually sam) have coordinates based on the transcriptome used to make the index.

How can I convert these to genomic coordinates so I can make an indexed bam to view in IGV?

ADD REPLYlink written 11 months ago by mac60

I've updated my answer

ADD REPLYlink written 11 months ago by andrew.j.skelton735.0k

Thank you, I will try this :)

ADD REPLYlink written 11 months ago by mac60
1
gravatar for mmfansler
4 months ago by
mmfansler160
MSKCC | New York, NY
mmfansler160 wrote:

Use --genomebam

Introduced in kallisto v0.44.0 is a --genomebam flag, which has quant generate a genome aligned BAM. You must provide an annotation GTF file via the --gtf flag. It is also recommended to include a chrom.sizes file with the --chromsomes flag.


No reads showing?

It might be worth mentioning that I wasn't able to immediately use GENCODE's files for mm10, e.g., gene GTF and transcripts FASTA. The kallisto indexer interpreted the full line from the FASTA read headers as the transcript IDs, but since the headers include much more, the pseduobam procedure couldn't resolve them to a transcript ID in the GTF. It still generated the full BAM, but everything was aligned to "*". Reprocessing the FASTA to only include the transcript ID (everything up to first "|") and reindexing on that got everything working.

ADD COMMENTlink modified 4 months ago • written 4 months ago by mmfansler160

Okay, I will try this. I'm just wondering whether the --chromosomes flag is essential or not.

ADD REPLYlink written 3 months ago by wajm970

It is optional, but if you want an accurate header in the BAM you should use the --chromosomes flag. I tried running without it, and this resulted in all chromosomes being assigned LN:536870911 in the header. A chrom.sizes file is usually readily available. E.g., simple searching "mm10 chrom.sizes" gives me the file in the first link.

ADD REPLYlink written 3 months ago by mmfansler160
0
gravatar for wajm97
10 months ago by
wajm970
wajm970 wrote:

Is this working properly? pseudobam and the script are not working for me.

ADD COMMENTlink modified 10 months ago • written 10 months ago by wajm970

You'll have to give more detail than that, or contact the author of the script.

ADD REPLYlink written 10 months ago by andrew.j.skelton735.0k

I thought kallisto_sam_convertor.py could add gene annotation and somemore information to kallisto pseudobam output. enter image description here

I got this message.

ADD REPLYlink written 10 months ago by wajm970

and test another GTF file. enter image description here

ADD REPLYlink written 10 months ago by wajm970
1

Hi, the author is here. I wrote this script for one use only (I was not trying to make a very general resource for everybody, but maybe I should).

The problem is that you got differently formatted gtf file and I am using instead of some implemented parsers simply my own. These are lines parsing a gtf file into two dictionaries of transcript pointing to their position in genome (one dictionary for + strand and one for -):

pos_tr = {}
neg_tr = {}
with open(sys.argv[2],'r') as anotationfile:
    for entry in anotationfile:
        entry = entry.split('\t')
        key = entry[8].split('"')[1]
        if entry[6] == '+':
            pos_tr[key] = entry[3]
        else:
            neg_tr[key] = entry[3]

The quick solution would be just to find how to read the name of transcript and replace entry[8].split('"')[1] by by an expression what will get a transcript name for your version of gtf.

I also found that there are two bits, that will rename your reference to S5_genome. lines 21 - 23 :

if samline[:3] == '@HD':
        print(samline, end='')
                    print("@SQ  SN:S5_genome    LN:6778833")

and line 30

print('S5_genome', end="\t")

This was not intended as general utility, but a solution for a project during genomics course where we wanted to be able to inspect the job of kallisto (that's why the repo is called after the course : sequence a genome).

ADD REPLYlink modified 10 months ago • written 10 months ago by kamiljaron100
0
gravatar for Antonio R. Franco
10 months ago by
Spain. Universidad de Córdoba
Antonio R. Franco3.6k wrote:

In addition..

Is that bam as accurate and useful than those obtained with splicing aware mappers like STAR o HISAT ?

ADD COMMENTlink written 10 months ago by Antonio R. Franco3.6k
1

Not really, no. As the pseudo alignment will only assign transcriptomic reads that it can definitively say belong to a given transcript, and the abundance estimation then uses an EM (expectation maximisation) function to basically fill in the gaps.

ADD REPLYlink written 10 months ago by andrew.j.skelton735.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1453 users visited in the last hour