Question: Kallisto will not produce .bam files
2
gravatar for Kristin Muench
18 months ago by
United States
Kristin Muench450 wrote:

Hello,

I'm trying to generate .bam files along with my kallisto output, as described here: https://pachterlab.github.io/kallisto/manual

Unfortunately, when I run this command:

kallisto quant -i $kallistoIdx -o $outputFileLoc/$sample --pseudobam -b 100 $Read1 $Read2

I only see the standard three outputs:

run_info.json
abundance.tsv
abundance.h5

Is it not sufficient to just include the --pseudobam flag? Is the threading (-b) command messing it up somehow?

Thanks for your help! Kristin

rna-seq kallisto • 1.1k views
ADD COMMENTlink modified 18 months ago by Kevin Blighe50k • written 18 months ago by Kristin Muench450

Which version of Kallisto are you using? I am using kallisto 0.44.0 and I tried to reproduce the issue you were mentioning and could not. I used the same command in OP (with input and output directories changed and added threads). Output included pseudoalignments.bam, as mentioned in the manual. I am on mint 18.3 (with ubuntu xenial base).

Input:

$ kallisto quant -i chr22_kallisto -o ./test --pseudobam -b 100 hcc1395_normal_rep1_r1.fastq.gz hcc1395_normal_rep1_r2.fastq.gz -t 2

output:

$ ls
abundance.h5  abundance.tsv  pseudoalignments.bam  run_info.json
ADD REPLYlink modified 18 months ago • written 18 months ago by cpad011212k

I reproduced the problem with my version, 0.43.1

Perhaps the group updated it in 0.44

ADD REPLYlink written 18 months ago by Kevin Blighe50k

Following bash loop had no problem in output (pseudoalignments.bam was present in designated folder) :

   $ for i in $(ls *.fastq.gz);do  mkdir ${i%_rep*}; kallisto quant -i chr22_kallisto -o ${i%_rep*} --pseudobam -b 100 ${i%r*}r1.fastq.gz ${i%r*}r2.fastq.gz -t 2;done

and parallel:

$ parallel echo '{= s/_rep[1-9].*//g =}' ::: *.gz | uniq | parallel 'mkdir {} && kallisto quant -i chr22_kallisto -o {} --pseudobam -b 100 {}_rep1_r1.fastq.gz {}_rep1_r2.fastq.gz -t 2'

input files :hcc1395_normal_rep1_r1.fastq.gz, hcc1395_normal_rep1_r1.fastq.gz.

output:

$ ls hcc1395_normal
abundance.h5  abundance.tsv  pseudoalignments.bam  run_info.json

Only problem with the script is that it tries to create same directory twice because of mkdir ${i%_rep*} and complains that directory already exists, when it tries to create second time. Other than that, there is no issue with kallisto. Standalone, bash loop and parallel (all the scripts) produced pseudoalignments.bam.

@ OP: echo input and output files and directories.

ADD REPLYlink modified 18 months ago • written 18 months ago by cpad011212k

Yes, there is no issue with Kallisto. Just different behaviour across different versions. It does not help that the developers contradict themselves on their own website, evidently not updating all of their documentation in accordance with newer version changes.

ADD REPLYlink written 18 months ago by Kevin Blighe50k

The output of kallisto quant when run with the --pseudobam option is a SAM formatted stream. This stream can be redirected to a SAM file or directly to a BAM file by piping into samtools view -Sb - > output.bam. Examples of how to do this are shown in the Manual.

[source: https://pachterlab.github.io/kallisto/pseudobam.html]

...okay, let's go to the manual, as directed by the above statement:

------------------------

Pseudobam

--pseudobam outputs all pseudoalignments to a file pseudoalignments.bam in the output directory. This BAM file contains the pseudoalignments in BAM format, ordered by reads so that each pseudoalignment of a read is adjacent in the BAM file.

[source: https://pachterlab.github.io/kallisto/manual.html]

...ta-da!

ADD REPLYlink modified 18 months ago • written 18 months ago by Kevin Blighe50k

OP's issue is that Kallisto outputs only three files instead of 4, missing pseudoalignments.bam and if bootstrapping somehow affects bam output. I am trying to convey (to OP) that with current version of kallisto and OP code, pseudoalignments.bam is generated with both standalone and loop scripts with bootstrapping. I have inquired about kallisto version used in script execution by OP.

ADD REPLYlink modified 18 months ago • written 18 months ago by cpad011212k

Yes, and solved below.

ADD REPLYlink written 18 months ago by Kevin Blighe50k
4
gravatar for Kevin Blighe
18 months ago by
Kevin Blighe50k
Kevin Blighe50k wrote:

You're running that as a loop, right? The --pseudobam command line parameter outputs to terminal in SAM format. You won't see it if you are running that as a loop, or even as a shell script as they are run as separate processes.

Here's a test that I did on my own computer

./kallisto quant -i test/transcripts -o . --pseudobam \
test/reads_1.fastq.gz test/reads_2.fastq.gz | head -23

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 15
[index] number of k-mers: 18,902
[index] number of equivalence classes: 22
[quant] running in paired-end mode
[quant] will process pair 1: test/reads_1.fastq.gz
                             test/reads_2.fastq.gz
[quant] finding pseudoalignments for the reads ...@HD   VN:1.0
@SQ SN:NM_001168316 LN:2283
@SQ SN:NM_174914    LN:2385
@SQ SN:NR_031764    LN:1853
@SQ SN:NM_004503    LN:1681
@SQ SN:NM_006897    LN:1541
@SQ SN:NM_014212    LN:2037
@SQ SN:NM_014620    LN:2300
@SQ SN:NM_017409    LN:1959
@SQ SN:NM_017410    LN:2396
@SQ SN:NM_018953    LN:1612
@SQ SN:NM_022658    LN:2288
@SQ SN:NM_153633    LN:1666
@SQ SN:NM_153693    LN:2072
@SQ SN:NM_173860    LN:849
@SQ SN:NR_003084    LN:1640
@PG ID:kallisto PN:kallisto VN:0.43.1
1:NM_014620:16:182  99  NM_014620   17  255 50M =   149 182 GTTCCGAGCGCTCCGCAGAACAGTCCTCCCTGTAAGAGCCTAACCATTGC  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NH:i:3
1:NM_014620:16:182  355 NM_153693   17  255 50M =   149 182 GTTCCGAGCGCTCCGCAGAACAGTCCTCCCTGTAAGAGCCTAACCATTGC  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NH:i:3
1:NM_014620:16:182  355 NR_003084   17  255 50M =   149 182 GTTCCGAGCGCTCCGCAGAACAGTCCTCCCTGTAAGAGCCTAACCATTGC  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NH:i:3
1:NM_014620:16:182  147 NM_014620   149 255 50M =   17  -182    TAATTTTTTTTCCTCCCAGGTGGAGTTGCCGAAGCTGGGGGCAGCTGGGG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NH:i:3
1:NM_014620:16:182  403 NM_153693   149 255 50M =   17  -182    TAATTTTTTTTCCTCCCAGGTGGAGTTGCCGAAGCTGGGGGCAGCTGGGG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NH:i:3
1:NM_014620:16:182  403 NR_003084   149 255 50M =   17  -182    TAATTTTTTTTCCTCCCAGGTGGAGTTGCCGAAGCTGGGGGCAGCTGGGG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NH:i:3

It still also produces the standard output files, as you've noticed:

ls -l
total 17084
-rw-rw-r-- 1 kblighe kblighe    35848 Apr 29 13:27 abundance.h5
-rw-rw-r-- 1 kblighe kblighe      591 Apr 29 13:27 abundance.tsv
-rwxr-xr-x 1 kblighe kblighe 17435774 Mar 20  2017 kallisto
-rw-r--r-- 1 kblighe kblighe     1366 Aug  1  2017 license.txt
-rw-r--r-- 1 kblighe kblighe     2253 Mar 20  2017 README.md
-rw-rw-r-- 1 kblighe kblighe      270 Apr 29 13:27 run_info.json
drwxr-xr-x 2 kblighe kblighe     4096 Apr 29 13:25 test

------------------------

So, in your loop, you will have to create a redirection into samtools to save as SAM/BAM

./kallisto quant -i test/transcripts -o . --pseudobam -b 100 \
test/reads_1.fastq.gz test/reads_2.fastq.gz | samtools view -bS - > align.bam

ls -l
total 17796
-rw-rw-r-- 1 kblighe kblighe    35848 Apr 29 13:50 abundance.h5
-rw-rw-r-- 1 kblighe kblighe      591 Apr 29 13:50 abundance.tsv
-rw-rw-r-- 1 kblighe kblighe   728763 Apr 29 13:50 align.bam
-rwxr-xr-x 1 kblighe kblighe 17435774 Mar 20  2017 kallisto
-rw-r--r-- 1 kblighe kblighe     1366 Aug  1  2017 license.txt
-rw-r--r-- 1 kblighe kblighe     2253 Mar 20  2017 README.md
-rw-rw-r-- 1 kblighe kblighe      270 Apr 29 13:50 run_info.json
drwxr-xr-x 2 kblighe kblighe     4096 Apr 29 13:25 test

For you:

kallisto quant -i "${kallistoIdx}" -o "${outputFileLoc}"/"${sample}" --pseudobam \
-b 100 "${Read1}" "${Read2}" | samtools view -bS - > "${sample}".bam

Kevin

ADD COMMENTlink modified 18 months ago • written 18 months ago by Kevin Blighe50k
1

Kallisto manual says

--pseudobam outputs all pseudoalignments to a file pseudoalignments.bam in the output directory.

Based on your test (and per OP) it does not happen in reality. Perhaps the behavior changed somewhere along the line but the manual was not updated.

ADD REPLYlink modified 18 months ago • written 18 months ago by genomax73k

Aha, I see! In fact, I did see the 'bam content' in my output as you predicted, not a separate pseudoalignments.bam file:

>> head kall_pipeScratch_1192561.out

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 176,984
[index] number of k-mers: 100,763,418
[index] number of equivalence classes: 671,993
[quant] running in paired-end mode
[quant] will process pair 1: batch2_pool1/P29-CGATGT_S1_R1_001.fastq
                             batch2_pool1/P29-CGATGT_S1_R2_001.fastq
[quant] finding pseudoalignments for the reads ...NS500735:535:H5KKMBGX3:1:12104:9668:6289  419 ENST00000543413 1154    255 76M =   1196    118 TGGAAGATTCAGATGGAGCTGTATTGTGTATGGATGCAAAGATCAATTTTGACTCTAATTCAGCCTATCGCCAAAA    AAAAAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEAEE<EEEEEEEEEEEE    NH:i:10

Thank you so much for suggesting the pipe method, will give it a try...

ADD REPLYlink written 18 months ago by Kristin Muench450

The documentation on their website is contradictory. Please see the thread with cpad (above). I am guessing that you are using a version pre-0.44, like me.

ADD REPLYlink written 18 months ago by Kevin Blighe50k

Just curious. Would the output of kallisto be fed into Samtools and possibly cause incorrect format?

For example this trunk of text:

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 15
[index] number of k-mers: 18,902
[index] number of equivalence classes: 22
[quant] running in paired-end mode
[quant] will process pair 1: test/reads_1.fastq.gz
                             test/reads_2.fastq.gz
[quant] finding pseudoalignments for the reads ...@HD   VN:1.0
ADD REPLYlink modified 11 months ago • written 11 months ago by izzy.yichao.cai160
1

In bioinformatics, anything is possible (and usually it is bad). What was your initial command?

ADD REPLYlink written 11 months ago by Kevin Blighe50k

Hi Kelvin,

I am using a similar command as the one in your answer:

## kallisto version 0.43.1
kallisto quant -i /path/to/index/hshg19_ensembl --pseudobam -b 100 --rf-stranded -o ./kallistoResults seqFiles/R1.fastq.gz seqFiles/R2.fastq.gz | samtools view -bS - > kallistoResults/pseudo-alignment.bam

Upon testing, the resulting bam seems to ignore the output by kallisto and only contain the psudoalignment output.

ADD REPLYlink modified 11 months ago • written 11 months ago by izzy.yichao.cai160

Yes, the idea is that it only outputs a pseudobam, which may not be compatible with downstream programs. I think that most people want to output a pseudobam from Kallisto for the purpose of viewing in IGV.

...or are you referring to something else? Can you paste the BAM header and the first few reads?

ADD REPLYlink written 11 months ago by Kevin Blighe50k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2738 users visited in the last hour