Question: Kallisto will not produce .bam files
2
gravatar for Kristin Muench
2.3 years ago by
United States
Kristin Muench520 wrote:

Hello,

I'm trying to generate .bam files along with my kallisto output, as described here: https://pachterlab.github.io/kallisto/manual

Unfortunately, when I run this command:

kallisto quant -i $kallistoIdx -o $outputFileLoc/$sample --pseudobam -b 100 $Read1 $Read2

I only see the standard three outputs:

run_info.json
abundance.tsv
abundance.h5

Is it not sufficient to just include the --pseudobam flag? Is the threading (-b) command messing it up somehow?

Thanks for your help! Kristin

rna-seq kallisto • 1.7k views
ADD COMMENTlink modified 2.3 years ago by Kevin Blighe63k • written 2.3 years ago by Kristin Muench520

Which version of Kallisto are you using? I am using kallisto 0.44.0 and I tried to reproduce the issue you were mentioning and could not. I used the same command in OP (with input and output directories changed and added threads). Output included pseudoalignments.bam, as mentioned in the manual. I am on mint 18.3 (with ubuntu xenial base).

Input:

$ kallisto quant -i chr22_kallisto -o ./test --pseudobam -b 100 hcc1395_normal_rep1_r1.fastq.gz hcc1395_normal_rep1_r2.fastq.gz -t 2

output:

$ ls
abundance.h5  abundance.tsv  pseudoalignments.bam  run_info.json
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by cpad011213k

I reproduced the problem with my version, 0.43.1

Perhaps the group updated it in 0.44

ADD REPLYlink written 2.3 years ago by Kevin Blighe63k

Following bash loop had no problem in output (pseudoalignments.bam was present in designated folder) :

   $ for i in $(ls *.fastq.gz);do  mkdir ${i%_rep*}; kallisto quant -i chr22_kallisto -o ${i%_rep*} --pseudobam -b 100 ${i%r*}r1.fastq.gz ${i%r*}r2.fastq.gz -t 2;done

and parallel:

$ parallel echo '{= s/_rep[1-9].*//g =}' ::: *.gz | uniq | parallel 'mkdir {} && kallisto quant -i chr22_kallisto -o {} --pseudobam -b 100 {}_rep1_r1.fastq.gz {}_rep1_r2.fastq.gz -t 2'

input files :hcc1395_normal_rep1_r1.fastq.gz, hcc1395_normal_rep1_r1.fastq.gz.

output:

$ ls hcc1395_normal
abundance.h5  abundance.tsv  pseudoalignments.bam  run_info.json

Only problem with the script is that it tries to create same directory twice because of mkdir ${i%_rep*} and complains that directory already exists, when it tries to create second time. Other than that, there is no issue with kallisto. Standalone, bash loop and parallel (all the scripts) produced pseudoalignments.bam.

@ OP: echo input and output files and directories.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by cpad011213k

Yes, there is no issue with Kallisto. Just different behaviour across different versions. It does not help that the developers contradict themselves on their own website, evidently not updating all of their documentation in accordance with newer version changes.

ADD REPLYlink written 2.3 years ago by Kevin Blighe63k

The output of kallisto quant when run with the --pseudobam option is a SAM formatted stream. This stream can be redirected to a SAM file or directly to a BAM file by piping into samtools view -Sb - > output.bam. Examples of how to do this are shown in the Manual.

[source: https://pachterlab.github.io/kallisto/pseudobam.html]

...okay, let's go to the manual, as directed by the above statement:

------------------------

Pseudobam

--pseudobam outputs all pseudoalignments to a file pseudoalignments.bam in the output directory. This BAM file contains the pseudoalignments in BAM format, ordered by reads so that each pseudoalignment of a read is adjacent in the BAM file.

[source: https://pachterlab.github.io/kallisto/manual.html]

...ta-da!

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Kevin Blighe63k

OP's issue is that Kallisto outputs only three files instead of 4, missing pseudoalignments.bam and if bootstrapping somehow affects bam output. I am trying to convey (to OP) that with current version of kallisto and OP code, pseudoalignments.bam is generated with both standalone and loop scripts with bootstrapping. I have inquired about kallisto version used in script execution by OP.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by cpad011213k

Yes, and solved below.

ADD REPLYlink written 2.3 years ago by Kevin Blighe63k
4
gravatar for Kevin Blighe
2.3 years ago by
Kevin Blighe63k
Kevin Blighe63k wrote:

You're running that as a loop, right? The --pseudobam command line parameter outputs to terminal in SAM format. You won't see it if you are running that as a loop, or even as a shell script as they are run as separate processes.

Here's a test that I did on my own computer

./kallisto quant -i test/transcripts -o . --pseudobam \
test/reads_1.fastq.gz test/reads_2.fastq.gz | head -23

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 15
[index] number of k-mers: 18,902
[index] number of equivalence classes: 22
[quant] running in paired-end mode
[quant] will process pair 1: test/reads_1.fastq.gz
                             test/reads_2.fastq.gz
[quant] finding pseudoalignments for the reads ...@HD   VN:1.0
@SQ SN:NM_001168316 LN:2283
@SQ SN:NM_174914    LN:2385
@SQ SN:NR_031764    LN:1853
@SQ SN:NM_004503    LN:1681
@SQ SN:NM_006897    LN:1541
@SQ SN:NM_014212    LN:2037
@SQ SN:NM_014620    LN:2300
@SQ SN:NM_017409    LN:1959
@SQ SN:NM_017410    LN:2396
@SQ SN:NM_018953    LN:1612
@SQ SN:NM_022658    LN:2288
@SQ SN:NM_153633    LN:1666
@SQ SN:NM_153693    LN:2072
@SQ SN:NM_173860    LN:849
@SQ SN:NR_003084    LN:1640
@PG ID:kallisto PN:kallisto VN:0.43.1
1:NM_014620:16:182  99  NM_014620   17  255 50M =   149 182 GTTCCGAGCGCTCCGCAGAACAGTCCTCCCTGTAAGAGCCTAACCATTGC  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NH:i:3
1:NM_014620:16:182  355 NM_153693   17  255 50M =   149 182 GTTCCGAGCGCTCCGCAGAACAGTCCTCCCTGTAAGAGCCTAACCATTGC  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NH:i:3
1:NM_014620:16:182  355 NR_003084   17  255 50M =   149 182 GTTCCGAGCGCTCCGCAGAACAGTCCTCCCTGTAAGAGCCTAACCATTGC  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NH:i:3
1:NM_014620:16:182  147 NM_014620   149 255 50M =   17  -182    TAATTTTTTTTCCTCCCAGGTGGAGTTGCCGAAGCTGGGGGCAGCTGGGG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NH:i:3
1:NM_014620:16:182  403 NM_153693   149 255 50M =   17  -182    TAATTTTTTTTCCTCCCAGGTGGAGTTGCCGAAGCTGGGGGCAGCTGGGG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NH:i:3
1:NM_014620:16:182  403 NR_003084   149 255 50M =   17  -182    TAATTTTTTTTCCTCCCAGGTGGAGTTGCCGAAGCTGGGGGCAGCTGGGG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NH:i:3

It still also produces the standard output files, as you've noticed:

ls -l
total 17084
-rw-rw-r-- 1 kblighe kblighe    35848 Apr 29 13:27 abundance.h5
-rw-rw-r-- 1 kblighe kblighe      591 Apr 29 13:27 abundance.tsv
-rwxr-xr-x 1 kblighe kblighe 17435774 Mar 20  2017 kallisto
-rw-r--r-- 1 kblighe kblighe     1366 Aug  1  2017 license.txt
-rw-r--r-- 1 kblighe kblighe     2253 Mar 20  2017 README.md
-rw-rw-r-- 1 kblighe kblighe      270 Apr 29 13:27 run_info.json
drwxr-xr-x 2 kblighe kblighe     4096 Apr 29 13:25 test

------------------------

So, in your loop, you will have to create a redirection into samtools to save as SAM/BAM

./kallisto quant -i test/transcripts -o . --pseudobam -b 100 \
test/reads_1.fastq.gz test/reads_2.fastq.gz | samtools view -bS - > align.bam

ls -l
total 17796
-rw-rw-r-- 1 kblighe kblighe    35848 Apr 29 13:50 abundance.h5
-rw-rw-r-- 1 kblighe kblighe      591 Apr 29 13:50 abundance.tsv
-rw-rw-r-- 1 kblighe kblighe   728763 Apr 29 13:50 align.bam
-rwxr-xr-x 1 kblighe kblighe 17435774 Mar 20  2017 kallisto
-rw-r--r-- 1 kblighe kblighe     1366 Aug  1  2017 license.txt
-rw-r--r-- 1 kblighe kblighe     2253 Mar 20  2017 README.md
-rw-rw-r-- 1 kblighe kblighe      270 Apr 29 13:50 run_info.json
drwxr-xr-x 2 kblighe kblighe     4096 Apr 29 13:25 test

For you:

kallisto quant -i "${kallistoIdx}" -o "${outputFileLoc}"/"${sample}" --pseudobam \
-b 100 "${Read1}" "${Read2}" | samtools view -bS - > "${sample}".bam

Kevin

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Kevin Blighe63k
1

Kallisto manual says

--pseudobam outputs all pseudoalignments to a file pseudoalignments.bam in the output directory.

Based on your test (and per OP) it does not happen in reality. Perhaps the behavior changed somewhere along the line but the manual was not updated.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax87k

Aha, I see! In fact, I did see the 'bam content' in my output as you predicted, not a separate pseudoalignments.bam file:

>> head kall_pipeScratch_1192561.out

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 176,984
[index] number of k-mers: 100,763,418
[index] number of equivalence classes: 671,993
[quant] running in paired-end mode
[quant] will process pair 1: batch2_pool1/P29-CGATGT_S1_R1_001.fastq
                             batch2_pool1/P29-CGATGT_S1_R2_001.fastq
[quant] finding pseudoalignments for the reads ...NS500735:535:H5KKMBGX3:1:12104:9668:6289  419 ENST00000543413 1154    255 76M =   1196    118 TGGAAGATTCAGATGGAGCTGTATTGTGTATGGATGCAAAGATCAATTTTGACTCTAATTCAGCCTATCGCCAAAA    AAAAAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEAEE<EEEEEEEEEEEE    NH:i:10

Thank you so much for suggesting the pipe method, will give it a try...

ADD REPLYlink written 2.3 years ago by Kristin Muench520

The documentation on their website is contradictory. Please see the thread with cpad (above). I am guessing that you are using a version pre-0.44, like me.

ADD REPLYlink written 2.3 years ago by Kevin Blighe63k

Just curious. Would the output of kallisto be fed into Samtools and possibly cause incorrect format?

For example this trunk of text:

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 15
[index] number of k-mers: 18,902
[index] number of equivalence classes: 22
[quant] running in paired-end mode
[quant] will process pair 1: test/reads_1.fastq.gz
                             test/reads_2.fastq.gz
[quant] finding pseudoalignments for the reads ...@HD   VN:1.0
ADD REPLYlink modified 20 months ago • written 20 months ago by izzy.yichao.cai170
1

In bioinformatics, anything is possible (and usually it is bad). What was your initial command?

ADD REPLYlink written 20 months ago by Kevin Blighe63k

Hi Kelvin,

I am using a similar command as the one in your answer:

## kallisto version 0.43.1
kallisto quant -i /path/to/index/hshg19_ensembl --pseudobam -b 100 --rf-stranded -o ./kallistoResults seqFiles/R1.fastq.gz seqFiles/R2.fastq.gz | samtools view -bS - > kallistoResults/pseudo-alignment.bam

Upon testing, the resulting bam seems to ignore the output by kallisto and only contain the psudoalignment output.

ADD REPLYlink modified 20 months ago • written 20 months ago by izzy.yichao.cai170

Yes, the idea is that it only outputs a pseudobam, which may not be compatible with downstream programs. I think that most people want to output a pseudobam from Kallisto for the purpose of viewing in IGV.

...or are you referring to something else? Can you paste the BAM header and the first few reads?

ADD REPLYlink written 20 months ago by Kevin Blighe63k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1188 users visited in the last hour