Question: Build a Kallisto transcriptome index
11 months ago by
F. Golestan60
F. Golestan60 wrote:

Hello,

I need to pseudo-align my paired-end reads to a transcriptome using Kallisto. I understand that Kallisto does not align to a reference genome sequence; instead, it performs pseudo-alignment to determine which targets (e.g. transcript sequences) each read is compatible with.

However, to build a Kallisto transcriptome index, how do I choose the reference transcriptome for my targets, which are human and Cassava Brown Streak Virus?

That is, when running the commands below to create the Kallisto index from a transcriptome, should I specify which transcriptome I want to use (e.g. human or Cassava Brown Streak Virus)? If so, how do I know which transcriptome is appropriate for my target organisms?

cd
kallisto index -i Potra01-mRNA.idx \
~/share/Day01/data/reference/fasta/Potra01-mRNA.fa.gz

Thank you so much for your advice and guidance. Best wishes

ADD COMMENTlink modified 11 months ago by Lior Pachter520 • written 11 months ago by F. Golestan60
11 months ago by
Lior Pachter520
United States
Lior Pachter520 wrote:

It sounds like your goal is to build an index from both the human and the Cassava Brown Streak Virus transcriptomes at the same time. You can do this by obtaining the transcriptome for each separately and then building a single index from both files: kallisto index -i name.idx human.fa.gz cassava_brown_streak_virus.fa.gz. You can then quantify reads against both simultaneously.
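
A minimal sketch of the combined-index workflow described above, followed by paired-end quantification. All file names, the index name, and the output directory are placeholders rather than files from this thread, and `run` is a hypothetical helper that only prints each command unless `DRY_RUN=0` is set, so the commands can be checked before anything is executed.

```shell
#!/bin/sh
# Placeholder names (assumptions): substitute your own files.
HUMAN_FA="human.fa.gz"
VIRUS_FA="cbsv.fa.gz"
INDEX="human_cbsv.idx"
READS_1="reads_1.fastq.gz"
READS_2="reads_2.fastq.gz"
OUT_DIR="quant_out"

# Hypothetical helper: print the command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Build one index from both transcriptome FASTA files.
run kallisto index -i "$INDEX" "$HUMAN_FA" "$VIRUS_FA"

# Quantify one paired-end sample against the combined index.
run kallisto quant -i "$INDEX" -o "$OUT_DIR" "$READS_1" "$READS_2"
```

Running it with `DRY_RUN=0` executes the two kallisto commands for real, which requires kallisto on the PATH and the input files to exist.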

ADD COMMENTlink written 11 months ago by Lior Pachter520

The actual FASTA files can be downloaded from public databases such as Ensembl, as described here and here. Look for the cDNA part of the file name, since you want to limit yourself to the transcribed sequences rather than the whole genome.
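
Once a file is downloaded, a quick check of the record count and the first few headers can confirm that it holds transcript sequences. The sketch below is self-contained: it writes a tiny made-up gzipped FASTA (the records `tx1` and `tx2` are invented for illustration) and inspects it. With a real download, point `FA` at that file and drop the `printf` line.

```shell
#!/bin/sh
# Toy stand-in for a downloaded cDNA FASTA (made-up records, illustration only).
tmp=$(mktemp -d)
FA="$tmp/toy_cdna.fa.gz"
printf '>tx1 cdna\nACGTACGT\n>tx2 cdna\nGGCCTTAA\n' | gzip -c > "$FA"

# Each sequence has exactly one header line starting with '>'.
n_records=$(gzip -dc "$FA" | grep -c '^>')
echo "records: $n_records"

# Peek at the first few headers to see what kind of IDs the file contains.
gzip -dc "$FA" | grep '^>' | head -n 3
```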

ADD REPLYlink modified 11 months ago • written 11 months ago by Friederike5.7k

Thanks a lot, Friederike, for your guidance. After downloading the transcriptome FASTA files, would the name of each FASTA file go in place of the fa.gz arguments? And what about name.idx?

Many thanks.

ADD REPLYlink written 11 months ago by F. Golestan60

I believe Lior was just trying to indicate that you can give the resulting index whatever name you want by putting it after -i.

I.e., if you want two indices, one for the human and one for the virus cDNA collection, you will run the command twice:

kallisto index -i my_human_index.idx name_of_the_fasta_file_for_the_human_cDNA_collection.gz # generates the index to be used with the human samples

kallisto index -i my_virus_index.idx name_of_the_fasta_file_for_the_virus_cDNA_collection.gz # generates the index to be used with the virus samples

ADD REPLYlink written 11 months ago by Friederike5.7k

Many thanks, Friederike. I found FASTA files for human and also for plants, and I built indices for them. However, I could not find a transcriptome FASTA file for Cassava Brown Streak Virus or a closely related species (TAN70 virus). I would highly appreciate it if you could tell me where I can get one.

Many thanks.

ADD REPLYlink written 11 months ago by F. Golestan60

Sorry, I've never had to download a viral cDNA index, so I'd have to resort to the usual tools (google etc.) just like you.

ADD REPLYlink written 11 months ago by Friederike5.7k

OK. Thank you very much Friederike.

ADD REPLYlink written 11 months ago by F. Golestan60

Thank you very much, Lior. In fact, I want to build separate indices for human and for the Cassava Brown Streak Virus. I have two different RNA-seq datasets (one for human and another for Cassava Brown Streak Virus). How can I obtain the transcriptomes for human and for the Cassava Brown Streak Virus separately?

Also, what exactly should I write for name.idx and for the two fa.gz files in each case?

Many thanks for the help.

ADD REPLYlink written 11 months ago by F. Golestan60

I tried this first, but it produced an index of around 2.3 GB that failed in the kallisto quant step (with weird errors such as bazillions of equivalence classes, running out of memory, etc.):

kallisto index -i GRch38_GRCm38_cdna.idx Homo_sapiens.GRCh38.cdna.all.fa.gz Mus_musculus.GRCm38.cdna.all.fa.gz 

But then I unzipped and concatenated the files and tried again. This gave an index of 4.4 GB, and it worked with the kallisto quant step:

gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz
gunzip Mus_musculus.GRCm38.cdna.all.fa.gz
cat Homo_sapiens.GRCh38.cdna.all.fa Mus_musculus.GRCm38.cdna.all.fa > GRch38_GRCm38_cdna.fa
kallisto index -i GRch38_GRCm38_cdna.idx GRch38_GRCm38_cdna.fa
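
A variation on the concatenation step, sketched under the assumption that you want to keep the original .gz downloads: `gzip -dc` writes the decompressed content to standard output and leaves the compressed files in place, whereas plain `gunzip` replaces them. The example uses two tiny made-up FASTA files (the IDs `human_tx1`, `human_tx2`, and `mouse_tx1` are invented) and also checks for transcript IDs that occur more than once in the combined file.

```shell
#!/bin/sh
# Toy inputs (made-up records, illustration only).
tmp=$(mktemp -d)
printf '>human_tx1\nACGT\n>human_tx2\nGGCC\n' | gzip -c > "$tmp/human.fa.gz"
printf '>mouse_tx1\nTTAA\n' | gzip -c > "$tmp/mouse.fa.gz"

# Decompress both files to stdout and concatenate them;
# the .gz originals are left untouched.
gzip -dc "$tmp/human.fa.gz" "$tmp/mouse.fa.gz" > "$tmp/combined.fa"

# Count the records in the combined FASTA.
n_combined=$(grep -c '^>' "$tmp/combined.fa")
echo "records: $n_combined"

# List any sequence ID (first word of the header) that appears more than once.
dup_ids=$(grep '^>' "$tmp/combined.fa" | cut -d' ' -f1 | sort | uniq -d)
echo "duplicate IDs: ${dup_ids:-none}"
```

The record count of the combined file should equal the sum of the counts of the inputs; a mismatch suggests a truncated download or a failed decompression.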

Also, don't forget about the 'kallisto inspect' feature. It was helpful to run 'kallisto inspect' on the new index to check it without having to do a full kallisto quant run:

kallisto inspect GRch38_GRCm38_cdna.idx
ADD REPLYlink modified 4 months ago • written 4 months ago by Dylan Richards0