A general question about bash command
2.0 years ago

Hi. I am trying to align RNA-seq data with kallisto, but I think the answer to my question may apply to other similar situations.

I have 10 fastq.gz files, and I asked kallisto to process all of them at the same time with *.gz. It returned only 2 files: a .tsv and a .json,

whereas if I ran kallisto 10 times, once for each fastq.gz file, it would return 10 .tsv and 10 .json files, right?

I am wondering whether the information I get in these two situations is the same, and whether either way is better for my downstream analysis.

Many thanks for your help.

RNA-seq GNU kallisto • 854 views
2.0 years ago

Think about what each approach does: one combines all the data into a single output, while the other computes separate counts for each file. Which one is right depends on what you need.

In general, it is likely you need to keep things separate to compare counts between files.
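
Keeping the samples separate can be sketched as a loop that gives each file its own output directory. The function below only prints the commands (a dry run) rather than running kallisto. It assumes single-end reads, and the function name `print_quant_cmds`, the index name, the `quant/` directory, and the `-l 200 -s 20` fragment length values are placeholders for illustration, not taken from the question.

```shell
# Dry-run sketch: print one kallisto command per FASTQ file instead of running it.
# Assumptions: single-end reads and placeholder fragment length settings
# (-l 200 -s 20). Adjust these for real data.
print_quant_cmds() {
    local fastq_dir=$1 index=$2
    local fq sample
    for fq in "$fastq_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue            # skip if the glob matched nothing
        sample=$(basename "$fq" .fastq.gz)  # e.g. sampleA.fastq.gz -> sampleA
        # One output directory per sample, so the output files never collide.
        echo "kallisto quant -i $index -o quant/$sample --single -l 200 -s 20 $fq"
    done
}
```

Calling `print_quant_cmds . transcripts.idx` in the directory holding the reads would list one command per file. For paired-end data, the two mate files would be passed together and `--single -l -s` dropped.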


Yes. I just wanted to know if either strategy is preferred, because this is the first time I am doing this. I'll keep them separate then. Thank you Istvan.


Istvan's answer is correct, but I should add a technical note: there is a way to run all samples together in a single kallisto run while maintaining sample identity. This involves using the kallisto | bustools workflow (which I imagine will eventually be the standard workflow for running kallisto, even for bulk). This is advantageous in cases where you want to preserve the "raw" kallisto output, which consists of equivalence classes associated with transcript-compatibility counts (TCCs). Equivalence classes differ between kallisto runs, so if you're interested in TCCs (e.g. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0970-8 ), then you should run all your files together in a single batch.

(If this is confusing, don't worry about it)


Thanks dsull. Great to know about that. Much appreciated.

2.0 years ago
GenoMax 141k

The answer here is "check what the specific program does", that is, how it operates. The first approach may be fine if the program creates separate output files/folders for every sample (I don't know how kallisto operates). An example of this would be running fastqc on a bunch of files by simply doing fastqc *. Doing this may run the jobs serially, taking longer to complete, but you would still get all the results at the end.
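
It can help to remember that the shell expands a wildcard such as `*` into a list of file names before the program starts, so the program is invoked once with every matching file as a separate argument. A minimal sketch, where `show_args` is a hypothetical helper standing in for any real program:

```shell
# show_args is a hypothetical stand-in for a real program such as fastqc.
# It reports how many arguments it received and lists them one per line.
show_args() {
    echo "received $# argument(s)"
    printf '  %s\n' "$@"
}

# The shell expands the glob first, so this is a single invocation
# with every matching file passed as its own argument:
#   show_args *.fastq.gz
```

What happens to those arguments afterwards (one combined result, or one result per file) is entirely up to the program.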

On the other hand, a program like salmon creates output files with the same names for every sample. So if you ran it without segregating the output into folders, you would end up with a single set of output files (which may be mangled). There you would want to use the second option and specify an output directory for each run. Running the processes in parallel would allow you to complete the analysis more quickly, within the constraints of the hardware/software (e.g. a job scheduler on a cluster) you are working with.
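
The per-sample output directories and parallel execution described above could look roughly like this. It is a dry run that only prints salmon commands. The function name `print_salmon_cmds`, the `salmon_index` and `salmon_out/` names, and the single-end `-r` input are assumptions for illustration. `xargs -P` provides local parallelism here; on a cluster, the job scheduler would take that role.

```shell
# Dry-run sketch: emit one salmon command per FASTQ file, up to $jobs at a time.
# Assumptions: single-end reads (-r), automatic library type detection (-l A),
# and placeholder names for the index and output directories.
print_salmon_cmds() {
    local fastq_dir=$1 jobs=$2
    find "$fastq_dir" -maxdepth 1 -name '*.fastq.gz' -print0 |
        xargs -0 -n 1 -P "$jobs" sh -c '
            fq=$1
            [ -n "$fq" ] || exit 0              # nothing to do on empty input
            sample=$(basename "$fq" .fastq.gz)  # sample name from the file name
            # A separate -o directory per sample keeps the outputs apart.
            echo "salmon quant -i salmon_index -l A -r $fq -o salmon_out/$sample"
        ' sh
}
```

For example, `print_salmon_cmds . 4` would process the files in the current directory four at a time. Because the jobs run concurrently, the order of the printed lines is not guaranteed.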
