Run a prediction program on an entire folder
5.5 years ago
enkh.tug • 0

Hello!

I need to run a Python script on a whole proteome, which is ~3600 proteins. To run it I need two separate files per protein: a FASTA sequence file and a profile file. Both have the same file name, based on their headers. How do I run it on two folders (one with the FASTA sequences and the other with the profile files)? Do I have to modify the script, or write a new one?

protein prediction • 1.7k views

Search for "bash loop files" and you'll find many hints, e.g., this one.


This should be reasonably straightforward to solve. Can you show the naming pattern of the fasta sequences and profile files? How do you know those files match to run the tool? Did you write the script yourself?


The script was not written by me. The FASTA files are fine, and the profile files were generated by the script's creator. The naming pattern is C_PROKKA_00001 to C_PROKKA_03211 for the genome and pGX1_PROKKA for the plasmids.


So if I understood correctly, you have 3211 files named C_PROKKA_00001 up to C_PROKKA_03211. And what about the plasmids? I don't see that you mentioned those before. Is it just one file? Does every C_PROKKA file have to be run together with pGX1_PROKKA?


I have 3151 chromosomal genome sequences and 521 plasmid sequences. I had a multi-FASTA file and split it based on the headers. Each FASTA sequence has a corresponding profile file, and the two have to be run together. Besides, I need to collect the output in one text file, but if I set an output file and then run a second sequence, it overwrites the file. Thank you for your interest in my problem.


I am not sure how you are using your command line, but in general the >> redirection operator appends data to an existing file instead of overwriting it.
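As a minimal sketch of the difference, with `echo` standing in for the prediction script (the output file name is made up for this demonstration):

```shell
# Hypothetical output file, used only for this demonstration.
out="${TMPDIR:-/tmp}/results_demo.txt"
rm -f "$out"

# '>' truncates the file on every run, so only the last result survives.
echo "result for protein 1" > "$out"
echo "result for protein 2" > "$out"
overwritten_lines=$(wc -l < "$out")

# '>>' appends, so each run adds to what is already in the file.
rm -f "$out"
echo "result for protein 1" >> "$out"
echo "result for protein 2" >> "$out"
appended_lines=$(wc -l < "$out")

echo "with >: $overwritten_lines line(s), with >>: $appended_lines line(s)"
```

When appending, remember to delete or empty the output file before a fresh run, otherwise results from earlier runs stay in it.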


Can you post an example of the names of

• a chromosomal genome sequence and its profile file
• a plasmid sequence and its profile file

It should be fairly easy to do this; for example, see the answer by ole.tange for a GNU Parallel solution.


To set clear expectations: if you are doing this on a single computer, a bash loop will still run the jobs serially, one after another. Depending on how heavy the computation is (and how many cores you have available on this machine), you could look into using GNU Parallel to speed up the process to some extent.
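A serial loop of that kind could look roughly like the sketch below. It assumes the FASTA files end in `.fa` and the profile files in `.profile`, which the question does not confirm, and the function name `run_all` is made up; adjust both to the real files.

```shell
# Run a command once per FASTA file, pairing each file with the profile
# that has the same base name. All results go to stdout.
# Usage: run_all FASTA_DIR PROFILE_DIR COMMAND [ARGS...]
# Assumption: extensions are .fa and .profile (not confirmed in the thread).
run_all() {
    local fasta_dir=$1 profile_dir=$2
    shift 2                              # remaining words are the command to run
    local fasta name profile
    for fasta in "$fasta_dir"/*.fa; do
        [ -e "$fasta" ] || continue      # glob matched nothing
        name=$(basename "$fasta" .fa)
        profile="$profile_dir/$name.profile"
        if [ -f "$profile" ]; then
            "$@" "$fasta" "$profile"
        else
            echo "no profile for $name, skipping" >&2
        fi
    done
}

# Example call (my.py is a placeholder for the real script):
#   run_all fasta profile python my.py > all_results.txt
```

Because the redirection is applied once to the whole function call, every result lands in the same file and nothing is overwritten between proteins.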

5.5 years ago
ole.tange ★ 4.2k

I assume you can run:

my.py fasta/protein314.fa profile/protein314.profile


I also assume that the dir fasta and the dir profile contain nothing else and that corresponding files have the same name (apart from the extension):

parallel my.py ::: fasta/* :::+ profile/*


The :::+ requires a fairly recent version of GNU Parallel. If you only have an older version:

parallel --xapply my.py ::: fasta/* ::: profile/*
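Both commands pair the files purely by their position in the sorted file lists, so a single missing or extra file shifts every later pair. A quick check before launching might look like this sketch, which assumes the `.fa` and `.profile` extensions from the example above (the function name `check_pairs` is made up):

```shell
# Report names that exist in only one of the two directories.
# Returns 0 only if every FASTA file has a profile and vice versa.
# Assumption: extensions are .fa and .profile, as in the example above.
check_pairs() {
    local fasta_dir=$1 profile_dir=$2 f status=0
    for f in "$fasta_dir"/*.fa; do
        [ -e "$f" ] || continue
        f=$(basename "$f" .fa)
        [ -e "$profile_dir/$f.profile" ] || { echo "missing profile: $f"; status=1; }
    done
    for f in "$profile_dir"/*.profile; do
        [ -e "$f" ] || continue
        f=$(basename "$f" .profile)
        [ -e "$fasta_dir/$f.fa" ] || { echo "missing fasta: $f"; status=1; }
    done
    return "$status"
}

# Example: only start the parallel run if the two directories line up.
#   check_pairs fasta profile && parallel my.py ::: fasta/* :::+ profile/*
```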