Question: Run Prediction program on entire folder.
enkh.tug wrote, 2.3 years ago:

Hello!

I need to run a Python script on a whole proteome, which is ~3600 proteins. To run it I need two separate files: a fasta sequence file and a profile file. Both have the same file name, based on their headers. How do I run it on two folders (one with the fasta sequences and the other with the profile files)? Do I have to modify the script, or write a new one?

protein prediction • 833 views

Search for "bash loop files" and you'll find many hints, e.g., this one.

Reply written 2.3 years ago by h.mon

This should be reasonably straightforward to solve. Can you show the naming pattern of the fasta sequences and profile files? How do you know those files match to run the tool? Did you write the script yourself?

Reply written 2.3 years ago by WouterDeCoster

The script was not written by me. The fasta files are good, and the profile files were generated by the script's creator. The naming pattern is: C_PROKKA_00001 - C_PROKKA_03211 for the genome and pGX1_PROKKA for plasmids.

Reply written 2.3 years ago by enkh.tug

So if I understood correctly, you have 3211 files named C_PROKKA_00001 up to C_PROKKA_03211. And what about the plasmids? I don't see that you mentioned those before. Is it just one file? Does every C_PROKKA file have to be run together with the pGX1_PROKKA file?

Reply written 2.3 years ago by WouterDeCoster

I have 3151 chromosomal genome sequences and 521 plasmid sequences. I had a multifasta file and split it based on the headers. Each fasta sequence has a corresponding profile file, and they have to be run together. Besides, I need to concatenate the output into one text file, but if I set an output file and run a second sequence, it overwrites it. Thank you for your interest in my problem.

Reply written 2.3 years ago by enkh.tug

I am not sure how you are using your command line, but generally the >> redirection operator will append data to an existing file instead of overwriting it.
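As a rough illustration of how a loop and >> fit together, here is a minimal sketch. It assumes the predictor prints its result to standard output and that the files are named NAME.fa and NAME.profile; those extensions, the function name run_all and the output file name are hypothetical, not taken from the original poster's setup.

```shell
#!/usr/bin/env bash
# Sketch only: run_all, the .fa/.profile extensions and the output file
# name are illustrative assumptions.

# run_all FASTA_DIR PROFILE_DIR OUT_FILE COMMAND [ARGS...]
# Runs COMMAND once per fasta file, passing the fasta file and the
# profile file with the same base name, and collects all output in OUT_FILE.
run_all() {
    local fasta_dir=$1 profile_dir=$2 out=$3
    shift 3                     # what is left in "$@" is the predictor command
    local fasta name profile

    : > "$out"                  # empty the output file once, before the loop

    for fasta in "$fasta_dir"/*.fa; do
        [ -e "$fasta" ] || continue            # the glob matched nothing
        name=$(basename "$fasta" .fa)          # e.g. C_PROKKA_00001
        profile="$profile_dir/$name.profile"

        if [ ! -f "$profile" ]; then
            echo "no profile for $name, skipping" >&2
            continue
        fi

        # >> appends, so every run adds to OUT_FILE instead of replacing it
        "$@" "$fasta" "$profile" >> "$out"
    done
}
```

It could be called as, for example, run_all fasta profile all_predictions.txt python my.py. Pairing by base name rather than by position means a missing profile is reported and skipped instead of silently shifting every later pair.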

Reply written 2.3 years ago by genomax

Can you post an example of the names of

  • a chromosomal genome sequence and its profile file
  • a plasmid sequence and its profile file

It should be fairly easy to do this, for example see the answer of ole.tange for a gnu-parallel solution.

Reply written 2.3 years ago by WouterDeCoster

To set clear expectations: if you are doing this on a single computer, a bash loop will still run the jobs serially. Depending on how heavy the computation is in this case (and how many cores you have available on this machine), you could look into using GNU Parallel to expedite this process to some extent.

Reply written 2.3 years ago by genomax
ole.tange (Denmark) wrote, 2.3 years ago:

I assume you can run:

my.py fasta/protein314.fa profile/protein314.profile

I also assume that the dir fasta and the dir profile contain nothing else and that the corresponding files are named the same:

parallel my.py ::: fasta/* :::+ profile/*

The :::+ requires a fairly recent version of GNU Parallel. If you only have an older version:

parallel --xapply my.py ::: fasta/* ::: profile/*
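
Both commands pair the n-th fasta file with the n-th profile file, so they depend on the two directories listing in the same order with nothing missing. A minimal sketch of a check for that, plus a variant that pairs by name instead of by position, might look like this. The function names check_pairs and run_paired are hypothetical, and the .fa and .profile extensions are assumed from the example above.

```shell
#!/usr/bin/env bash
# Sketch only: check_pairs and run_paired are illustrative names.

# check_pairs FASTA_DIR PROFILE_DIR
# Succeeds only if both directories hold the same set of names once the
# last extension is removed.
check_pairs() {
    local fasta_names profile_names
    fasta_names=$(cd "$1" && for f in *; do echo "${f%.*}"; done | sort)
    profile_names=$(cd "$2" && for f in *; do echo "${f%.*}"; done | sort)
    [ "$fasta_names" = "$profile_names" ]
}

# run_paired FASTA_DIR PROFILE_DIR OUT_FILE
# Pairs by name instead of by position: {} is the fasta path and {/.} is
# its basename without the extension (GNU Parallel replacement strings).
# -k keeps the output in input order, and GNU Parallel prints each job's
# output as a whole, so a single redirect collects everything in one file.
run_paired() {
    parallel -k my.py {} "$2"/{/.}.profile ::: "$1"/*.fa > "$3"
}
```

For example: check_pairs fasta profile && run_paired fasta profile all_predictions.txt. This also covers the earlier request to gather all results in one text file.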