Hi! I have tried searching for answers to this question, but all of the answers I find are written in a fashion that is out of my league. Most of the people asking the questions, in fact, know more than I do. When it comes to programming, I am a complete and utter noob.
I am trying to automate Trimmomatic to cut the adapters and improve quality on more than just one set of reads. I have quite a large set of forward and reverse RNA-seq fq.gz files and I need to process them. Currently, I have this code:
java -jar /home/Trimming/trimmomatic-0.38.jar PE -threads 24 -phred33 XX_2_1_1.fq.gz XX_2_1_2.fq.gz F2_1_paired.fq.gz F2_1_unpaired.fq.gz R2_1_paired.fq.gz R2_1_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:5 &> trim.log &
I understand what this code means in terms of context and am fine running it for a single set of forward and reverse reads (here they are named XX_2_1_1.fq.gz and XX_2_1_2.fq.gz respectively.)
For any other total noobs like me reading this, my output files are F2_1_paired.fq.gz F2_1_unpaired.fq.gz for the forward reads and R2_1_paired.fq.gz R2_1_unpaired.fq.gz for the reverse reads. Problem is, those are not my only read sets. To avoid having to run this process over and over again, I would like to find a way to allow it to find all my reads that need trimming and then execute it while I do other things.
Unfortunately, as a non-programmer, I can't just write or modify code at will. I know how to automate, for example, FastQC by using parallel in this manner, where I find any files with a certain word or fragment in them and then bulk FastQC them:
find *paired*.gz | parallel 'fastqc {}' &
This is not directly transferable though. I can't just wildcard XX and then parallel everything to Trimmomatic, because unlike FastQC, Trimmomatic requires me to specify names for output files. Can someone please help me come up with code that will enable me to do this? I would like to learn it as well - understand what it means and the logical thought process that went into the design of it.
Thank you very much in advance!
Hi! Thank you for this suggestion. So if I understand correctly, I will use the following code?
Am I correct? I was told the Trimmomatic program requires java -jar plus the path name of its location as a prefix.
Not sure where you get the ${f} from; only the pattern {} is needed.
Install trimmomatic with conda; now you don't need to worry about jars etc. In addition, set the --baseout option; now you don't need to name the files anymore, they will automatically gain the desired filename prefix.
As for the question, the way to troubleshoot GNU Parallel commands is to put an echo before the command that you are trying to execute. That way you can see what the command looks like when it would be submitted to the computer. Here I put an echo in front of the command java foo that, say, I'd want to troubleshoot:
When the above runs it will print:
The lines above are what parallel would execute if there was no echo command in front. If the commands above are correct, remove the echo.
The same result as echo is obtained using parallel --dryrun.
I can get rid of the $ symbols, no problem.
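For readers following along, the echo check described above can be sketched roughly like this. The sample IDs, the `_1.fq.gz` / `_2.fq.gz` suffixes, and the `java foo` command are placeholders for illustration, not real files or a real program. The GNU Parallel part runs only if GNU Parallel is actually installed.

```shell
#!/usr/bin/env bash
# Hypothetical sample IDs, used only for illustration.
ids="X2_C1 X2_C2 X2_C3"

# Plain-shell dry run: echo prints each command instead of executing it.
dry_run=$(for id in $ids; do
    echo java foo "${id}_1.fq.gz" "${id}_2.fq.gz"
done)
printf '%s\n' "$dry_run"

# The same check with GNU Parallel, if it is available.
# {} is replaced by each input line; echo prints the resulting command.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    printf '%s\n' $ids | parallel echo java foo {}_1.fq.gz {}_2.fq.gz
fi
```

Once the printed lines look like the commands you intend to run, removing the echo turns the dry run into the real thing.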
Conda... is it like pip for Python? It seems to be a package-installing tool, and that won't be of much use to me, since I don't have permissions to install any packages on the server. For example, my attempts to install cutadapt resulted in utter failure, since pip is not available to me as a student. I think conda is something similar? I am not sure; this is my first experience encountering either. As long as the files are .dmg, I can transfer them to folders and run them, but packages that aren't already on the server are out of reach for me. Trimmomatic is a dmg file, so I could install it locally and then transfer it to the server. I think. Filezilla does allow transfers. I can't do any of this locally, since the files are too big, so I am constrained to the school's server space and all the limitations that come with it.
Can I still use jar? And the baseout option - where would I put it so that it would work? And do I just write --baseout?
This is another issue I have - often the guides for a package or program will list options containing symbols other than those I actually need to write, for illustration purposes. I have no way of discerning proper code from improper code, so I will often copy as is and run into trouble. Example: --someoption <filename>
I would instinctively keep the brackets around the filename, but once, when I did that, it was utterly wrong. Similarly, I have found other symbols that commonly mean something in code, used for other purposes (mainly in guides), like the | (pipe) symbol, even though piping was not called for. I have no way of identifying it as non-code and take it literally. Does that make sense?
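A small sketch of the placeholder convention in question, assuming a made-up tool called sometool and a made-up file called reads.fq.gz (neither comes from this thread). The command is only echoed, so nothing is actually run.

```shell
#!/usr/bin/env bash
# A guide might document an option as:   --someoption <filename>
# The angle brackets only mark a placeholder. You type the value, not the brackets.
filename="reads.fq.gz"    # made-up file name for illustration
typed=$(echo sometool --someoption "$filename")
echo "$typed"

# Typed literally, as in
#   sometool --someoption <filename>
# the shell treats "<filename" as "read input from a file called filename"
# and the trailing ">" as an output redirection with no target, so the line
# is rejected. That is why keeping the brackets goes wrong.
#
# Similarly, in usage text something like [-phred33|-phred64] usually means
# "optionally pick one of these"; the brackets and the | are not typed.
```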
It might be useful for you to join our Biostars Slack group, biostar.slack.com (chat for the Biostars community).
Sure, why not? Anything that helps.
conda is like pip, yes, only much, much easier to use (especially if you're a noob).
It's also useful for precisely the reason that you don't have permission to install things, since it does everything inside your user directory. No special permissions needed. In fact, it may be one of the only ways you can install things as a non-admin user, so I do strongly urge you to try using it. The setup is very, very simple, I promise.
You just have to learn the conventions. The real problem is that they aren't universal. It's common, but not always the case, to denote variables/arguments <like so>, implying that what's between the brackets is just a placeholder (inclusive of the brackets).
Istvan's suggestion of parallel is well worth learning, but it is a slightly more advanced technique, so there is perhaps an element of trying to run before walking here.
Okay, stupid noob question: Will Conda download packages remotely to the school server even though it is located on my computer? Or do I have to move it to my school user directory?
Are you trying to run these analyses on your laptop or on the server? (The short answer is that you could do both.)
Definitely not on my laptop - the files are too enormous and the computation too extreme for my poor computer. I only run things on the server. But I downloaded Anaconda2 to my laptop.
You need to download conda and install it into your account on the server at school. Things you install via conda will then live in your personal space on the server.
I don't have permissions to install anything to my server. All of those sudo commands online that I find do not work for me because they require a password. Leaving out sudo just gets it to ask me if I am root. I think I was utterly confused at how I got Trimmomatic to work as well. From what I can see now, it is a jar file, not a dmg.
You shouldn't need sudo to install conda, and afterwards you can install everything. But we don't know your particular server, of course. I've been doing fine without sudo permission for years :)
The advantage of conda is that it does not need root privileges. You can install things in your own home directory. See this guide: Creating workflows with snakemake and conda
No sudo should be needed.
I downloaded conda to my directory and tried to install Trimmomatic. But I got this message:
Solving environment: failed
I was so happy that I got conda to install... and then this.
EDIT: I got it to work with this:
I don't think this thread is the right place to learn about conda, troubleshoot your system, permissions etc.
In general, this site is not a forum for free-form back-and-forth chat. The threads should be focused on answering a specific question, and while some deviation is fine, this is now going towards the excessive.
If you have a new, specific question, it should be asked separately. This helps in keeping each page focused on a single topic.
This business with Conda is in response to your suggestion, however. Without conda, I will not be able to download Trimmomatic as a python package and can't try any of your code. While this thread may look chaotic, it has a flow. You suggested I run Trimmomatic without jar, someone else hinted at how I could do it and I am trying to get there. The end result will hopefully be me running parallel as a solution to Trimmomatic automation.
I now have conda and Trimmomatic installed as a package - I tried to run the code:
But it told me parallel: Warning: Input is read from the terminal. Only experts do this on purpose. Press CTRL-D to exit.
I kind of figured I would have a problem, even if I don't understand what kind, because I do not have an ids.txt file, nor do I know what one would put in an ids.txt file. Googling showed me nothing but highly confusing things that make no sense to me at this stage. I am also not sure how baseout works - in the manual it says to provide a base file name for the outputs and it will then just automatically name them basename + 1P, 1U, 2P and 2U for the four output files. Nice, but I have more than one set of reads and they don't all have the same base name. Do I just list the wildcard names for them?
EDIT: Is ids.txt just a list of the files I have? If so, do I just copy paste them into a textfile, regardless of formatting? I just made a file like this and saved it to the folder with the actual fq files (where I am doing my trimming). This time, the Trimmomatic process starts, but it tells me it cannot find any of my files and stops. The files are still there though.
UPDATE: Changed my ids.txt file around to contain only the unique prefixes, no file endings, listed on a separate line each, like this:
X2_C1
X2_C2
X2_C3
Then, I ran this code:
And it seems to be working. Now all I need to do is figure out the baseout thing. If anyone has any tips, I'd love that.
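On the baseout question, here is a minimal sketch rather than a definitive recipe. It assumes each pair is named <prefix>_1.fq.gz and <prefix>_2.fq.gz (the real suffixes for the X2_C* files are not shown in this thread) and that Trimmomatic was installed through conda, so it can be called as trimmomatic. Trimmomatic's own usage text spells the option with a single dash, -baseout, and lists it with the other options before the input files. From a value such as X2_C1_trimmed.fq.gz it derives X2_C1_trimmed_1P.fq.gz, _1U, _2P and _2U, so building the value from each prefix gives every read set its own output names. The sketch works in a temporary folder with empty dummy files and only echoes the commands.

```shell
#!/usr/bin/env bash
# Work in a scratch folder with empty dummy files standing in for real reads.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch X2_C1_1.fq.gz X2_C1_2.fq.gz X2_C2_1.fq.gz X2_C2_2.fq.gz \
      X2_C3_1.fq.gz X2_C3_2.fq.gz

# Build ids.txt: one prefix per line, made by stripping the assumed
# forward-read suffix "_1.fq.gz" from each forward file name.
for f in *_1.fq.gz; do
    echo "${f%_1.fq.gz}"
done > ids.txt

# One Trimmomatic command per prefix. echo prints instead of running;
# remove the echo once the printed commands look right.
commands=$(while read -r id; do
    echo trimmomatic PE -threads 24 -phred33 \
        -baseout "${id}_trimmed.fq.gz" \
        "${id}_1.fq.gz" "${id}_2.fq.gz" \
        ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:TRUE \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:5
done < ids.txt)
printf '%s\n' "$commands"
```

The same pattern carries over to GNU Parallel by writing {} wherever the loop uses ${id}, and feeding it ids.txt as input.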