Question: Automating Trimmomatic for total Noobs
1
gravatar for new15b
9 months ago by
new15b30
new15b30 wrote:

Hi! I have tried searching for answers to this question, but all of the answers I find are written in a fashion that is out of my league. Most of the people asking the questions, in fact, know more than I do. When it comes to programming, I am a complete and utter noob.

I am trying to automate Trimmomatic to cut the adapters and improve quality on more than just one set of reads. I have quite a large set of forward and reverse RNA-seq fq.gz files and I need to process them. Currently, I have this code:

java -jar /home/Trimming/trimmomatic-0.38.jar PE -threads 24 -phred33 XX_2_1_1.fq.gz XX_2_1_2.fq.gz F2_1_paired.fq.gz F2_1_unpaired.fq.gz R2_1_paired.fq.gz R2_1_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:5 &> trim.log &

I understand what this code means in terms of context and am fine running it for a single set of forward and reverse reads (here they are named XX_2_1_1.fq.gz and XX_2_1_2.fq.gz respectively.)

For any other total noobs like me reading this, my output files are F2_1_paired.fq.gz F2_1_unpaired.fq.gz for the forward reads and R2_1_paired.fq.gz R2_1_unpaired.fq.gz for the reverse reads. Problem is, those are not my only read sets. To avoid having to run this process over and over again, I would like to find a way to allow it to find all my reads that need trimming and then execute it while I do other things.

Unfortunately, as a non-programmer, I can't just write or modify code at will. I know how to automate, for example, FastQC by using parallel in this manner, where I find any files with a certain word or fragment in them and then bulk FastQC them:

find *paired*.gz | parallel 'fastqc {}' &

This is not directly transferable though. I can't just wildcard XX and then parallel everything to Trimmomatic, because unlike FastQC, Trimmomatic requires me to specify names for output files. Can someone please help me come up with code that will enable me to do this? I would like to learn it as well - understand what it means and the logical thought process that went into the design of it.

Thank you very much in advance!

rna-seq • 1.3k views
ADD COMMENTlink modified 9 months ago by Istvan Albert ♦♦ 81k • written 9 months ago by new15b30
5
gravatar for Istvan Albert
9 months ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

read up on how GNU parallel works and you will see that it can be made to generate inputs by groups for example

seq 1 10 | parallel -n 2 echo {1} {2}

will produce

1 2
3 4
5 6
7 8
9 10

which can solve your problem.

That being said don't run bioinformatics tools on file "patterns" like ls *.fq or find *.fq The order of the files may not be what you expect and will eventually cause major headaches.

As a rule try to compile lists of "root" names or ids that fully identify the files that you want to process. For example, say the root is SRR12234, then build your command around these known entities

cat ids.txt | parallel "trimmomatic {}_1.fq {}_2.fq" 
cat ids.txt | parallel "bwa mem hg38 {}_1.fq {}_2.fq > {}.sam"

and so on. Basically, it is you who should explicitly create the file names from known information rather than crawling the directory for file names. Sticking to this simple tip can save you a lot of trouble.

See more examples here Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

ADD COMMENTlink modified 9 months ago • written 9 months ago by Istvan Albert ♦♦ 81k

Hi! Thank you for this suggestion. So if I understand correctly, I will use the following code?

cat ids.txt | parallel "java -jar /*filepathTrimmomatic/trimmomatic-0.38.jar PE -phred33 ${f}_R1.fq.gz ${f}_R2.fq.gz ${f}_F_paired.fq.gz ${f}_F_unpaired.fq.gz ${f}_R_paired.fq.gz ${f}_R_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:5"

Am I correct? I was told the Trimmomatic program requires java -jar plus the path name of its location as a prefix.

ADD REPLYlink written 9 months ago by new15b30

not sure where you get the ${f} from, only the pattern {} is needed.

install trimmomatic with conda, now you don't need to worry about jars etc. in addition set the --baseout option, now you don't need to name the files anymore, they will automatically gain the desired filename prefix.

As for the question, the way to troubleshoot GNU Parallel commands is to put an echo before the command that you are trying to execute. That way you can see what the command looks like when it would be submitted to the computer. Here I put an echo in front of the command java foo that say I'd want to troubleshoot:

seq  1 10 | parallel 'echo java foo {}_1.fq {}_2.bar'

when the above runs it will print:

java foo 1_1.fq 1_2.bar
java foo 2_1.fq 2_2.bar
java foo 3_1.fq 3_2.bar
java foo 4_1.fq 4_2.bar
java foo 5_1.fq 5_2.bar
java foo 6_1.fq 6_2.bar
java foo 7_1.fq 7_2.bar
java foo 8_1.fq 8_2.bar
java foo 9_1.fq 9_2.bar
java foo 10_1.fq 10_2.bar

The lines above are what parallel would execute if there was no echo command in front. If the commands above are correct remove the echo.

ADD REPLYlink modified 9 months ago • written 9 months ago by Istvan Albert ♦♦ 81k
1

The same result as echo is obtained using parallel --dryrun.

ADD REPLYlink written 9 months ago by WouterDeCoster42k

I can get rid of the $ symbols, no problem.

Conda.... is it like pip for Python? It seems to be a package installing tool and that won't be of much use to me, since I don't have permissions to install any packages on the server. For example, my attempts to install cutadapt resulted in utter failure since pip is not available to me as a student. I think conda is something similar? I am not sure, this is my first experience encountering either. As long as the files are .dmg, I can transfer them to folders and run them, but packages that aren't already on the server are out of reach for me. Trimmomatic is a dmg file, so I could install it locally and then transfer it to the server. I think. Filezilla does allow transfers. I can't do any of this locally, since the files are too big, so I am constrained to the school's server space and all the limitations that come with it. Can I still use jar? And the baseout option - where would I put it so that it would work? And do I just write --baseout? This is another issue I have - often the guides for a package or program will list options containing symbols other than those I actually need to write, for illustration purposes. I have no way of discerning proper code from improper code, I will often copy as is and run into trouble. Example: --someoption <filename>

I would instinctively keep the brackets around the filename, but once, when I did that, it was utterly wrong. Similarly, I have found other symbols that commonly mean something in code, used for other purposes (mainly in guides), like the | (pipe) symbol, even though piping was not called for. I have no way of identifying it as non-code and take it literally. Does that make sense?

ADD REPLYlink written 9 months ago by new15b30

It might be useful for you to join our biostars slack group biostar.slack.com: Chat for the biostars community

ADD REPLYlink written 9 months ago by WouterDeCoster42k
1

Sure, why not? Anything that helps.

ADD REPLYlink written 9 months ago by new15b30

conda is like pip, yes, only much, much easier to use (especially if you're a noob).

It's also useful for precisely the reason that you don't have permission to install things, since it does everything inside your user directory. No special permissions needed. In fact, it may be one of the only ways you can install things as a non-admin user, so I do strongly urge you to try using it. The setup is very, very simple, I promise.

You just have to learn the conventions. The real problem is that they aren't universal. It's common, but not, always the case, to denote variables/arguments <like so> implying that what's between the brackets is just a placeholder (inclusive of the brackets).

Istvan's suggestion of parallel is well worth learning, but it is a slightly more advanced technique, so there is perhaps an element of trying to run before walking here.

ADD REPLYlink written 9 months ago by Joe15k

Okay stupid noob question: Will Conda download packages remotely to the school server even though it is located on my computer? Or do I have to move it to my school user directory?

ADD REPLYlink written 9 months ago by new15b30

Are you trying to run these analyses on your laptop or on the server (the short answer is you could just do both).

ADD REPLYlink written 9 months ago by Joe15k

Definitely not on my laptop - the files are too enormous and the computation too extreme for my poor computer. I only run things on the server. But I downloaded Anaconda2 to my laptop.

ADD REPLYlink modified 9 months ago • written 9 months ago by new15b30

You need to download conda and install into your account on server at school. Then things you install via conda will then live in your personal space on the server.

ADD REPLYlink modified 9 months ago • written 9 months ago by genomax75k

I don't have permissions to install anything to my server. All of those sudo commands online that I find do not work for me because they require a password. Leaving out sudo just gets it to ask me if I am root. I think I was utterly confused at how I got Trimmomatic to work as well. From what I can see now, it is a jar file, not a dmg.

ADD REPLYlink written 9 months ago by new15b30

You shouldn't need sudo to install conda, and afterwards you can install everything. But we don't know your particular server of course. I've doing fine without sudo permission for years :)

ADD REPLYlink written 9 months ago by WouterDeCoster42k

Advantage of conda is that it does not need root privileges. You can install things in your own home directory. See this guide: Creating workflows with snakemake and conda

No sudo should be needed.

ADD REPLYlink written 9 months ago by genomax75k
1

I downloaded conda to my directory and tried to install Trimmomatic. But I got this message:

Solving environment: failed

I was so happy that I got conda to install... and then this.

EDIT: I got it to work with this:

conda install -c bioconda trimmomatic
ADD REPLYlink modified 9 months ago • written 9 months ago by new15b30

I don't think this is thread is the right place to learn about conda, troubleshoot your system, permissions etc.

in general, this site is not a forum for free-form back-and-forth chat. The threads should be focused on answering a specific question, and while some deviation is fine, this is now going towards the excessive.

if you have a new, specific question, it should be asked separately. This helps in keeping each page focused on a single topic.

ADD REPLYlink modified 9 months ago • written 9 months ago by Istvan Albert ♦♦ 81k

This business with Conda is in response to your suggestion, however. Without conda, I will not be able to download Trimmomatic as a python package and can't try any of your code. While this thread may look chaotic, it has a flow. You suggested I run Trimmomatic without jar, someone else hinted at how I could do it and I am trying to get there. The end result will hopefully be me running parallel as a solution to Trimmomatic automation.

ADD REPLYlink written 9 months ago by new15b30

I now have conda and Trimmomatic installed as a package - I tried to run the code:

cat ids.txt | parallel 'trimmomatic PE {}R1*.f*q.gz {}R2*.f*q.gz {}pair_R1.fq.gz {}unpair_R1.fq.gz {}pair_R2.fq.gz {}unpair_R2.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:5'

But it told me parallel: Warning: Input is read from the terminal. Only experts do this on purpose. Press CTRL-D to exit.

I kind of figured I would have a problem, even if I don't understand what kind, because I do not have an ids.txt file, nor do I know what one would put in an ids.text file. Googling showed me nothing but highly confusing things that make no sense to me at this stage. I am also not sure how baseout works - in the manual it says to provide a base file name for the outputs and it will then just automatically name them basename + 1P 1UP 2P 2UP for the four output files. Nice, but I have more than one set of reads and they don't all have the same base name. Do I just list the wildcard names for them?

EDIT: Is ids.txt just a list of the files I have? If so, do I just copy paste them into a textfile, regardless of formatting? I just made a file like this and saved it to the folder with the actual fq files (where I am doing my trimming). This time, the Trimmomatic process starts, but it tells me it cannot find any of my files and stops. The files are still there though.

UPDATE: Changed my ids.txt file around to contain only the unique prefixes, no file endings, listed on a separate line each, like this:


X2_C1

X2_C2

X2_C3

Then, I ran this code:

cat ids.txt | parallel --jobs 8 'trimmomatic PE {}_R1.fq.gz {}_R2.fq.gz {}pair_R1.fq.gz {}unpair_R1.fq.gz {}pair_R2.fq.gz {}unpair_R2.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:5'

And it seems to be working. Now all I need to do is figure out the baseout thing. If anyone has any tips, I'd love that.

ADD REPLYlink modified 9 months ago • written 9 months ago by new15b30
1
gravatar for genomax
9 months ago by
genomax75k
United States
genomax75k wrote:

See these past threads for assistance with this:
Linux for loop for 2 inputs and 4 outputs for Trimmomatic fastq quality trimming
Trimmomatic job script to run on multiple pair end read file

Hint: Put the word echo before trimmomatic command in the loops to print out the actual command lines. Verify they look good. If they do, then remove the word echo to actually run them.

ADD COMMENTlink modified 9 months ago • written 9 months ago by genomax75k

And with parallel you can use --dryrun

ADD REPLYlink written 9 months ago by WouterDeCoster42k

I'd tried to make sense of those and couldn't. I cannot, for the life of me, apply this to my data and get it to work. I even changed the names of my files to X2_C1_R1 etc. to make it easier. A friend of mine tried to play around with it and came up with the following:

for f in $(ls *.fq.gz | sed -e 's/_1.fq.gz//' -e 's/_2.fq.gz' | sort -u)
do
java -jar /usr/local/bin/Trimmed/trimmomatic-0.38.jar PE -phred33 ${f}_1.fq.gz ${f}_2.fq.gz ${f}_F_paired.fq.gz ${f}_F_unpaired.fq.gz ${f}_R_paired.fq.gz ${f}_R_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:5
done

I tried putting that into a py file, but the thing won't run. Although he knows vastly more than I, neither one of us has the technical knowledge to straighten this mess out. Looking at forums where other people solved their own specific case scenarios isn't helping us, since we don't know how to modify their success into one for ourselves. I am very new to this.

ADD REPLYlink modified 9 months ago by Joe15k • written 9 months ago by new15b30

Alright, we got it to run for these names (well it has processed C1 and C2 so far): X2_C1_R1 X2_C1_R2 X2_C2_R1 X2_C2_R2 X2_C3_R1 X2_C3_R2

The code:

for f in $(ls *.fq.gz | sed -e 's/_R1.fq.gz//' -e 's/_R2.fq.gz//' | sort -u)
do
java -jar /*filepathTrimmomatic/trimmomatic-0.38.jar PE -phred33 ${f}_R1.fq.gz ${f}_R2.fq.gz ${f}_F_paired.fq.gz ${f}_F_unpaired.fq.gz ${f}_R_paired.fq.gz ${f}_R_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:5
done

I copy pasted this into Text Wrangler, which seems to understand Python spacing requirements. For all you noobs out there, the first line of code translates into the following: We are defining "f" as elements from a list of everything in the folder that has an ending .fq.gz. Then we are taking these strings and sending them to sed, which now extracts everything in their names aside from R1.fq.gz and R2.fq.gz. Then we send those strings off to sort, where only unique names are selected - this means we are getting rid of dupes. In our case this left us with three unique strings: X2_C1, X2_C2 and X2_C3. This is then all sent to the Trimmomatic command where every "f" is now sequentially replaced by one of the strings we defined in the first process through sed and sort. During the first loop, everything receives an X2_C1, during the second loop it is an X2_C2 and finally an X2_C3. What happens is that when X2_C1 is printed to the input files {f}_1.fq.gz and {f}_2.fq.gz in place of "f", these files are used to generate the output files, by replacing "f" with X2_C1 as well.

ADD REPLYlink modified 9 months ago • written 9 months ago by new15b30
1

You aren’t using python, so ignore it completely for the time being.

What you’re doing/attempting is to call programs (namely a Java program) via the commandline which understands the bash language. There is precisely 0 python at work here at the moment.

Putting code in to a .py file and running that file with python, doesn’t make it python code. Python understands only very specific (it’s own) syntax. What you’re trying to do could be done with python, but there’s actually no reason at all to do so (it would be orders of magnitude more complicated).

The approach you’re taking is a valid one (if a little fragile) - I would carry on as you are, but be sure you have copies of the files elsewhere incase you end up overwriting something (easily done when the files only differ by a single character potentially).

ADD REPLYlink written 9 months ago by Joe15k

Yes, this issue of mine is improving a little with the Python course I am taking. Now, in retrospect, it seems obvious that I must be writing in shell language since I did not switch to Python in Terminal. At the time, not so clear. I am keeping this operation confined to one practice folder, so I am hoping to avoid any overwriting elsewhere. Someone else in the thread recommended a way to do it with parallel, so I will attempt to chew my way around what he/she says. Some of the stuff he/she is telling me I understand, some not so much. I will try to explain my issues as I go along.

ADD REPLYlink modified 9 months ago • written 9 months ago by new15b30
1
gravatar for swbarnes2
9 months ago by
swbarnes27.0k
United States
swbarnes27.0k wrote:

What you might find easier than doing it all in the command line is to learn a language like R or perl or python, and have that software read in the file names, parse out the bits you need, and construct the command lines there, and run trimmomatic from inside the script. It might be easier to get what you are doing if you spread the work over 4 or 5 lines.

ADD COMMENTlink written 9 months ago by swbarnes27.0k

I am taking online Python courses but the speed at which I am learning is not matching the level of knowledge I need in order to deal with some of this code. Right now, I just learned what a for loop is - on extremely simple examples. I'm that new.

ADD REPLYlink written 9 months ago by new15b30

You realize most bioinformaticists are self taught? Google how to get the file names in a directory in python

ADD REPLYlink written 9 months ago by swbarnes27.0k

As you can see above, my friend has helped me answer my own question. I came here seeking help, not disparaging remarks and redundant comments. I am lucky to have a friend who knows some programming. Pity those who don't.

ADD REPLYlink written 9 months ago by new15b30
1

I’m sorry you’ve taken offence to something in this thread, but I don’t personally see any disparaging remarks anywhere? perhaps there’s some intonation lost via the medium of text...

@swbarnes2 is providing legitimate advice which is the right way to approach the task. I tell this to people in my lab all the time: it is always worth investing some time now learning how to code properly, as it will save you time in the long run. Everyone is always up against time pressure, so people fall in to the mentality of “I just need to get this thing to work now, and I’ll learn it properly later”. Later never comes - there’s always another pressure.

ADD REPLYlink written 9 months ago by Joe15k

I didn't mean to sound rude, I was just disappointed. In my original post, I stated that I am not capable of modifying existing code. My friend is at that level, which is why he was able to adjust some of the protocols we Googled and amend them to suit my needs. I was under the impression that perhaps this forum is full of people like my friend, willing teachers who would take the time to dissect some of the answers provided here and on StackOverflow and show me how I could make it work for my own data. I am learning and trying very hard to catch up, not only through online courses, but also through experiences like this, where I am taught how a specific block of code can be modified and turned into something I can actually use for an everyday task. I do, however, realize it is not your duty to educate me, nor is it your fault that I know so little. I'm a little sad, but I'll get over it. No hard feelings.

ADD REPLYlink written 9 months ago by new15b30
1

I understand your disappointment, but I do think you might have misinterpreted the forums purpose just slightly. Where possible, we try not to simply give out code, but instead lead people to the answer.

In your defence, you’ve done everything right: posed a question, shown effort, provided code examples and so on. Technically however, we could have closed this as offtopic as this is actually just a programming question and not strictly a bioinformatics one.

Personally I’m quite happy to help you out because it’s clear you’re trying - but, my advice to you is not to ‘hide’ behind “not being capable of modifying existing code”. It’s extremely rare you’ll get a potted solution that requires no modification. What happens if your filenames change for example? Everyone on this forum started from scratch once and equally you totally are capable, you just need to play with and break the code until it becomes less alien to you. This is what we all do on a daily basis.

ADD REPLYlink written 9 months ago by Joe15k

I do try to play with code, quite frequently in fact. The biggest obstacles I encounter are functional, realistic starting points that can serve as seeds for future attempts to experiment. Without these, I am stuck wallowing in my lack of understanding. I don't know how to explain it - perhaps let me try: it is as if I am looking for answers about a foreign language IN a foreign language. The answers make no sense to me without a translator. Once translated, however, I can apply these familiar blocks to decipher other problems. For example, this little adventure into Trimmomatic automation has now made it possible for me to look at code from cutadapt and modify it. I now understand it and gleaned just that much more about its general structure and syntax. In time, I will have enough experience to Google something and find what seems like workable blueprints instead of just perceived jibberish. But I do understand what you're saying. I'm asking for calculus level solutions when I don't even know basic mathematical operations. I did misinterpret the purpose of this forum and I apologize for that. I'm a bit frustrated due to the circumstances (suddenly expected to do SEQ analysis and having no background in coding whatsoever). I imagine this is the case with many students that never had the foresight or guidance to take programming courses at the undergraduate level. How I wish I had! Thank you for your kind words. I will keep on trying. That is about all one can do!

ADD REPLYlink written 9 months ago by new15b30
1

Feel free to come back with your questions - we're happy to help you and point you in the direction of a solution. I understand what you write about learning a foreign language, in a foreign language. I'm also largely a self-taught bioinformatician. But I can tell you that this feeling also doesn't really go away. You do learn, and some things will be easier in the future, but for every new analysis or tool, you'll have to figure out again how things work. You seem to have the right attitude, I'm sure you can do this.

ADD REPLYlink written 9 months ago by WouterDeCoster42k

I am just happy everytime I learn something and recognize something similar somewhere else. Like for example, my beginner Python classes are making it easier for me to look at something like Biopython. I really want the days of confusing Python with shell scripting and other forms of code to be over. I want to be able to look at a template and just know the symbology instead of being confused. lol.

ADD REPLYlink written 9 months ago by new15b30

Well you have the right attitude, so that’s half the battle.

BioPython is a huge library, I use it regularly and still only know half a dozen of the functions anything near what I’d call ‘well’. It’s a constant learning process for us all.

It’s pretty easy to know if code is python or not, simply by answering the question: Is this easy to read (for code)?

Almost all other languages are more opaque and have arcane syntax. Python has its share, sure, but you don’t normally have to concern yourself with that till later down the line.

The best way to learn is exactly what you’re doing: find a problem, and attempt to solve it. You can watch all the videos, read all the books etc. But ultimately, there’s no better education for coding that just getting stuck in.

ADD REPLYlink written 9 months ago by Joe15k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1132 users visited in the last hour