Automating Trimmomatic for total Noobs
3
1
Entering edit mode
2.9 years ago
new15b ▴ 30

Hi! I have tried searching for answers to this question, but all of the answers I find are written in a fashion that is out of my league. Most of the people asking the questions, in fact, know more than I do. When it comes to programming, I am a complete and utter noob.

I am trying to automate Trimmomatic to cut the adapters and improve quality on more than just one set of reads. I have quite a large set of forward and reverse RNA-seq fq.gz files and I need to process them. Currently, I have this code:

java -jar /home/Trimming/trimmomatic-0.38.jar PE -threads 24 -phred33 XX_2_1_1.fq.gz XX_2_1_2.fq.gz F2_1_paired.fq.gz F2_1_unpaired.fq.gz R2_1_paired.fq.gz R2_1_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:5 &> trim.log &


I understand what this code means in terms of context and am fine running it for a single set of forward and reverse reads (here they are named XX_2_1_1.fq.gz and XX_2_1_2.fq.gz respectively.)

For any other total noobs like me reading this, my output files are F2_1_paired.fq.gz F2_1_unpaired.fq.gz for the forward reads and R2_1_paired.fq.gz R2_1_unpaired.fq.gz for the reverse reads. Problem is, those are not my only read sets. To avoid having to run this process over and over again, I would like to find a way to allow it to find all my reads that need trimming and then execute it while I do other things.

Unfortunately, as a non-programmer, I can't just write or modify code at will. I know how to automate, for example, FastQC by using parallel in this manner, where I find any files with a certain word or fragment in them and then bulk FastQC them:

find *paired*.gz | parallel 'fastqc {}' &


This is not directly transferable though. I can't just wildcard XX and then parallel everything to Trimmomatic, because unlike FastQC, Trimmomatic requires me to specify names for output files. Can someone please help me come up with code that will enable me to do this? I would like to learn it as well - understand what it means and the logical thought process that went into the design of it.

Thank you very much in advance!

RNA-Seq • 3.9k views
6
Entering edit mode
2.9 years ago

read up on how GNU parallel works and you will see that it can be made to generate inputs by groups for example

seq 1 10 | parallel -n 2 echo {1} {2}


will produce

1 2
3 4
5 6
7 8
9 10


That being said don't run bioinformatics tools on file "patterns" like ls *.fq or find *.fq The order of the files may not be what you expect and will eventually cause major headaches.

As a rule try to compile lists of "root" names or ids that fully identify the files that you want to process. For example, say the root is SRR12234, then build your command around these known entities

cat ids.txt | parallel "trimmomatic {}_1.fq {}_2.fq"
cat ids.txt | parallel "bwa mem hg38 {}_1.fq {}_2.fq > {}.sam"


and so on. Basically, it is you who should explicitly create the file names from known information rather than crawling the directory for file names. Sticking to this simple tip can save you a lot of trouble.

See more examples here Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

0
Entering edit mode

Hi! Thank you for this suggestion. So if I understand correctly, I will use the following code?

cat ids.txt | parallel "java -jar /*filepathTrimmomatic/trimmomatic-0.38.jar PE -phred33 ${f}_R1.fq.gz${f}_R2.fq.gz ${f}_F_paired.fq.gz${f}_F_unpaired.fq.gz ${f}_R_paired.fq.gz${f}_R_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:5"


Am I correct? I was told the Trimmomatic program requires java -jar plus the path name of its location as a prefix.

0
Entering edit mode

not sure where you get the ${f} from, only the pattern {} is needed. install trimmomatic with conda, now you don't need to worry about jars etc. in addition set the --baseout option, now you don't need to name the files anymore, they will automatically gain the desired filename prefix. As for the question, the way to troubleshoot GNU Parallel commands is to put an echo before the command that you are trying to execute. That way you can see what the command looks like when it would be submitted to the computer. Here I put an echo in front of the command java foo that say I'd want to troubleshoot: seq 1 10 | parallel 'echo java foo {}_1.fq {}_2.bar'  when the above runs it will print: java foo 1_1.fq 1_2.bar java foo 2_1.fq 2_2.bar java foo 3_1.fq 3_2.bar java foo 4_1.fq 4_2.bar java foo 5_1.fq 5_2.bar java foo 6_1.fq 6_2.bar java foo 7_1.fq 7_2.bar java foo 8_1.fq 8_2.bar java foo 9_1.fq 9_2.bar java foo 10_1.fq 10_2.bar  The lines above are what parallel would execute if there was no echo command in front. If the commands above are correct remove the echo. ADD REPLY 1 Entering edit mode The same result as echo is obtained using parallel --dryrun. ADD REPLY 0 Entering edit mode I can get rid of the$ symbols, no problem.

Conda.... is it like pip for Python? It seems to be a package installing tool and that won't be of much use to me, since I don't have permissions to install any packages on the server. For example, my attempts to install cutadapt resulted in utter failure since pip is not available to me as a student. I think conda is something similar? I am not sure, this is my first experience encountering either. As long as the files are .dmg, I can transfer them to folders and run them, but packages that aren't already on the server are out of reach for me. Trimmomatic is a dmg file, so I could install it locally and then transfer it to the server. I think. Filezilla does allow transfers. I can't do any of this locally, since the files are too big, so I am constrained to the school's server space and all the limitations that come with it. Can I still use jar? And the baseout option - where would I put it so that it would work? And do I just write --baseout? This is another issue I have - often the guides for a package or program will list options containing symbols other than those I actually need to write, for illustration purposes. I have no way of discerning proper code from improper code, I will often copy as is and run into trouble. Example: --someoption <filename>

I would instinctively keep the brackets around the filename, but once, when I did that, it was utterly wrong. Similarly, I have found other symbols that commonly mean something in code, used for other purposes (mainly in guides), like the | (pipe) symbol, even though piping was not called for. I have no way of identifying it as non-code and take it literally. Does that make sense?

0
Entering edit mode

It might be useful for you to join our biostars slack group biostar.slack.com: Chat for the biostars community

1
Entering edit mode

Sure, why not? Anything that helps.

0
Entering edit mode

conda is like pip, yes, only much, much easier to use (especially if you're a noob).

It's also useful for precisely the reason that you don't have permission to install things, since it does everything inside your user directory. No special permissions needed. In fact, it may be one of the only ways you can install things as a non-admin user, so I do strongly urge you to try using it. The setup is very, very simple, I promise.

You just have to learn the conventions. The real problem is that they aren't universal. It's common, but not, always the case, to denote variables/arguments <like so> implying that what's between the brackets is just a placeholder (inclusive of the brackets).

Istvan's suggestion of parallel is well worth learning, but it is a slightly more advanced technique, so there is perhaps an element of trying to run before walking here.

0
Entering edit mode

Okay stupid noob question: Will Conda download packages remotely to the school server even though it is located on my computer? Or do I have to move it to my school user directory?

0
Entering edit mode

Are you trying to run these analyses on your laptop or on the server (the short answer is you could just do both).

0
Entering edit mode

Definitely not on my laptop - the files are too enormous and the computation too extreme for my poor computer. I only run things on the server. But I downloaded Anaconda2 to my laptop.

0
Entering edit mode

You need to download conda and install into your account on server at school. Then things you install via conda will then live in your personal space on the server.

0
Entering edit mode

I don't have permissions to install anything to my server. All of those sudo commands online that I find do not work for me because they require a password. Leaving out sudo just gets it to ask me if I am root. I think I was utterly confused at how I got Trimmomatic to work as well. From what I can see now, it is a jar file, not a dmg.

0
Entering edit mode

You shouldn't need sudo to install conda, and afterwards you can install everything. But we don't know your particular server of course. I've doing fine without sudo permission for years :)

0
Entering edit mode

Advantage of conda is that it does not need root privileges. You can install things in your own home directory. See this guide: Creating workflows with snakemake and conda

No sudo should be needed.

1
Entering edit mode

I downloaded conda to my directory and tried to install Trimmomatic. But I got this message:

Solving environment: failed

I was so happy that I got conda to install... and then this.

EDIT: I got it to work with this:

conda install -c bioconda trimmomatic

0
Entering edit mode

I don't think this is thread is the right place to learn about conda, troubleshoot your system, permissions etc.

in general, this site is not a forum for free-form back-and-forth chat. The threads should be focused on answering a specific question, and while some deviation is fine, this is now going towards the excessive.

if you have a new, specific question, it should be asked separately. This helps in keeping each page focused on a single topic.

0
Entering edit mode

This business with Conda is in response to your suggestion, however. Without conda, I will not be able to download Trimmomatic as a python package and can't try any of your code. While this thread may look chaotic, it has a flow. You suggested I run Trimmomatic without jar, someone else hinted at how I could do it and I am trying to get there. The end result will hopefully be me running parallel as a solution to Trimmomatic automation.

0
Entering edit mode

I now have conda and Trimmomatic installed as a package - I tried to run the code:

cat ids.txt | parallel 'trimmomatic PE {}R1*.f*q.gz {}R2*.f*q.gz {}pair_R1.fq.gz {}unpair_R1.fq.gz {}pair_R2.fq.gz {}unpair_R2.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:5'


But it told me parallel: Warning: Input is read from the terminal. Only experts do this on purpose. Press CTRL-D to exit.

I kind of figured I would have a problem, even if I don't understand what kind, because I do not have an ids.txt file, nor do I know what one would put in an ids.text file. Googling showed me nothing but highly confusing things that make no sense to me at this stage. I am also not sure how baseout works - in the manual it says to provide a base file name for the outputs and it will then just automatically name them basename + 1P 1UP 2P 2UP for the four output files. Nice, but I have more than one set of reads and they don't all have the same base name. Do I just list the wildcard names for them?

EDIT: Is ids.txt just a list of the files I have? If so, do I just copy paste them into a textfile, regardless of formatting? I just made a file like this and saved it to the folder with the actual fq files (where I am doing my trimming). This time, the Trimmomatic process starts, but it tells me it cannot find any of my files and stops. The files are still there though.

UPDATE: Changed my ids.txt file around to contain only the unique prefixes, no file endings, listed on a separate line each, like this:

## X2_C3

Then, I ran this code:

cat ids.txt | parallel --jobs 8 'trimmomatic PE {}_R1.fq.gz {}_R2.fq.gz {}pair_R1.fq.gz {}unpair_R1.fq.gz {}pair_R2.fq.gz {}unpair_R2.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:5'


And it seems to be working. Now all I need to do is figure out the baseout thing. If anyone has any tips, I'd love that.

1
Entering edit mode
2.9 years ago
GenoMax 111k

See these past threads for assistance with this:
Linux for loop for 2 inputs and 4 outputs for Trimmomatic fastq quality trimming
Trimmomatic job script to run on multiple pair end read file

Hint: Put the word echo before trimmomatic command in the loops to print out the actual command lines. Verify they look good. If they do, then remove the word echo to actually run them.

0
Entering edit mode

And with parallel you can use --dryrun

0
Entering edit mode

I'd tried to make sense of those and couldn't. I cannot, for the life of me, apply this to my data and get it to work. I even changed the names of my files to X2_C1_R1 etc. to make it easier. A friend of mine tried to play around with it and came up with the following:

for f in $(ls *.fq.gz | sed -e 's/_1.fq.gz//' -e 's/_2.fq.gz' | sort -u) do java -jar /usr/local/bin/Trimmed/trimmomatic-0.38.jar PE -phred33${f}_1.fq.gz ${f}_2.fq.gz${f}_F_paired.fq.gz ${f}_F_unpaired.fq.gz${f}_R_paired.fq.gz ${f}_R_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:5 done  I tried putting that into a py file, but the thing won't run. Although he knows vastly more than I, neither one of us has the technical knowledge to straighten this mess out. Looking at forums where other people solved their own specific case scenarios isn't helping us, since we don't know how to modify their success into one for ourselves. I am very new to this. ADD REPLY 0 Entering edit mode Alright, we got it to run for these names (well it has processed C1 and C2 so far): X2_C1_R1 X2_C1_R2 X2_C2_R1 X2_C2_R2 X2_C3_R1 X2_C3_R2 The code: for f in$(ls *.fq.gz | sed -e 's/_R1.fq.gz//' -e 's/_R2.fq.gz//' | sort -u)
do
java -jar /*filepathTrimmomatic/trimmomatic-0.38.jar PE -phred33 ${f}_R1.fq.gz${f}_R2.fq.gz ${f}_F_paired.fq.gz${f}_F_unpaired.fq.gz ${f}_R_paired.fq.gz${f}_R_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:5
done


I copy pasted this into Text Wrangler, which seems to understand Python spacing requirements. For all you noobs out there, the first line of code translates into the following: We are defining "f" as elements from a list of everything in the folder that has an ending .fq.gz. Then we are taking these strings and sending them to sed, which now extracts everything in their names aside from R1.fq.gz and R2.fq.gz. Then we send those strings off to sort, where only unique names are selected - this means we are getting rid of dupes. In our case this left us with three unique strings: X2_C1, X2_C2 and X2_C3. This is then all sent to the Trimmomatic command where every "f" is now sequentially replaced by one of the strings we defined in the first process through sed and sort. During the first loop, everything receives an X2_C1, during the second loop it is an X2_C2 and finally an X2_C3. What happens is that when X2_C1 is printed to the input files {f}_1.fq.gz and {f}_2.fq.gz in place of "f", these files are used to generate the output files, by replacing "f" with X2_C1 as well.

1
Entering edit mode

You aren’t using python, so ignore it completely for the time being.

What you’re doing/attempting is to call programs (namely a Java program) via the commandline which understands the bash language. There is precisely 0 python at work here at the moment.

Putting code in to a .py file and running that file with python, doesn’t make it python code. Python understands only very specific (it’s own) syntax. What you’re trying to do could be done with python, but there’s actually no reason at all to do so (it would be orders of magnitude more complicated).

The approach you’re taking is a valid one (if a little fragile) - I would carry on as you are, but be sure you have copies of the files elsewhere incase you end up overwriting something (easily done when the files only differ by a single character potentially).

0
Entering edit mode

Yes, this issue of mine is improving a little with the Python course I am taking. Now, in retrospect, it seems obvious that I must be writing in shell language since I did not switch to Python in Terminal. At the time, not so clear. I am keeping this operation confined to one practice folder, so I am hoping to avoid any overwriting elsewhere. Someone else in the thread recommended a way to do it with parallel, so I will attempt to chew my way around what he/she says. Some of the stuff he/she is telling me I understand, some not so much. I will try to explain my issues as I go along.

1
Entering edit mode
2.9 years ago

What you might find easier than doing it all in the command line is to learn a language like R or perl or python, and have that software read in the file names, parse out the bits you need, and construct the command lines there, and run trimmomatic from inside the script. It might be easier to get what you are doing if you spread the work over 4 or 5 lines.

0
Entering edit mode

I am taking online Python courses but the speed at which I am learning is not matching the level of knowledge I need in order to deal with some of this code. Right now, I just learned what a for loop is - on extremely simple examples. I'm that new.

0
Entering edit mode

You realize most bioinformaticists are self taught? Google how to get the file names in a directory in python

0
Entering edit mode

As you can see above, my friend has helped me answer my own question. I came here seeking help, not disparaging remarks and redundant comments. I am lucky to have a friend who knows some programming. Pity those who don't.

1
Entering edit mode

I’m sorry you’ve taken offence to something in this thread, but I don’t personally see any disparaging remarks anywhere? perhaps there’s some intonation lost via the medium of text...

@swbarnes2 is providing legitimate advice which is the right way to approach the task. I tell this to people in my lab all the time: it is always worth investing some time now learning how to code properly, as it will save you time in the long run. Everyone is always up against time pressure, so people fall in to the mentality of “I just need to get this thing to work now, and I’ll learn it properly later”. Later never comes - there’s always another pressure.

0
Entering edit mode

I didn't mean to sound rude, I was just disappointed. In my original post, I stated that I am not capable of modifying existing code. My friend is at that level, which is why he was able to adjust some of the protocols we Googled and amend them to suit my needs. I was under the impression that perhaps this forum is full of people like my friend, willing teachers who would take the time to dissect some of the answers provided here and on StackOverflow and show me how I could make it work for my own data. I am learning and trying very hard to catch up, not only through online courses, but also through experiences like this, where I am taught how a specific block of code can be modified and turned into something I can actually use for an everyday task. I do, however, realize it is not your duty to educate me, nor is it your fault that I know so little. I'm a little sad, but I'll get over it. No hard feelings.

1
Entering edit mode

I understand your disappointment, but I do think you might have misinterpreted the forums purpose just slightly. Where possible, we try not to simply give out code, but instead lead people to the answer.

In your defence, you’ve done everything right: posed a question, shown effort, provided code examples and so on. Technically however, we could have closed this as offtopic as this is actually just a programming question and not strictly a bioinformatics one.

Personally I’m quite happy to help you out because it’s clear you’re trying - but, my advice to you is not to ‘hide’ behind “not being capable of modifying existing code”. It’s extremely rare you’ll get a potted solution that requires no modification. What happens if your filenames change for example? Everyone on this forum started from scratch once and equally you totally are capable, you just need to play with and break the code until it becomes less alien to you. This is what we all do on a daily basis.

0
Entering edit mode

1
Entering edit mode

Feel free to come back with your questions - we're happy to help you and point you in the direction of a solution. I understand what you write about learning a foreign language, in a foreign language. I'm also largely a self-taught bioinformatician. But I can tell you that this feeling also doesn't really go away. You do learn, and some things will be easier in the future, but for every new analysis or tool, you'll have to figure out again how things work. You seem to have the right attitude, I'm sure you can do this.

0
Entering edit mode

I am just happy everytime I learn something and recognize something similar somewhere else. Like for example, my beginner Python classes are making it easier for me to look at something like Biopython. I really want the days of confusing Python with shell scripting and other forms of code to be over. I want to be able to look at a template and just know the symbology instead of being confused. lol.

0
Entering edit mode

Well you have the right attitude, so that’s half the battle.

BioPython is a huge library, I use it regularly and still only know half a dozen of the functions anything near what I’d call ‘well’. It’s a constant learning process for us all.

It’s pretty easy to know if code is python or not, simply by answering the question: Is this easy to read (for code)?

Almost all other languages are more opaque and have arcane syntax. Python has its share, sure, but you don’t normally have to concern yourself with that till later down the line.

The best way to learn is exactly what you’re doing: find a problem, and attempt to solve it. You can watch all the videos, read all the books etc. But ultimately, there’s no better education for coding that just getting stuck in.