Question: Cutadapt for loop issue
0
gravatar for geneart$$
7 months ago by
geneart$$40
United States
geneart$$40 wrote:

Hi all,I wrote this script to loop over two sets of fastq files I have (sample1_R1 and sample1_R2 sample2_R1 sample2_R2) I want cutadapt to do its thing on sample1 as paired end and sample2 as paired end.

Originally I had issues with this loop iterating over the trimmed output but I changed "${name}_trimmed.R2.fastq.gz" for -p, and -o, to "${name}_trimmed.fastq.gz" and this took care of that issue. My for loop can be found in the code below under COMMAND EXECUTED.

Now, the problem is this:

  1. This change I made to the output file names, writes the correct names for the output files in my folder. so I have sample1_R1_trimmed.fastq.gz and the same sample1_R2_trimmed.fastq.gz. Good so far, although there is something I noticed in the output summary file( when the script is executed I have job_ID.o and job_ID.e files, the summary report I am referring to is the Job_ID.o file ) , which if not redirected to a text file would spit out as "JobID.o" file.

Below is the first part of my Job_ID.o report:

Notice the command line parameters. It is taking in all R1 at a time and similarly taking all R2s at a time( not posted the R2 part). So basically the script is scmehowtelling Cutadapt to treat the files as single end, even though I have the -o and -p option which is issue #1.

Second issue is, despite this, the output files (sample1_R1_trimmed.fastq.gz and sample1_R2_trimmed.fastq.gz) that are written in my folder have the correct R1 and R2 names! Although I don't believe the the read trimmed is complteley good.

Why is my script treating this as SE? It could be I gave ${name}.fastq.gz twice and hence treating it as SE? I need to give it two files and I can't take delete one of these. So is there a way to indicate to use R1 and R2 at this point in code? I tried putting ${name}_R1.fastq.gz and that did not process any files at all! Am I losing my mind here? or not getting this Cutadapt thing right? After all this might be a Cutadapt thing and not a scripting thing?

So can I do to see if it is indeed Cutadapt problem or my code! you guessed it right! It is my code :) If you scroll down below my code here, you can see I ran Cutadapt directly on command line without a for loop and compared the summary of trimming to the output from the cold below on both sample 1 and sample2.

 ---- COMMAND EXECUTED: ---------------------------------------------------------
    (   for i in *.fastq.gz;  do name=(`basename "$i" .fastq.gz`); cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o ${name}_trimmed.fastq.gz -p ${name}_trimmed.fastq.gz ${name}.fastq.gz ${name}.fastq.gz; done )
    --------------------------------------------------------------------------------
    This is cutadapt 2.9 with Python 3.7.7
    Command line parameters: -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o sample1_R1_trimmed.fastq.gz -p sample1_R1_trimmed.fastq.gz sample1_R1.fastq.gz sample1_R1.fastq.gz
    Processing reads on 1 core in paired-end mode ...
    Finished in 82.19 s (115 us/read; 0.52 M reads/minute).

Here is the Cutadapt directly on commadline for both sample1 and sample2 as PE, to see if the numbers made sense .

for sample1

total read pairs processed:            716,612
  Read 1 with adapter:                  **33,407** (4.7%)
  Read 2 with adapter:                  **33,450** (4.7%)

for sample 1: Below is summary for Cutadapt launched through my script ( command in the script given above).

Total read pairs processed:            716,612
  Read 1 with adapter:                  **33,407** (4.7%)
  Read 2 with adapter:                  29,236 (4.1%)

for sample 1: For R2:

Total read pairs processed:            716,612
  Read 1 with adapter:                  29,435 (4.1%)
  Read 2 with adapter:                  **33,450** (4.7%)

Similar pattern for sample2. (not attached here) See? Too many funky things going on! I am not writing this script correctly and can't figure out what the problem is? I tried many modification to this code later to see if I can fix something. But honestly it is all a blind hunt! I want to understand my problem in my code here. Any help from anyone to correct this is greatly appreciated!

ADD REPLY • link • edit • moderate written 1 hour ago by geneart$$ • 10

scripting forloop unix cutadapt • 373 views
ADD COMMENTlink modified 7 months ago by geek_y11k • written 7 months ago by geneart$$40
1

The loop processes one sample_RX.fastq.gz a time, so it is obvious that it will process R1 first, then R2

Why don't you just loop over the sample names, like

for sample in `cat samples.list`; do
  cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o ${sample}_R1_trimmed.fastq.gz -p ${sample}_R2_trimmed.fastq.gz ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz
done
ADD REPLYlink modified 7 months ago • written 7 months ago by yztxwd380
2

before doing cat sample.list, one should specify IFS=$'\n'

ADD REPLYlink written 7 months ago by geek_y11k
1

Yes! I forgot this, it's better to store old IFS, then get it back after things done

old_IFS=$IFS
IFS=$'\n'
for sample in `cat samples.list`; do
  cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o ${sample}_R1_trimmed.fastq.gz -p ${sample}_R2_trimmed.fastq.gz ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz
done
IFS=$old_IFS
ADD REPLYlink written 7 months ago by yztxwd380

Thanks yztxwd! Will try and update.

ADD REPLYlink written 7 months ago by geneart$$40
1
gravatar for geek_y
7 months ago by
geek_y11k
Barcelona
geek_y11k wrote:

Tooo long and confusing post.

In summary what I understood is you want to loop over fastq and run cut adapt.

for sample in *_R1_fastq.gz
do
name=`basename ${sample} "_R1_fastq.gz"`
echo "somecommand ${name}_R1_fastq.gz ${name}_R2_fastq.gz"
done

you dont need to loop over *.fastq.gz instead only loop over R1 or R2.

Other alternatives you could try GNU parallel:

ls *_R1_fastq.gz | sed 's/_R1_fastq.gz//' | parallel --jobs 2 "echo somecommand {}_R1_fastq.gz {}_R2_fastq.gz"

or

parallel "echo somecommand {}_R1_fastq.gz {}_R2_fastq.gz" ::: `ls *_R1_fastq.gz | sed 's/_R1_fastq.gz//'`
ADD COMMENTlink modified 7 months ago • written 7 months ago by geek_y11k
2

Or, simply, somecommand $sample ${sample/_R1_/_R2_} to skip the basename detour assuming _R1_ is found just once in the file name.

ADD REPLYlink written 7 months ago by _r_am30k

Thanks RamRS, will try and see if this works.

ADD REPLYlink written 7 months ago by geneart$$40

I implemented the suggested echo $sample ${sample/R1.fastq.gz/R2.fastq.gz} in my code and that did the trick of taking in R1 and R2 files properly. Thankyou so much for that tip!

One thing is my output files -o and -p are being named respectively sample1_R1.fastq.gz_trimmed.fastq.gz and -p is giving sample1_R2.fastq.gz_trimmed.fastq.gz.

I can always rename my files but not an elegant way of doing things and not practical as I have huge number of fastqs. how can I trim off the first _R*.fastq.gz part in both my output files? I tried sed sed 's/_R1\.fastq\.gz//' but it did not work.

I would have to introduce this sed in this part -p ${sample/R1.fastq.gz/R2.fastq.gz} for both -o an d-p but not working!

ADD REPLYlink modified 7 months ago • written 7 months ago by geneart$$40
1

You should look at this site: http://wiki-dev.bash-hackers.org/syntax/pe

You can replace R1.fastq with trimmed.R1.fastq or something along those lines. These are file naming problems and working on them on your own will increase your shell aptitude. Good luck!

ADD REPLYlink written 7 months ago by _r_am30k

Will do! Thankyou for the link :)

ADD REPLYlink written 7 months ago by geneart$$40
1

RamRS, again thanks for the link! I learnt a lot ! I did change the ${sample} all across my code to ${f} just to keep it simple. Just FYI for anyone else dealing with the same problem as I did! below is what did the trick and named my output files correctly.

-o ${f/R1.fastq.gz/R1}_trimmed.fastq.gz -p ${f/R1.fastq.gz/R2}_trimmed.fastq.gz

or

-o ${f%%.*}_trimmed.fastq.gz -p ${f/R1.fastq.gz/R2}_trimmed.fastq.gz
ADD REPLYlink modified 7 months ago • written 7 months ago by geneart$$40

Thanks geek_y, Sorry for the long post but wanted to give as much detail as possible! You know your comment made me realize why my code is not taking R1 and R2 to feed into Cutadapt. It is looping over each individual fastqs at a time and the Cutadapt command within the for loop is taking just that one fastq at time, for example R1 only, at one time and processing ( which seemed like a SE processing) instead of taking the R1 and R2 of the same sample to feed into Cutadapt to process.

Essentially I need to somehow let the for loop, take R1 and R2 of the same sample and feed it into Cutadapt command as input. Infact what I wrote should be good to process single end reads instead of paired end! Sorry for not being clear. I will try all the suggestions and update.

ADD REPLYlink modified 7 months ago • written 7 months ago by geneart$$40
1

yes.

ADD REPLYlink written 7 months ago by geek_y11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1554 users visited in the last hour