Question: cutadapt summary statistics after every run
0
gravatar for bnayer26
10 months ago by
bnayer260
bnayer260 wrote:

Hi, I am new to analyzing RNA seq data and have just started running cutadapt to trim my adapter sequences from my paired end data. If I run one line of code at a time, I can see a very nice summary statistics as is given in this page of the cutadapt documentation (https://cutadapt.readthedocs.io/en/stable/guide.html#cutadapt-s-output). It looks something like this:

    This is cutadapt 2.7 with Python 3.5.2
Command line parameters: --cores=14 -q 10,10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --info-file=summary1.txt -o EBSP3_L3_1_trimmed.fq.gz -p EBSP3_L3_2_trimmed.fq.gz /home/bnay2/JeanProject/JeanRawData/EBSP3_L3_1.fq.gz /home/bnay2/JeanProject/JeanRawData/EBSP3_L3_2.fq.gz
Processing reads on 14 cores in paired-end mode ...
[    8=------] 00:34:05    31,431,443 reads  @     65.1 µs/read;   0.92 M reads/minute
Finished in 2045.05 s (65 us/read; 0.92 M reads/minute).

=== Summary ===

Total read pairs processed:         31,431,443
  Read 1 with adapter:                 866,710 (2.8%)
  Read 2 with adapter:                 839,966 (2.7%)
Pairs written (passing filters):    31,431,443 (100.0%)

Total basepairs processed: 9,429,432,900 bp
  Read 1: 4,714,716,450 bp
  Read 2: 4,714,716,450 bp
Quality-trimmed:               3,496,279 bp (0.0%)
  Read 1:     1,212,666 bp
  Read 2:     2,283,613 bp
Total written (filtered):  9,420,283,738 bp (99.9%)
  Read 1: 4,710,662,353 bp
  Read 2: 4,709,621,385 bp

=== First read: Adapter 1 ===

Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC; Type: regular 3'; Length: 34; Trimmed: 866710 times.

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-34 bp: 3

Bases preceding removed adapters:
  A: 29.7%
  C: 30.1%
  G: 27.9%
  T: 12.3%
  none/other: 0.0%

Overview of removed sequences
length  count   expect  max.err error counts
3   697484  491116.3    0   697484
4   129061  122779.1    0   129061
5   33700   30694.8 0   33700
6   2812    7673.7  0   2812
7   982 1918.4  0   982
8   144 479.6   0   144

But I wanted to run the code in a loop as part of a script, and I wanted to save these summary statistics information for each trimming into a text file where I can view it later.

I see that the documentation mentions this:

Format of the info file When the --info-file command-line parameter is given, detailed information about the found adapters is written to the given file. The output is a tab-separated text file. Each line corresponds to one read of the input file (unless –times is used, see below). A row is written for all reads, even those that are discarded from the final output FASTA/FASTQ due to filtering options (such as --minimum-length).

But when I tried --info-file=summary.txt, it generated a 12 GB file that has multiple lines of the following, which is very different from the kind of output I used to get shown above:

E00552:427:H7CVVCCX2:3:1101:5071:1379 1:N:0:NAGAGAGG+NACTCCTT   -1      ATCCTGTGAGTGTGTGTATACAGATATTATAGAAATGCTTTTAGGCATCTTTGAAACCAAGCCTATGTGTGAATAGTTTGTGAAAGAGATGGCAAACTCGGATGGGGGAATACCCAAGGGTCATGGGTTTTATGTGTCGCTTTGGTGTA   AAFFJJJJJJJJFJFJJJJFJJJJJJJJJJJJJJJJJFJJJJJJFJJJJJJJJJJJJJJJJJJJJFJ<JJJJJJJ<A-<<AJJJFFAJFJJJJJFFA-A-JF7A<<J-7<AFJ7-A<F<J7AJF7AJ7---<FA-A--<)7--<<-<AA
E00552:427:H7CVVCCX2:3:1101:5294:1379 1:N:0:NAGAGAGG+NACTCCTT   -1      AAGGGGATAGGTGAGTGGCACGCCAAAGCCATGTGGGGGTCAGGGTACCAGAACTAGGCCCTGGTGCAGGAATACATCAGGCAGGAAGGGGGTGACAAAGGGGTCTGTGGGCTGCATCCATCTGGTTCTTGACTGTCGCTTATACACAT   AAFFJJFJJJJAJJJFJJJ<AJJJJJJJJJJFJFAJJ<F<JJJJJFJJJFFJJJJFAJJFJJJJFJF-AJJJJJJJJ-JF<A<AFF<FAJ<-7<-A<-F<AA-7AAJ-7FAFJJJFAJ-7-<-F---7<FFJJ<AAA)A<FFAJ<AFJF
E00552:427:H7CVVCCX2:3:1101:5538:1379 1:N:0:NAGAGAGG+NACTCCTT   -1      TCTTTCTCTAAAGAGCTGCTTCTAATCTCCTGCTGGAGGCCTCCTTGAGCTCTCAGACATGGCTTCTGCCAACCCAACCTGCTGCTGAGTAAGCCAAAGACTCTCCTTCCTGTAGAAACCAGCAAGGAGCGCCCTGCCCCTGTCCTTTT   AAAFJJFJJJJJJJJJJJJ<FJJJJJJAJJJJJJFJJ-AJJJJJJJJJJJJJJJJJFJJJJFJJJJJJJJJJJJJFA-FJ-FJAFJJF7FFFJJFFAA-7F-A7AJF7F<7A<FFFFJJFJFJ<A7)-7F<<)7-7A)-77A7AFA7<-

Can someone please help me understand how to get the summary statistics as I see them in my first output, into a summary text file so I can view it later for all my samples?

Thank you so much.

ADD COMMENTlink modified 10 months ago by ATpoint41k • written 10 months ago by bnayer260
1
gravatar for ATpoint
10 months ago by
ATpoint41k
Germany
ATpoint41k wrote:

This is printed to stderr so you can capture it basically like:

for i in files*
  do
  cutadapt (...) 2> summary_for_${i}
  done

The 2> does the trick. Once you have this you could summarize all reports into a nice html report using multiqc.

ADD COMMENTlink modified 10 months ago • written 10 months ago by ATpoint41k

Hi thanks for your answer! I will give it a try...so you mean I should add this 2> at the very end of my cutadapt code right? And do I need to give any file extension? assuming my summary file name is summary1. So I just write 2> summary1 is that it? or should I give some extension to the filename?

Thanks again!

ADD REPLYlink written 10 months ago by bnayer260

Yes, add it as you say. File names are optional, it will simply be a plain text document, I typically use txt but be aware that the name is unique in every iteration otherwise it will be overwritten. To append to the same file use 2>>.

ADD REPLYlink modified 10 months ago • written 10 months ago by ATpoint41k

Okay thank you so much I will give it a try!

ADD REPLYlink written 10 months ago by bnayer260

I have tried this method and it writes a blank file. -Found answer. I had specified the -o argument, so the summary was not going to standard error, it was going to standard output. Using 1> filename instead of 2>filename therefore worked.

ADD REPLYlink modified 9 months ago • written 9 months ago by samhairle0

Please show code, anecdotal descriptions are difficult to debug.

ADD REPLYlink written 9 months ago by ATpoint41k

Hi I'm having more or less the same issue. I'm trying to get all the details from the summary once cutadapt had removed all the pair-end primers. My script looks like this:

cutadapt<-"/share/apps/anaconda/envs/R360/bin/cutadapt"

system2(cutadapt, args = "--version")

path.cut.Man <- file.path(pathMan, "cutadapt") if(!dir.exists(path.cut.Man)) dir.create(path.cut.Man) fnFs.cut.Man <- file.path(path.cut.Man, basename(fnFsman)) fnRs.cut.Man <- file.path(path.cut.Man, basename(fnRsman))

FWD.RC <- dada2:::rc(FWD1) REV.RC <- dada2:::rc(REV)

Trim FWD and the reverse-complement of REV off of R1 (forward reads)

R1.flags <- paste("-g", FWD1, "-a", REV.RC)

Trim REV and the reverse-complement of FWD off of R2 (reverse reads)

R2.flags <- paste("-G", REV, "-A", FWD.RC)

Run Cutadapt

for(i in seq_along(fnFsman)) { system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2,
"-o 1> report.txt", fnFs.cut.Man[i], "-p 1> report.txt", fnRs.cut.Man[i],
fnFs.filtN.Man[i], fnRs.filtN.Man[i])) }

[ 8<------] 00:00:13 174,110 reads @ 75.6 µs/read; 0.79 M reads/minute [ 8=---------] 00:00:04 63,102 reads @ 76.2 µs/read; 0.79 M reads/minute [ 8<-----] 00:00:15 205,940 reads @ 75.8 µs/read; 0.79 M reads/minute [ 8<--------] 00:00:06 85,910 reads @ 74.6 µs/read; 0.80 M reads/minute [ 8=---------] 00:00:05 66,637 reads @ 76.4 µs/read; 0.79 M reads/minute [ 8<--------] 00:00:06 83,522 reads @ 76.0 µs/read; 0.79 M reads/minute [ 8=--------] 00:00:07 100,576 reads @ 73.1 µs/read; 0.82 M reads/minute [ 8<--------] 00:00:06 84,447 reads @ 73.5 µs/read; 0.82 M reads/minute [ 8<--------] 00:00:06 81,623 reads @ 73.8 µs/read; 0.81 M reads/minute [ 8<---------] 00:00:04 58,366 reads @ 77.6 µs/read; 0.77 M reads/minute [ 8=---------] 00:00:05 79,614 reads @ 73.5 µs/read; 0.82 M reads/minute [ 8<-------] 00:00:10 139,749 reads @ 76.5 µs/read; 0.78 M reads/minute [8=----------] 00:00:01 25,220 reads @ 77.9 µs/read; 0.77 M reads/minute [ 8<--------] 00:00:06 87,436 reads @ 73.7 µs/read; 0.81 M reads/minute [ 8=---------] 00:00:05 78,237 reads @ 74.7 µs/read; 0.80 M reads/minute

It works but the arguments for the txt output files or "report.txt" only take in to account the last summary data ( the one for :[ 8=---------] 00:00:05 78,237 reads @ 74.7 µs/read; 0.80 M reads/minute) and I would like to have them all.

I would be very grateful if someone could make some observations/ corrections for this issue.

ADD REPLYlink written 6 months ago by brenda.cla0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1257 users visited in the last hour