cutadapt summary statistics after every run
2.7 years ago
bnayer26 • 0

Hi, I am new to analyzing RNA-seq data and have just started running cutadapt to trim adapter sequences from my paired-end data. If I run one command at a time, I see a very nice summary, as described in this section of the cutadapt documentation (https://cutadapt.readthedocs.io/en/stable/guide.html#cutadapt-s-output). It looks something like this:

    This is cutadapt 2.7 with Python 3.5.2
Command line parameters: --cores=14 -q 10,10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --info-file=summary1.txt -o EBSP3_L3_1_trimmed.fq.gz -p EBSP3_L3_2_trimmed.fq.gz /home/bnay2/JeanProject/JeanRawData/EBSP3_L3_1.fq.gz /home/bnay2/JeanProject/JeanRawData/EBSP3_L3_2.fq.gz
Processing reads on 14 cores in paired-end mode ...

=== Summary ===

Pairs written (passing filters):    31,431,443 (100.0%)

Total basepairs processed: 9,429,432,900 bp
Quality-trimmed:               3,496,279 bp (0.0%)
Total written (filtered):  9,420,283,738 bp (99.9%)

Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC; Type: regular 3'; Length: 34; Trimmed: 866710 times.

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-34 bp: 3

A: 29.7%
C: 30.1%
G: 27.9%
T: 12.3%
none/other: 0.0%

Overview of removed sequences
length  count   expect  max.err error counts
3   697484  491116.3    0   697484
4   129061  122779.1    0   129061
5   33700   30694.8 0   33700
6   2812    7673.7  0   2812
7   982 1918.4  0   982
8   144 479.6   0   144


But I want to run the command in a loop as part of a script, and save the summary statistics for each trimming run to a text file that I can view later.

I see that the documentation mentions this:

Format of the info file: When the --info-file command-line parameter is given, detailed information about the found adapters is written to the given file. The output is a tab-separated text file. Each line corresponds to one read of the input file (unless --times is used, see below). A row is written for all reads, even those that are discarded from the final output FASTA/FASTQ due to filtering options (such as --minimum-length).

But when I tried --info-file=summary.txt, it generated a 12 GB file containing many lines like the following, which is very different from the summary output shown above:

E00552:427:H7CVVCCX2:3:1101:5071:1379 1:N:0:NAGAGAGG+NACTCCTT   -1      ATCCTGTGAGTGTGTGTATACAGATATTATAGAAATGCTTTTAGGCATCTTTGAAACCAAGCCTATGTGTGAATAGTTTGTGAAAGAGATGGCAAACTCGGATGGGGGAATACCCAAGGGTCATGGGTTTTATGTGTCGCTTTGGTGTA   AAFFJJJJJJJJFJFJJJJFJJJJJJJJJJJJJJJJJFJJJJJJFJJJJJJJJJJJJJJJJJJJJFJ<JJJJJJJ<A-<<AJJJFFAJFJJJJJFFA-A-JF7A<<J-7<AFJ7-A<F<J7AJF7AJ7---<FA-A--<)7--<<-<AA
E00552:427:H7CVVCCX2:3:1101:5294:1379 1:N:0:NAGAGAGG+NACTCCTT   -1      AAGGGGATAGGTGAGTGGCACGCCAAAGCCATGTGGGGGTCAGGGTACCAGAACTAGGCCCTGGTGCAGGAATACATCAGGCAGGAAGGGGGTGACAAAGGGGTCTGTGGGCTGCATCCATCTGGTTCTTGACTGTCGCTTATACACAT   AAFFJJFJJJJAJJJFJJJ<AJJJJJJJJJJFJFAJJ<F<JJJJJFJJJFFJJJJFAJJFJJJJFJF-AJJJJJJJJ-JF<A<AFF<FAJ<-7<-A<-F<AA-7AAJ-7FAFJJJFAJ-7-<-F---7<FFJJ<AAA)A<FFAJ<AFJF
E00552:427:H7CVVCCX2:3:1101:5538:1379 1:N:0:NAGAGAGG+NACTCCTT   -1      TCTTTCTCTAAAGAGCTGCTTCTAATCTCCTGCTGGAGGCCTCCTTGAGCTCTCAGACATGGCTTCTGCCAACCCAACCTGCTGCTGAGTAAGCCAAAGACTCTCCTTCCTGTAGAAACCAGCAAGGAGCGCCCTGCCCCTGTCCTTTT   AAAFJJFJJJJJJJJJJJJ<FJJJJJJAJJJJJJFJJ-AJJJJJJJJJJJJJJJJJFJJJJFJJJJJJJJJJJJJFA-FJ-FJAFJJF7FFFJJFFAA-7F-A7AJF7F<7A<FFFFJJFJFJ<A7)-7F<<)7-7A)-77A7AFA7<-


Can someone please help me understand how to get the summary statistics as I see them in my first output, into a summary text file so I can view it later for all my samples?

Thank you so much.

2.7 years ago
ATpoint 65k

This is printed to stderr so you can capture it basically like:

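for i in files*
do
  # "..." stands for your own cutadapt options and input/output files
  cutadapt ... 2> "${i}_report.txt"
done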


The 2> does the trick. Once you have the reports, you could summarize them all in a nice HTML report using MultiQC.
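A minimal sketch of such a loop, assuming bash. `trim_one` is a hypothetical stand-in that only mimics a tool printing a report to stdout and a progress line to stderr; in real use its body would be your own cutadapt call. The `demo_reports` directory and the `.fq.gz` naming are also assumptions for illustration. Since the stream that carries the report can depend on how the tool is invoked, this version sends both streams to one report per input file.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a cutadapt call: prints a fake report to stdout
# and a progress message to stderr. Replace the body with your real command.
trim_one() {
  echo "=== Summary ==="
  echo "Input file: $1"
  echo "progress: finished $1" >&2
}

mkdir -p demo_reports
touch demo_reports/sampleA.fq.gz demo_reports/sampleB.fq.gz   # empty placeholder inputs

for fq in demo_reports/*.fq.gz
do
  sample=${fq%.fq.gz}    # strip the extension, e.g. demo_reports/sampleA
  # ">" sends stdout to the report file, "2>&1" sends stderr to the same place
  trim_one "$fq" > "${sample}_report.txt" 2>&1
done

ls demo_reports
```

Each input file ends up with its own `*_report.txt`, so nothing is overwritten between iterations.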

Hi, thanks for your answer! I will give it a try. So you mean I should add this 2> at the very end of my cutadapt command, right? And do I need to give a file extension? Assuming my summary file name is summary1, do I just write 2> summary1, or should I add an extension to the file name?

Thanks again!

Yes, add it as you say. File extensions are optional; the result is simply a plain text document. I typically use txt, but make sure the file name is unique in every iteration, otherwise it will be overwritten. To append to the same file, use 2>>.
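The difference between overwriting and appending can be seen in a small sketch, again assuming bash. `trim_one` is a hypothetical stand-in for a command whose report arrives on stderr, and the `demo_append` directory and sample names are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a cutadapt call whose report arrives on stderr.
trim_one() {
  echo "Summary for $1" >&2
}

mkdir -p demo_append
: > demo_append/overwritten.txt    # start both files empty
: > demo_append/combined.txt

for sample in sampleA sampleB
do
  # "2>" truncates the file each time, so only the last report survives
  trim_one "$sample" 2> demo_append/overwritten.txt
  # "2>>" appends; a header line keeps the samples apart
  echo "##### $sample #####" >> demo_append/combined.txt
  trim_one "$sample" 2>> demo_append/combined.txt
done

wc -l demo_append/overwritten.txt demo_append/combined.txt
```

With a single combined file, a header line per sample makes it easier to tell the reports apart later.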

I tried this method and it wrote a blank file. Found the answer: I had specified the -o argument, so the summary was going to standard output, not standard error. Using 1> filename instead of 2> filename therefore worked.
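If it is unclear which stream carries the report for a given invocation, one way to check is to redirect the two streams to separate files and look for the summary header. This is only a sketch, assuming bash: `trim_one` is a hypothetical stand-in that puts the report on stdout (the situation described above), and `demo_streams` is a made-up directory name.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in: report on stdout, progress on stderr.
# Swap in your own command to test how it behaves.
trim_one() {
  echo "=== Summary ==="
  echo "progress bar" >&2
}

mkdir -p demo_streams
# "1>" (or plain ">") captures stdout, "2>" captures stderr
trim_one 1> demo_streams/stdout.txt 2> demo_streams/stderr.txt

# Whichever file contains the summary header tells you which redirect to use
for f in demo_streams/stdout.txt demo_streams/stderr.txt
do
  if grep -q "=== Summary ===" "$f"; then
    echo "report found in $f"
  fi
done
```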

Please show your code; anecdotal descriptions are difficult to debug.

Okay thank you so much I will give it a try!

Hi, I'm having more or less the same issue. I'm trying to capture all the details from the summary after cutadapt has removed the paired-end primers. My script looks like this:

path.cut.Man <- file.path(pathMan, "cutadapt")
if(!dir.exists(path.cut.Man)) dir.create(path.cut.Man)
fnFs.cut.Man <- file.path(path.cut.Man, basename(fnFsman))
fnRs.cut.Man <- file.path(path.cut.Man, basename(fnRsman))

# Trim FWD and the reverse-complement of REV off of R1 (forward reads)

R1.flags <- paste("-g", FWD1, "-a", REV.RC)

# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)

R2.flags <- paste("-G", REV, "-A", FWD.RC)

for(i in seq_along(fnFsman)) {
  system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2,
                             "-o 1> report.txt", fnFs.cut.Man[i], "-p 1> report.txt", fnRs.cut.Man[i],
                             fnFs.filtN.Man[i], fnRs.filtN.Man[i]))
}

It works, but "report.txt" only contains the summary from the last iteration (the one for: [ 8=---------] 00:00:05 78,237 reads @ 74.7 µs/read; 0.80 M reads/minute), and I would like to have them all.

I would be very grateful if someone could offer some observations or corrections for this issue.
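The loop above redirects every iteration to the same report.txt, so each run replaces the previous report. In R, `system2` also has a `stdout` argument that accepts a file name, which could be given a different value per iteration. The same idea is sketched below in bash rather than R, to stay with the shell used earlier in the thread: one report per read pair, named after the sample. `trim_pair` is a hypothetical stand-in for the paired-end cutadapt call, and the `_R1.fastq`/`_R2.fastq` naming and the `demo_pairs` directory are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a paired-end cutadapt call that prints its report
# to stdout. Replace the body with the real command and options.
trim_pair() {
  echo "=== Summary ==="
  echo "R1: $1"
  echo "R2: $2"
}

mkdir -p demo_pairs
touch demo_pairs/s1_R1.fastq demo_pairs/s1_R2.fastq \
      demo_pairs/s2_R1.fastq demo_pairs/s2_R2.fastq   # empty placeholder inputs

for r1 in demo_pairs/*_R1.fastq
do
  r2=${r1/_R1.fastq/_R2.fastq}          # mate file: swap R1 for R2
  sample=$(basename "$r1" _R1.fastq)    # e.g. "s1"
  # One report per pair, so no iteration overwrites another
  trim_pair "$r1" "$r2" > "demo_pairs/${sample}_report.txt"
done

ls demo_pairs
```

Alternatively, keeping a single file and appending with `>>` would also preserve every summary.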

3 months ago

Summary statistics (written to stdout when -o is given) can be saved with:

cutadapt command > summary_statistics.txt


stderr can be saved with:

cutadapt command 2> stderr.txt