Question: Determining total nucleotides for paired end metagenomic sequences
0
gravatar for cssulliv
11 days ago by
cssulliv0
cssulliv0 wrote:

Hi, I currently have R1 and R1 fastq files for my sample_2. These were already run through quality control (QC) and I wanted to compare the number of nucleotides before and after QC.

I have found a script online that allows me to count the total number of sequences and nucleotides for each fastq file (See below). My question is to get the total number of nucleotides for the sample (and not R1 or R2 alone) do I simply add the total # of nucleotides from R1 and R2 (10,438,428,166+10,398,217,322)? Or do I have to merge R1 and R2 then run zgrep to get the total count?

zgrep . sample2_QUALITY_PASSED_R1.00.0_0.cor.fastq.gz |
     awk 'NR%4==2{c++; l+=length($0)}
          END{
                print "Number of reads: "c; 
                print "Number of bases in reads: "l
              }'

Number of reads: 70,369,761
Number of bases in reads: 10,438,428,166

zgrep . sample2_QUALITY_PASSED_R2.00.0_0.cor.fastq.gz |
     awk 'NR%4==2{c++; l+=length($0)}
          END{
                print "Number of reads: "c; 
                print "Number of bases in reads: "l
              }'
Number of reads: 70,369,761
Number of bases in reads: 10,398,217,322

Thank you in advance for your time and help!

sequence • 98 views
ADD COMMENTlink modified 10 days ago by genomax65k • written 11 days ago by cssulliv0

Use seqkit for saving time.

seqkit stats xxxx_R[12].*.fastq.gz

Results are something like these:

$ seqkit stats  reads_*.fq.gz
file           format  type  num_seqs  sum_len  min_len  avg_len  max_len
reads_1.fq.gz  FASTQ   DNA      2,500  567,516      226      227      229
reads_2.fq.gz  FASTQ   DNA      2,500  560,002      223      224      225

$ seqkit stats  reads_*.fq.gz -a
file           format  type  num_seqs  sum_len  min_len  avg_len  max_len   Q1   Q2   Q3  sum_gap  N50  Q20(%)  Q30(%)
reads_1.fq.gz  FASTQ   DNA      2,500  567,516      226      227      229  227  227  227        0  227   91.24   86.62
reads_2.fq.gz  FASTQ   DNA      2,500  560,002      223      224      225  224  224  224        0  224   91.06   87.66

ADD REPLYlink modified 11 days ago • written 11 days ago by shenwei3564.6k
0
gravatar for genomax
10 days ago by
genomax65k
United States
genomax65k wrote:

My question is to get the total number of nucleotides for the sample (and not R1 or R2 alone) do I simply add the total # of nucleotides from R1 and R2 (10,438,428,166+10,398,217,322)? Or do I have to merge R1 and R2 then run zgrep to get the total count?

If you want to get unique nucleotides that have been sequenced from that sample then you will need to merge/trim (to remove any adapters) the data and then count the remaining bases.

If your reads don't overlap (and thus can't be merged) and have no adapter sequence then you can simply add the R1/R2 counts.

Use @shenwei's solution in that case.

ADD COMMENTlink written 10 days ago by genomax65k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1801 users visited in the last hour