Question

Determining total nucleotides for paired end metagenomic sequences

1

Entering edit mode

5.0 years ago

cssulliv ▴ 20

Hi, I currently have R1 and R1 fastq files for my sample_2. These were already run through quality control (QC) and I wanted to compare the number of nucleotides before and after QC.

I have found a script online that allows me to count the total number of sequences and nucleotides for each fastq file (See below). My question is to get the total number of nucleotides for the sample (and not R1 or R2 alone) do I simply add the total # of nucleotides from R1 and R2 (10,438,428,166+10,398,217,322)? Or do I have to merge R1 and R2 then run zgrep to get the total count?

zgrep . sample2_QUALITY_PASSED_R1.00.0_0.cor.fastq.gz |
     awk 'NR%4==2{c++; l+=length($0)}
          END{
                print "Number of reads: "c; 
                print "Number of bases in reads: "l
              }'

Number of reads: 70,369,761
Number of bases in reads: 10,438,428,166

zgrep . sample2_QUALITY_PASSED_R2.00.0_0.cor.fastq.gz |
     awk 'NR%4==2{c++; l+=length($0)}
          END{
                print "Number of reads: "c; 
                print "Number of bases in reads: "l
              }'
Number of reads: 70,369,761
Number of bases in reads: 10,398,217,322

Thank you in advance for your time and help!

sequence • 1.7k views

ADD COMMENT • link updated 5.0 years ago by GenoMax 141k • written 5.0 years ago by cssulliv ▴ 20

1

Entering edit mode

Use seqkit for saving time.

seqkit stats xxxx_R[12].*.fastq.gz

Results are something like these:

$ seqkit stats  reads_*.fq.gz
file           format  type  num_seqs  sum_len  min_len  avg_len  max_len
reads_1.fq.gz  FASTQ   DNA      2,500  567,516      226      227      229
reads_2.fq.gz  FASTQ   DNA      2,500  560,002      223      224      225

$ seqkit stats  reads_*.fq.gz -a
file           format  type  num_seqs  sum_len  min_len  avg_len  max_len   Q1   Q2   Q3  sum_gap  N50  Q20(%)  Q30(%)
reads_1.fq.gz  FASTQ   DNA      2,500  567,516      226      227      229  227  227  227        0  227   91.24   86.62
reads_2.fq.gz  FASTQ   DNA      2,500  560,002      223      224      225  224  224  224        0  224   91.06   87.66

ADD REPLY • link 5.0 years ago by shenwei356 8.4k

score 1 · Answer 1 · 2019-04-14

1

Entering edit mode

5.0 years ago

GenoMax 141k

My question is to get the total number of nucleotides for the sample (and not R1 or R2 alone) do I simply add the total # of nucleotides from R1 and R2 (10,438,428,166+10,398,217,322)? Or do I have to merge R1 and R2 then run zgrep to get the total count?

If you want to get unique nucleotides that have been sequenced from that sample then you will need to merge/trim (to remove any adapters) the data and then count the remaining bases.

If your reads don't overlap (and thus can't be merged) and have no adapter sequence then you can simply add the R1/R2 counts.

Use @shenwei's solution in that case.

ADD COMMENT • link 5.0 years ago by GenoMax 141k

0

Entering edit mode

Better late than never but mahalo nui loa to you and @ shenwei for your help.

Just a clarification though, how would I know if the reads don't/do overlap? I essentially would like to get the total bp for this metagenomic sample for normalizing read recruitment data as the R1 and R2 sequences were also used to generate mapping files (.bam). :)

ADD REPLY • link 3.5 years ago by cssulliv ▴ 20

0

Entering edit mode

You can use bbmerge.sh from BBMap suite to see if the reads are overlapping. A guide is available.

ADD REPLY • link 3.5 years ago by GenoMax 141k