Question: Determining total nucleotides for paired end metagenomic sequences
1
gravatar for cssulliv
18 months ago by
cssulliv20
cssulliv20 wrote:

Hi, I currently have R1 and R1 fastq files for my sample_2. These were already run through quality control (QC) and I wanted to compare the number of nucleotides before and after QC.

I have found a script online that allows me to count the total number of sequences and nucleotides for each fastq file (See below). My question is to get the total number of nucleotides for the sample (and not R1 or R2 alone) do I simply add the total # of nucleotides from R1 and R2 (10,438,428,166+10,398,217,322)? Or do I have to merge R1 and R2 then run zgrep to get the total count?

zgrep . sample2_QUALITY_PASSED_R1.00.0_0.cor.fastq.gz |
     awk 'NR%4==2{c++; l+=length($0)}
          END{
                print "Number of reads: "c; 
                print "Number of bases in reads: "l
              }'

Number of reads: 70,369,761
Number of bases in reads: 10,438,428,166

zgrep . sample2_QUALITY_PASSED_R2.00.0_0.cor.fastq.gz |
     awk 'NR%4==2{c++; l+=length($0)}
          END{
                print "Number of reads: "c; 
                print "Number of bases in reads: "l
              }'
Number of reads: 70,369,761
Number of bases in reads: 10,398,217,322

Thank you in advance for your time and help!

sequence • 478 views
ADD COMMENTlink modified 18 months ago by genomax91k • written 18 months ago by cssulliv20
1

Use seqkit for saving time.

seqkit stats xxxx_R[12].*.fastq.gz

Results are something like these:

$ seqkit stats  reads_*.fq.gz
file           format  type  num_seqs  sum_len  min_len  avg_len  max_len
reads_1.fq.gz  FASTQ   DNA      2,500  567,516      226      227      229
reads_2.fq.gz  FASTQ   DNA      2,500  560,002      223      224      225

$ seqkit stats  reads_*.fq.gz -a
file           format  type  num_seqs  sum_len  min_len  avg_len  max_len   Q1   Q2   Q3  sum_gap  N50  Q20(%)  Q30(%)
reads_1.fq.gz  FASTQ   DNA      2,500  567,516      226      227      229  227  227  227        0  227   91.24   86.62
reads_2.fq.gz  FASTQ   DNA      2,500  560,002      223      224      225  224  224  224        0  224   91.06   87.66

ADD REPLYlink modified 18 months ago • written 18 months ago by shenwei3565.5k
1
gravatar for genomax
18 months ago by
genomax91k
United States
genomax91k wrote:

My question is to get the total number of nucleotides for the sample (and not R1 or R2 alone) do I simply add the total # of nucleotides from R1 and R2 (10,438,428,166+10,398,217,322)? Or do I have to merge R1 and R2 then run zgrep to get the total count?

If you want to get unique nucleotides that have been sequenced from that sample then you will need to merge/trim (to remove any adapters) the data and then count the remaining bases.

If your reads don't overlap (and thus can't be merged) and have no adapter sequence then you can simply add the R1/R2 counts.

Use @shenwei's solution in that case.

ADD COMMENTlink written 18 months ago by genomax91k

Better late than never but mahalo nui loa to you and @ shenwei for your help.

Just a clarification though, how would I know if the reads don't/do overlap? I essentially would like to get the total bp for this metagenomic sample for normalizing read recruitment data as the R1 and R2 sequences were also used to generate mapping files (.bam). :)

ADD REPLYlink written 3 days ago by cssulliv20

You can use bbmerge.sh from BBMap suite to see if the reads are overlapping. A guide is available.

ADD REPLYlink written 3 days ago by genomax91k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1710 users visited in the last hour