Question

How do I find the percentage of paired-end fastq files containing a given string

3.2 years ago

lkianmehr ▴ 100

I need to calculate the percentage of reads in paired-end FASTQ files that contain the string "TAACCCTAACCCTAACCCTAACCC". I ran:

bbduk.sh in1=1.fastq.gz in2=2.fastq.gz literal=TAACCCTAACCCTAACCCTAACCC k=24 mm=f int=f

and I got:

Input:                      65975862 reads      6554014910 bases.
Contaminants:               195232 reads (0.30%)    19519262 bases (0.30%)
Total Removed:              1040136 reads (1.58%)   61775988 bases (0.94%)
Result:                     64935726 reads (98.42%)     6492238922 bases (99.06%)

Should I take Total Removed (1.58%) as the percentage of reads containing that string in the paired-end FASTQ files?

In addition, I am using grep with this command: grep -A 2 -B 1 ' TAACCCTAACCCTAACCCTAACCC ' D1_TTAGGC_L001_R1_001.fastq.gz | sed '/--/d' > out_D1_R1.fq

It gives about 7526 lines containing the string. I divided by the total number of sequences (32987931) to get the percentage: 7526/32987931 ≈ 0.00023, i.e. about 0.02%. Does this mean that only 0.02% of the reads in the forward FASTQ file contain that string?

Thanks

fastq bbduk • 487 views

updated 3.2 years ago by GenoMax 142k • written 3.2 years ago by lkianmehr ▴ 100

score 0 · Answer 1 · 2021-03-10

3.2 years ago

GenoMax 142k

Total Removed counts the reads that contain that string. You can check those reads by using the outm1= and outm2= directives to collect them in files.

grep does not look for reverse-complement matches, so keep that in mind. It also does exact matching only, with no allowance for sequencing errors.
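To make the reverse-complement point concrete, here is a minimal Python sketch (not part of BBDuk or grep) that counts reads containing the string on either strand and converts the count to a percentage. It assumes standard four-line FASTQ records with unwrapped sequences, and it does exact matching only, so it shares grep's second limitation. The function names `reverse_complement`, `count_reads_with_motif`, and `percent` are made up for this illustration.

```python
import gzip

MOTIF = "TAACCCTAACCCTAACCCTAACCC"


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (other symbols such as N are kept)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def count_reads_with_motif(fastq_path, motif=MOTIF):
    """Return (matching_reads, total_reads) for one FASTQ file.

    Assumption: every record is exactly four lines, so the sequence
    is the second line of each group of four.
    """
    rc = reverse_complement(motif)
    # Read .gz files through gzip; anything else is opened as plain text.
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    matching = 0
    total = 0
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 1:
                continue  # skip header, '+' and quality lines
            total += 1
            seq = line.strip()
            # Exact match on either strand; no mismatches are tolerated.
            if motif in seq or rc in seq:
                matching += 1
    return matching, total


def percent(matching, total):
    """Convert a count into a percentage of the total (0.0 for an empty file)."""
    return 100.0 * matching / total if total else 0.0
```

For example, `percent(*count_reads_with_motif("D1_TTAGGC_L001_R1_001.fastq.gz"))` would give the percentage for the forward file alone; the reverse file has to be counted separately. Note also that `grep -A 2 -B 1` prints roughly four lines per hit, so if the 7526 figure is the line count of the output file rather than the number of matching reads, the true number of matching reads would be about a quarter of that.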
