How do I find the percentage of paired-end fastq files containing a given string
3.2 years ago
lkianmehr ▴ 100

I need to calculate the percentage of reads in paired-end fastq files that contain the string "TAACCCTAACCCTAACCCTAACCC". So I ran:

bbduk.sh in1=1.fastq.gz in2=2.fastq.gz literal=TAACCCTAACCCTAACCCTAACCC k=24 mm=f int=f

and I got:

Input:                      65975862 reads      6554014910 bases.
Contaminants:               195232 reads (0.30%)    19519262 bases (0.30%)
Total Removed:              1040136 reads (1.58%)   61775988 bases (0.94%)
Result:                     64935726 reads (98.42%)     6492238922 bases (99.06%)

Should I take Total Removed (1.58%) as the percentage of reads in the paired-end fastq files that contain that string?

In addition, I am using grep with this command:

grep -A 2 -B 1 'TAACCCTAACCCTAACCCTAACCC' D1_TTAGGC_L001_R1_001.fastq.gz | sed '/--/d' > out_D1_R1.fq

It gives about 7526 lines containing the string. I divided by the total number of sequences (32987931) to get the percentage: 7526/32987931 ≈ 0.00023, i.e. about 0.02%. Does this mean that only 0.02% of the reads in the forward fastq file contain that string?
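As a quick check of the arithmetic above, the ratio can be computed directly. This is a minimal sketch that uses only the two counts quoted in the post; the variable names are illustrative.

```python
# Counts as reported in the post (forward file only).
matching_lines = 7526      # lines reported as containing the string
total_reads = 32987931     # total sequences in the forward fastq file

fraction = matching_lines / total_reads
percent = 100 * fraction   # a fraction becomes a percentage when multiplied by 100

print(f"fraction = {fraction:.6f}")
print(f"percent  = {percent:.4f}%")
```

Note that grep -B 1 -A 2 writes four lines per matching sequence line, so if 7526 is the line count of out_D1_R1.fq rather than the number of matching lines, the number of matching reads would be about a quarter of that.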

Thanks

fastq bbduk • 487 views
3.2 years ago
GenoMax 142k

Total Removed counts the reads that contain that string. You can inspect those reads by adding the outm1= and outm2= directives, which collect them in files.

Keep in mind that grep does not look for reverse-complement matches. It also does perfect matching only, with no allowance for sequencing errors.
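To illustrate the reverse-complement point, here is a minimal Python sketch that counts reads containing either the string or its reverse complement. It is not what bbduk.sh does internally. It assumes standard four-line FASTQ records and exact matching only, and the function names are hypothetical.

```python
import gzip

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_reads_with_motif(fastq_path, motif):
    """Count reads whose sequence contains motif or its reverse complement.

    Assumes four lines per record, so the sequence is the second line of
    each record. Returns (matching_reads, total_reads).
    """
    motif = motif.upper()
    rc = reverse_complement(motif)
    total = hits = 0
    # Read gzipped or plain-text files depending on the extension.
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # sequence line of each 4-line record
                total += 1
                seq = line.strip().upper()
                if motif in seq or rc in seq:
                    hits += 1
    return hits, total
```

For example, calling count_reads_with_motif("D1_TTAGGC_L001_R1_001.fastq.gz", "TAACCCTAACCCTAACCCTAACCC") returns the two counts for that one file, and 100 * hits / total gives the percentage. Because only the sequence line is tested, header and quality lines cannot produce false matches.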
