Question: How to get proportion of specific sequence in fastq file
0
gravatar for lkianmehr
12 weeks ago by
lkianmehr30
France
lkianmehr30 wrote:

Hello,

Do you have any idea about How to get proportion of specific sequence (defined as a separate fastq file=

@1
TAACCCTAACCCTAACCCTAACCC
+
JJJJJJJJJJJJJJJJJJJJJJJJ
@2
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

) in several paired-read (fastq) or Bam file of RNA-seq

thanks in advance

ADD COMMENTlink modified 12 weeks ago by shenwei3564.6k • written 12 weeks ago by lkianmehr30

Do you mean getting the first half of every read and save them as a separate file? Did you try anything yet?

ADD REPLYlink written 12 weeks ago by ATpoint16k

I just want to know how many of these specific sequence are there in these paired-read sequence to get proportion of that between all? I have tried by

 bbduk.sh in1=D1_L001_1.fastq.gz in2=D1_L001_2 out=D1.fq literal=TAACCCTAACCCTAACCCTAACCC k=24 mm=f int=f

but this command just extract all reads contain the sequence I think. I also have used Salmon to make an Index of fasta of this sequence and quantify all reads on, but was not good trick

ADD REPLYlink modified 12 weeks ago by Vijay Lakhujani4.1k • written 12 weeks ago by lkianmehr30
1

If TAACCCTAACCCTAACCCTAACCC is the full length sequence you are trying to identify you should be able to get the number that contain the sequence by just running the command and not providing an out= file name. This will do the operation and generate relevant statistics file. Sounds like that is all you are interested in. There should be a number for total sequences processed.

ADD REPLYlink written 12 weeks ago by genomax67k

I have got the result like below, what does it mean? contaminants determine the number of reads including that specific sequence or something else?

Input:                      65975862 reads      6554014910 bases.
Contaminants:               195232 reads (0.30%)    19519262 bases (0.30%)
Total Removed:              1040136 reads (1.58%)   61775988 bases (0.94%)
Result:                     64935726 reads (98.42%)     6492238922 bases (99.06%)

Time:                           133.851 seconds.
Reads Processed:      65975k    492.91k reads/sec
Bases Processed:       6554m    48.97m bases/sec
ADD REPLYlink modified 12 weeks ago by genomax67k • written 12 weeks ago by lkianmehr30
1

Is that the output from the command you included above? BBDuk by default works in filter mode. Take a look at this guide to get additional details (and also look at the in-line help).

At first glance Total Removed: 1040136 reads (1.58%) should be the reads that have the sequence you specified in the literal directive. Since you are running this in PE mode either/both reads may have the sequence. If you want to be specific you could run the files individually to get a more precise number.

ADD REPLYlink written 12 weeks ago by genomax67k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1697 users visited in the last hour