Question

Count pooled sequencing barcode in fastq

0

3.9 years ago

vinaykusuma ▴ 10

I have a FASTQ file in which each read carries a barcode, and each barcode corresponds to an individual. The barcode sits in the middle of the read, and each barcode is surrounded by a primer. The file contains 6 barcodes. I want to count how many times each barcode appears in the FASTQ file, while allowing for sequencing errors in the reads.

I wrote a program to do that, but it performs extremely poorly when there are 10k barcodes in a big FASTQ file. The data come from an ONT machine.

I'm looking for an existing tool for this task. I have already searched on Google, but I only find tools for barcode demultiplexing.

Any help will be appreciated.
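
For readers with the same problem, here is a minimal sketch of one way to keep the counting fast with thousands of barcodes: precompute every sequence within one substitution of each barcode, then look up each read window in a dictionary. The function names are hypothetical, and the sketch assumes that all barcodes share one length, that only substitutions matter (ONT reads also contain insertions and deletions, which it ignores), and that barcodes appear on the forward strand. It is not the program mentioned in the question.

```python
from collections import Counter
from itertools import islice

BASES = "ACGT"

def one_mismatch_neighbors(barcode):
    """Yield the barcode itself and every sequence one substitution away."""
    yield barcode
    for i, original in enumerate(barcode):
        for base in BASES:
            if base != original:
                yield barcode[:i] + base + barcode[i + 1:]

def build_lookup(barcodes):
    """Map each tolerated sequence to its barcode; drop ambiguous sequences."""
    lookup = {}
    ambiguous = set()
    for barcode in barcodes:
        for variant in one_mismatch_neighbors(barcode):
            if variant in lookup and lookup[variant] != barcode:
                ambiguous.add(variant)
            lookup[variant] = barcode
    for variant in ambiguous:
        del lookup[variant]
    return lookup

def count_barcodes(reads, barcodes):
    """Count the reads that contain each barcode (once per read per barcode)."""
    length = len(barcodes[0])  # assumption: all barcodes have the same length
    lookup = build_lookup(barcodes)
    counts = Counter()
    for read in reads:
        hits = {lookup[read[i:i + length]]
                for i in range(len(read) - length + 1)
                if read[i:i + length] in lookup}
        counts.update(hits)
    return counts

def fastq_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (uncompressed file)."""
    with open(path) as handle:
        for line in islice(handle, 1, None, 4):
            yield line.strip()
```

Usage would look like `count_barcodes(fastq_sequences("lampseq.fastq"), barcodes)`. Because every window of the read is checked, short barcodes can match by chance; restricting the search to the region between the primers would make the counts more conservative.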

sequencing alignment genome next-gen sequence • 1.8k views

1

If you open-source your code and share it on GitHub/GitLab, anyone can help you optimize it.

ADD REPLY • link 3.9 years ago by JC 13k

0

I will surely do it, but I need a quick fix now.

ADD REPLY • link 3.9 years ago by vinaykusuma ▴ 10

1

Since this appears to be LAMPseq data there is some software made available here. It is not for long reads but may be usable.

ADD REPLY • link 3.9 years ago by GenoMax 141k

0

I am aware of that software, but I felt there was an easier way to do this. BBDuk is an easy way, and it can be tuned to give a conservative estimate of the barcode counts. Thanks for the answer.

ADD REPLY • link 3.9 years ago by vinaykusuma ▴ 10

score 3 · Accepted Answer · 2020-06-16

3

3.9 years ago

GenoMax 141k

If you just need to count the presence of a specific sequence inside (?) a read, you can use bbduk.sh from the BBMap suite. You do something like bbduk.sh in=your.fq.gz literal=barcode_seq. You can easily filter those sequences using outm= and one of the other options. Use hdist=N to allow up to N mismatches.
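
To make the idea of mismatch-tolerant matching concrete, here is a minimal sketch of a sliding-window search that accepts a window when it differs from the barcode at no more than a given number of positions. It illustrates the concept only and is not BBDuk's implementation; the function names and the example read are made up.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def contains_with_mismatches(read, barcode, max_mismatches):
    """True if any window of the read is within max_mismatches of the barcode."""
    k = len(barcode)
    return any(hamming(read[i:i + k], barcode) <= max_mismatches
               for i in range(len(read) - k + 1))

# Made-up read carrying CACTTAGGAG with a single substitution (G -> C).
read = "GGTTCACTTAGCAGTTAA"
print(contains_with_mismatches(read, "CACTTAGGAG", 0))
print(contains_with_mismatches(read, "CACTTAGGAG", 1))
```

Allowing more mismatches recovers more error-containing reads but also raises the chance of spurious hits, especially for short barcodes.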

0

Thank you for the reply

I used this command

bbduk.sh in=lampseq.fastq literal=CCCATAAATC,CACTTAGGAG,CGACCTAGGA

It's not working.

Added 0 kmers; time: 0.003 seconds.
Memory: max=1468m, total=1468m, free=1436m, used=32m

Input is being processed as unpaired
Processing time: 0.070 seconds.

Input: 18281 reads 2394811 bases.
Contaminants: 0 reads (0.00%) 0 bases (0.00%)
Total Removed: 0 reads (0.00%) 0 bases (0.00%)
Result: 18281 reads (100.00%) 2394811 bases (100.00%)

Time: 0.075 seconds.
Reads Processed: 18281 243.76k reads/sec
Bases Processed: 2394k 31.93m bases/sec

ADD REPLY • link 3.9 years ago by vinaykusuma ▴ 10

1

You need to provide some detail about your read structure. Where exactly will the barcode be (assuming it is in the read itself: on the left, on the right, or anywhere in the read)? You should take a look at this guide for bbduk.sh to consider some additional parameters you need (namely k=4 etc.).
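
One likely reading of the "Added 0 kmers" line in the log above is that the k-mer length in use was longer than the 10-base barcodes, so no k-mers could be extracted from them. This is an assumption based on BBDuk's documented default of k=27. The sketch below, with a hypothetical helper, shows how the number of k-mers in a sequence depends on k.

```python
def kmers(sequence, k):
    """Return every substring of length k (empty list if the sequence is shorter than k)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

barcode = "CCCATAAATC"  # first barcode from the command above, 10 bases long

for k in (27, 10, 4):
    print(k, len(kmers(barcode, k)))
```

A smaller k yields more k-mers per barcode but makes each one less specific, so the value is a trade-off between sensitivity and chance matches.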

ADD REPLY • link 3.9 years ago by GenoMax 141k

0

Thanks a lot, I found the required options and it works, @genomax.

ADD REPLY • link 3.9 years ago by vinaykusuma ▴ 10

1

Great. Another thing to note: bbduk.sh is multi-threaded, so as long as your storage system supports it, you can use the threads= option to run more than one thread. I moved my comment to an answer so you can accept it (green check mark).

ADD REPLY • link 3.9 years ago by GenoMax 141k