Cell Barcode Identification and Counting
8 months ago

Hi All,

I completed a single cell DNA barcoding experiment and have the .fastq file. The reads in the .fastq file are the 40 bp cell barcodes. Is there a way to count the frequency of the barcodes in the .fastq file de novo? I would like to determine how many different 40 bp barcodes are present in the population, and then count them.

Maybe this needs to be done in two steps: first identify the barcode sequences that are present, then use those sequences in a counting step?

Is there any advice on how to do this?


Try UMI-tools, in particular its whitelist command: https://umi-tools.readthedocs.io/en/latest/reference/whitelist.html

The difficulty is deciding which detected barcodes are real and which are just noise. Read through the docs; they explain this.
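
For a first look at the raw data before any error correction, a minimal Python sketch of exact-match counting could look like the one below. It is not part of UMI-tools: `count_barcodes` and its parameters are names made up for illustration, and it assumes a standard FASTQ with exactly four lines per record where each read is the barcode itself. Because it only counts identical sequences, sequencing and synthesis errors show up as extra low-count barcodes, which is the noise problem mentioned above.

```python
import gzip
from collections import Counter
from itertools import islice


def count_barcodes(fastq_path, barcode_length=40):
    """Count identical read sequences in a 4-line-per-record FASTQ file.

    Hypothetical helper for illustration: reads whose length differs from
    barcode_length are skipped, and no error correction is performed.
    """
    # Transparently handle gzip-compressed input based on the file name.
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        # The sequence is the 2nd line of every 4-line record.
        for line in islice(handle, 1, None, 4):
            seq = line.strip().upper()
            if len(seq) == barcode_length:
                counts[seq] += 1
    return counts
```

Calling `count_barcodes("reads.fastq").most_common(20)` (with a placeholder file name) would list the 20 most frequent sequences, and `len(counts)` gives the number of distinct barcodes seen, noise included.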

8 months ago

I just dealt with a similar problem with semi-random 30 bp barcodes. I found this recent review extremely helpful, as it was my first time dealing with random barcodes. Barcode synthesis is an imperfect process, especially for runs of homopolymers or dinucleotides - indels are extremely common. Barcode correction is necessary if you want accurate counts.

My approach, which I used to look at the engraftment efficiency of a cell line in a xenograft mouse model, was basically:

  • Take a look at the reads and trim primer/adapter sequences via cutadapt, leaving only the supposed barcode.
  • Correct barcodes & count with starcode.
  • For each sample, rank barcodes by count and consider the barcodes accounting for the first 90% of cumulative reads as "reasonably expressed". In practice, this results in most barcodes with very few counts being ignored.

Adjust as necessary for your application and experimental setup.
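
The cumulative-read cutoff in the last step can be sketched in Python roughly as follows. This is an illustration under assumptions, not my actual script: `counts` is taken to be a plain dict of barcode to (already corrected) read count for one sample, the function name is made up, and I assume the barcode that crosses the 90% mark is kept.

```python
def top_barcodes_by_cumulative_fraction(counts, fraction=0.90):
    """Return (barcode, count) pairs covering the first `fraction` of reads.

    Barcodes are ranked by descending count and kept until the running
    total reaches fraction * total reads. The barcode that crosses the
    threshold is included (an assumption about how the cutoff is defined).
    """
    total = sum(counts.values())
    if total == 0:
        return []
    kept = []
    running = 0
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for barcode, n in ranked:
        if running >= fraction * total:
            break  # enough reads are already accounted for
        kept.append((barcode, n))
        running += n
    return kept


# Toy counts for one sample (invented numbers, for illustration only).
example = {"A": 50, "B": 30, "C": 15, "D": 3, "E": 2}
print(top_barcodes_by_cumulative_fraction(example))
```

With the toy counts, the two rarest barcodes fall below the cutoff and are dropped, which mirrors the effect described above.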


