Question

Tools for demultiplexing a large fastq file based on random in-line barcodes

7.7 years ago

alyssamolinaro91 ▴ 10

Hi all,

I will be using Drop-seq to prepare cDNA libraries for thousands of single cells, followed by single-end sequencing on a HiSeq2500 (read length = 100 bases). This will involve the addition of a unique 12-nucleotide cell barcode to the 5' end of all reads originating from the same cell (so bases 1-12 of each read). Unique molecular identifiers will also be used (they will be bases 13-20 of the reads). I won't be able to use the pipeline designed by the creators of Drop-seq because it requires paired-end sequencing. My problem is with demultiplexing: I am using randomly generated cell barcodes which will be supplied in excess, so I have no way of knowing beforehand which barcodes were actually used. Because of this, I am unable to supply the barcode sequence information that most scripts out there require as input. As someone with minimal computational/bioinformatics skills, I am not comfortable writing custom scripts to fit my needs.

Does anyone know of any scripts/packages that I would be able to use to demultiplex my data based on the in-line cell barcodes? I am planning on using the python package UMI tools (https://github.com/CGATOxford/UMI-tools) to extract the UMIs and deduplicate reads, but I have not been able to find any information on how to separate the reads from different cells into distinct fastq files.

Thanks in advance for any recommendations!

RNA-Seq demultiplexing drop-seq • 10k views

ADD COMMENT • link updated 6.5 years ago by Biostar 20 • written 7.7 years ago by alyssamolinaro91 ▴ 10

1

7.7 years ago

Asaf 10k

I once wrote a script that splits reads according to their barcodes; when it meets a new barcode that is not in the input table, it opens a new file for it. The script is at https://github.com/asafpr/RNAseq_scripts/blob/master/index_splitter.py. You should prepare an input table with a custom barcode of the same length as your expected barcodes and run with -u. I hope it will work fine; I only tested it on NextSeq sequencing results.
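For readers who want to see the general idea, here is a minimal Python sketch of splitting reads by an in-line barcode and creating a file the first time a barcode is seen. It is not the index_splitter.py script linked above and does not reproduce its options. The function names (read_fastq, split_by_barcode) and the output naming scheme are assumptions made for illustration; the 12-base barcode length comes from the question. Reads are written untrimmed so that the UMI in bases 13-20 is still available downstream.

```python
import gzip
from pathlib import Path

BARCODE_LEN = 12  # cell barcode length from the question (bases 1-12)


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a FASTQ text handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def split_by_barcode(fastq_path, out_dir, barcode_len=BARCODE_LEN):
    """Write each read to <out_dir>/<barcode>.fastq (hypothetical naming scheme).

    A file is created the first time a barcode is seen and appended to afterwards.
    Returns a dict mapping barcode -> number of reads written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = {}
    with opener(fastq_path, "rt") as handle:
        for header, seq, plus, qual in read_fastq(handle):
            barcode = seq[:barcode_len]
            # Reopening in append mode is slow but avoids keeping thousands
            # of file handles open at once.
            mode = "a" if barcode in counts else "w"
            with open(out_dir / f"{barcode}.fastq", mode) as out:
                out.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
            counts[barcode] = counts.get(barcode, 0) + 1
    return counts
```

With random barcodes, sequencing errors will produce many files holding only one or two reads, so in practice it is probably worth filtering to the frequent barcodes first.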

0

6.7 years ago

i.sudbery 19k

The new version of UMI-Tools now has mechanisms for dealing with data that has inline cell barcodes.

score 2 · Accepted Answer · 2016-08-07

2

7.7 years ago

igor 13k

I believe fastq-multx can help. It works with in-read barcodes. The -n flag should "print likely barcode list". See: https://github.com/ExpressionAnalysis/ea-utils/blob/wiki/FastqMultx.md

If that doesn't help, you can determine which barcodes you have and extract the barcode sequences yourself (read FASTQ file, get the lines with the nucleotide sequence, take the first 12 characters, sort, keep only unique sequences):

zcat file.fastq.gz | sed -n '2~4p' | cut -c 1-12 | sort | uniq

But that will be noisy, because sequencing errors create many barcodes that occur only once or twice. To get the 100 most frequent barcodes (more likely to be real), which you can then use for demultiplexing:

zcat file.fastq.gz | sed -n '2~4p' | cut -c 1-12 | sort | uniq -c | sort -nr | head -100
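If you would rather stay in Python, the same counting step can be sketched roughly as below. The function name top_barcodes and its parameters are made up for this example; the 12-base prefix and the cutoff of 100 mirror the shell command above. It accepts plain or gzipped FASTQ, choosing by file extension.

```python
import gzip
from collections import Counter
from itertools import islice


def top_barcodes(fastq_path, barcode_len=12, top_n=100):
    """Return the top_n most frequent read prefixes as (barcode, count) pairs."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        # Sequence lines are every 4th line, starting with the 2nd (index 1).
        for seq in islice(handle, 1, None, 4):
            counts[seq.rstrip("\n")[:barcode_len]] += 1
    return counts.most_common(top_n)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python top_barcodes.py file.fastq.gz
    if len(sys.argv) > 1:
        for barcode, n in top_barcodes(sys.argv[1]):
            print(n, barcode)
```

The cutoff of 100 is arbitrary; looking at where the counts drop off sharply is probably a better guide to how many real cells there are.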

0

Hi Igor, your zcat solution appears to have done the trick. I used it to generate a barcode list, which I then used as the barcode input file for fastx_barcode_splitter (http://hannonlab.cshl.edu/fastx_toolkit/commandline.html). I now have demultiplexed fastq files. One strange thing though, when I aligned the fastqs with bowtie2 I ended up getting 100% of reads aligning >1 times. The reads that I'm using at the moment are very short (12 bases after trimming off barcodes/UMIs, they were designed with a different analysis pipeline in mind) so I wonder if that might have something to do with the strange alignments. Anyway, I think once I generate my own data I will be able to troubleshoot the issue better.

ADD REPLY • link 7.7 years ago by alyssamolinaro91 ▴ 10

0

It's not strange. It's very hard to get a unique alignment with 12 bases. 100% multi-mapping seems a bit excessive, but I wouldn't have expected less than 90%. I don't think you can expect a high fraction of unique alignments below 25 bases.
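A rough back-of-the-envelope calculation shows why. The sketch below estimates how many exact matches a random read of a given length would have by chance in a random genome. The genome size of about 3.1 billion bases (roughly human) is an assumption, since the organism was not stated, and the function name expected_random_hits is made up for this example. Real genomes are repetitive rather than random, so this probably understates multi-mapping for longer reads.

```python
def expected_random_hits(read_len, genome_size, both_strands=True):
    """Expected number of exact chance matches of a random read in a random genome."""
    positions = genome_size - read_len + 1  # possible start positions
    strands = 2 if both_strands else 1
    return strands * positions / 4 ** read_len


# Assumption: approximate human genome size; the thread does not name the organism.
GENOME_BP = 3_100_000_000

for length in (12, 16, 20, 25):
    print(length, expected_random_hits(length, GENOME_BP))
```

Under these assumptions a 12-base read is expected to match several hundred places by chance, while at 25 bases the expected number of chance matches is far below one.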

See earlier discussion here: Length Of Read Needed To Confidently Map Sequence

ADD REPLY • link 7.7 years ago by igor 13k