Question: Tools for demultiplexing a large fastq file based on random in-line barcodes
alyssamolinaro91 (Canada) wrote 14 months ago (0 votes):

Hi all,

I will be using Drop-seq to prepare cDNA libraries for thousands of single cells, followed by single-end sequencing on a HiSeq2500 (read length = 100 bases). This will add a unique 12-nucleotide cell barcode to the 5' end of all reads originating from the same cell (bases 1-12 of each read). Unique molecular identifiers (UMIs) will also be used (bases 13-20 of each read). I won't be able to use the pipeline designed by the creators of Drop-seq because it requires paired-end sequencing. My problem is with demultiplexing: I am using randomly generated cell barcodes, which will be supplied in excess, so I have no way of knowing beforehand which barcodes were actually used. Because of this, I am unable to supply the barcode sequences that most existing scripts require as input. As someone with minimal computational/bioinformatics skills, I am not comfortable writing custom scripts to fit my needs.

Does anyone know of any scripts/packages that I could use to demultiplex my data based on the in-line cell barcodes? I am planning to use the Python package UMI-tools (https://github.com/CGATOxford/UMI-tools) to extract the UMIs and deduplicate reads, but I have not been able to find any information on how to separate the reads from different cells into distinct FASTQ files.

Thanks in advance for any recommendations!

igor (United States) wrote 14 months ago (2 votes):

I believe fastq-multx can help. It works with in-read barcodes. The -n flag should "print likely barcode list". See: https://github.com/ExpressionAnalysis/ea-utils/blob/wiki/FastqMultx.md

If that doesn't help, you can determine which barcodes you have and extract the barcode sequences yourself (read FASTQ file, get the lines with the nucleotide sequence, take the first 12 characters, sort, keep only unique sequences):

zcat file.fastq.gz | sed -n '2~4p' | cut -c 1-12 | sort | uniq

But that will be noisy. To get the most frequent barcodes (which are more likely to be real), which you can then use for demultiplexing:

zcat file.fastq.gz | sed -n '2~4p' | cut -c 1-12 | sort | uniq -c | sort -nr | head -100
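
If a downstream splitter needs a barcode table rather than a list of counts, the same pipeline can be wrapped in a small shell function. This is only a sketch: the function name top_barcodes and the BC1, BC2, ... labels are made up for illustration, and awk is used instead of the GNU-only sed '2~4p' so that it should also run with BSD tools.

```shell
# top_barcodes N LEN
# Reads FASTQ on stdin and prints the N most frequent LEN-base read prefixes
# as "name<TAB>sequence" lines (names BC1, BC2, ... are arbitrary labels).
top_barcodes() {
    n="$1"; len="$2"
    # Line 2 of every 4-line FASTQ record holds the nucleotide sequence.
    awk -v len="$len" 'NR % 4 == 2 { print substr($0, 1, len) }' |
        sort | uniq -c |
        sort -k1,1nr -k2,2 |      # most frequent first, ties broken by sequence
        head -n "$n" |
        awk '{ printf "BC%d\t%s\n", NR, $2 }'
}
```

For example: gzip -dc file.fastq.gz | top_barcodes 100 12 > barcodes.txt. The name-then-sequence layout is, as far as I know, what fastx_barcode_splitter.pl reads through --bcfile (with --bol for barcodes at the start of the read), but check the tool's help before relying on that.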

Hi Igor, your zcat solution appears to have done the trick. I used it to generate a barcode list, which I then used as the barcode input file for fastx_barcode_splitter (http://hannonlab.cshl.edu/fastx_toolkit/commandline.html). I now have demultiplexed FASTQ files. One strange thing, though: when I aligned the FASTQ files with bowtie2, 100% of the reads aligned >1 times. The reads that I'm using at the moment are very short (12 bases after trimming off barcodes/UMIs; they were designed with a different analysis pipeline in mind), so I wonder if that might have something to do with the strange alignments. Anyway, I think once I generate my own data I will be able to troubleshoot the issue better.

Reply written 14 months ago by alyssamolinaro91

It's not strange. It's very hard to get a unique alignment with 12 bases. 100% multi-mapping seems a bit excessive, but I wouldn't have expected less than 90%. I don't think you can expect a high fraction of unique alignments below 25 bases.

See earlier discussion here: Length Of Read Needed To Confidently Map Sequence
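
A back-of-envelope calculation illustrates the point. Under the crude assumption of a random genome with equal base frequencies, a read of length L is expected to match about 2 * G / 4^L positions purely by chance (G is the genome size; the factor 2 accounts for both strands). The function name expected_hits and the 3-gigabase genome size used below are assumptions for illustration only. Real genomes are repetitive, so the read length needed in practice is longer than this model suggests.

```shell
# expected_hits READ_LEN GENOME_SIZE
# Expected number of chance matches for one read in a random genome,
# counting both strands: 2 * G / 4^L. A rough model, not a mappability estimate.
expected_hits() {
    awk -v L="$1" -v G="$2" 'BEGIN { printf "%.1f\n", 2 * G / (4 ^ L) }'
}
```

With these assumptions, expected_hits 12 3000000000 prints 357.6, so a 12-base read is expected to hit hundreds of places, whereas at 25 bases the expectation drops far below one.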

Reply written 14 months ago by igor
Asaf (Israel) wrote 14 months ago (1 vote):

I once wrote a script that splits reads according to their barcodes; when it meets a new barcode that is not in the input table, it opens a new file for it. The script is at https://github.com/asafpr/RNAseq_scripts/blob/master/index_splitter.py. You should prepare an input table with a custom barcode of the same length as your expected barcodes and run it with -u. I hope it will work fine; I only tested it on NextSeq sequencing results.
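
The general idea of writing each barcode's reads to its own file can also be sketched in awk. This is not the linked index_splitter.py: the function name demux_by_prefix, the one-file-per-barcode naming, and the unmatched.fastq catch-all are assumptions made for illustration. It accepts only exact matches to a whitelist (for example, the most frequent barcodes), so reads with a sequencing error in the barcode end up in the unmatched file.

```shell
# demux_by_prefix WHITELIST LEN OUTDIR
# Reads FASTQ on stdin. Each read whose first LEN bases appear in WHITELIST
# (barcode in the last column of each line) is appended to OUTDIR/<barcode>.fastq
# with the barcode trimmed; all other reads go to OUTDIR/unmatched.fastq.
demux_by_prefix() {
    whitelist="$1"; len="$2"; outdir="$3"
    mkdir -p "$outdir"
    awk -v len="$len" -v outdir="$outdir" '
        # First input (the whitelist): remember the barcode in the last column.
        NR == FNR { keep[$NF] = 1; next }
        # Second input (FASTQ on stdin): buffer the four lines of each record.
        { rec[FNR % 4] = $0 }
        FNR % 4 == 0 {
            bc = substr(rec[2], 1, len)
            out = outdir "/" ((bc in keep) ? bc : "unmatched") ".fastq"
            # Trim the barcode from both the sequence and the quality line.
            printf("%s\n%s\n%s\n%s\n", rec[1], substr(rec[2], len + 1), rec[3], substr(rec[0], len + 1)) >> out
            close(out)   # keep the number of open files small with thousands of cells
        }
    ' "$whitelist" -
}
```

For example: gzip -dc file.fastq.gz | demux_by_prefix barcodes.txt 12 demux_out. Output is appended, so start with an empty output directory. Only the cell barcode is trimmed; a UMI that follows it stays at the start of each read for a later extraction step. Closing the file after every read is slow but avoids hitting the open-file limit.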

i.sudbery (Sheffield, UK) wrote 8 weeks ago (0 votes):

The new version of UMI-Tools now has mechanisms for dealing with data that has inline cell barcodes.
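
For the layout in the question (a 12-base cell barcode followed by an 8-base UMI), the pattern that UMI-tools takes via --bc-pattern uses one C per cell-barcode base and one N per UMI base. The helper below only builds that string; bc_pattern and the file names are hypothetical, and the commented umi_tools calls are a sketch that should be checked against umi_tools whitelist --help and umi_tools extract --help for the installed version.

```shell
# bc_pattern CELL_LEN UMI_LEN
# Prints a barcode pattern: CELL_LEN times "C" followed by UMI_LEN times "N".
bc_pattern() {
    awk -v c="$1" -v u="$2" 'BEGIN {
        for (i = 0; i < c; i++) s = s "C"
        for (i = 0; i < u; i++) s = s "N"
        print s
    }'
}

# Possible use (requires umi_tools; file names are placeholders):
# pattern=$(bc_pattern 12 8)
# umi_tools whitelist --stdin reads.fastq.gz --bc-pattern="$pattern" --log2stderr > whitelist.txt
# umi_tools extract --stdin reads.fastq.gz --bc-pattern="$pattern" --whitelist=whitelist.txt --stdout extracted.fastq.gz
```

The first command estimates which cell barcodes are likely to be real; the second moves the cell barcode and UMI from the read sequence into the read name for the barcodes on that list.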
