Question

Dealing with 1000's of demultiplexed fastq files

0

Entering edit mode

5.0 years ago

volvicpellegrino ▴ 40

I am trying to process single cell data from a public dataset, containing over 6000 single-end 51 bp fastq files. Each fastq file represents a single cell and contains a 9-11 bp UMI, leaving 40 bp for mapping.

I have used UMI-tools to extract the UMI sequence from every read in every file.

Is there an efficient way to handle such a large number of files for STAR or kallisto? It feels as though this process would be faster if I had fewer, barcoded, larger fastq files?

STAR zUMIs UMI-tools • 3.3k views

ADD COMMENT • link updated 4.8 years ago by Lior Pachter ▴ 700 • written 5.0 years ago by volvicpellegrino ▴ 40

0

Entering edit mode

Individual files you have are what size? Keeping the genome index in memory, using multiple threads you may be able to wade through these quicker than you imagine.

One could always fire up 6000 VM's on cloud and be done. That is if money is no object :-)

ADD REPLY • link 5.0 years ago by GenoMax 141k

0

Entering edit mode

Thanks for everyone's help! In the end, I tried both approaches: running STAR mapping jobs in parallel (worked, very fast), however I am now going down the route of merging the fastq files together with artificial barcodes inserted, as I would like to try the zUMIs package, which only handles 4 fastq files at a time. I used the Illumina list of barcodes to insert a unique barcode into the start of every read in every file.

ADD REPLY • link 5.0 years ago by volvicpellegrino ▴ 40

score 4 · Answer 1 · 2019-04-11

4

Entering edit mode

5.0 years ago

i.sudbery 19k

Your sense that this would be easier to deal with if these files were in a single not demulplexed, barcoded fastq file is correct. However, given that you are where you are, really I can see only 2 ways forward:

Either map each of the 6000 files indepenently and then to read counting on each file independently....

Or manipulate the fastq files to add an artificial barcode to each one, and then cat them together, and then map and count them as a single multiplexed file.

ADD COMMENT • link 5.0 years ago by i.sudbery 19k

0

Entering edit mode

If you are going to go ahead and map the files separately (which I guess is least hacky thing to do), I definitely recommend mapping to the transcriptome rather than the genome. This will mean you don't have to go to the bother of assigning reads to genes. It will also speed up downstream deduping/counting operations. It has some problems, but in this case I think they are outweighed by the benefits (see list of problems here).

In this case, I probably recommend generateing as your transcriptome, the collapsed transcriptome described above. The idea is to collapse all transcripts for a gene into a single "superTranscript" (similar to described by Davidson et al) and map to that.

There are probably many ways to do this, but I do it with 'cgat-apps` :

$ wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.annotation.gtf.gz
$   cgat gtf2gtf --method=merge-exons -I gencode.v26.annotation.gtf.gz \
  | cgat gtf2gtf --method=set-transcript-to-gene -S hg38_genocode26_merged.gtf.gz

You can then convert the gtf to fasta using your favorite converter (e.g. cgat gff2fasta), index with STAR and map to that. You can then use umi_tools dedup and get the per-gene counts by running samtools idxstat on the resulting BAM file.

If you script it with a workflow manager, (e.g. snakemake or cgat-core) it probably wouldn't be the end of the world in terms of run time.

ADD REPLY • link 5.0 years ago by i.sudbery 19k

score 2 · Answer 2 · 2019-04-11

2

Entering edit mode

5.0 years ago

WouterDeCoster 47k

Note that with STAR you can load the index in memory and then share it by multiple alignment processes. That's definitely going to speed up your processing. It's described in the manual,

ADD COMMENT • link 5.0 years ago by WouterDeCoster 47k

score 2 · Answer 3 · 2019-06-28

2

Entering edit mode

4.8 years ago

Lior Pachter ▴ 700

The kallisto | bustools workflow will allow you to specify the barcode / UMI /cDNA read structure and should be helpful for your analysis.

ADD COMMENT • link 4.8 years ago by Lior Pachter ▴ 700

score 0 · Answer 4 · 2019-04-11

0

Entering edit mode

5.0 years ago

Charles Warden 8.2k

I believe STAR has added some scRNA-Seq functions (STARsolo), but I have not tested them myself: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf (currently, STARsolo section is pages #18-19).

Otherwise, I would probably use cellranger for alignment and quantification (but likely use additional filtering for different rounds of analysis).

ADD COMMENT • link 5.0 years ago by Charles Warden 8.2k

1

Entering edit mode

Cellranger uses STAR to align under the hood.

ADD REPLY • link 5.0 years ago by GenoMax 141k

0

Entering edit mode

That's true.

I have to admit that I'm not exactly sure when you would want to use STARsolo instead of cellranger. I see that they have some parameters for UMIs: perhaps this helps avoid making some intermediate files?

ADD REPLY • link 5.0 years ago by Charles Warden 8.2k

1

Entering edit mode

See the alevin paper - cellranger is really slow, and uses LOTs of memory. The we found that its results were very similar to a naive pipeline using umi_tools and STAR but took longer and used more memory. Of course alevin is even quicker, uses even less memory, and was more accurate (as far as we could tell - such things are not easy to figure out).

STARsolo will also allow for some parameterisation of the UMIs - you can specify how many bases are CB and how many UMI, but I think they have to come in that order and at the start of read1.

ADD REPLY • link 5.0 years ago by i.sudbery 19k

0

Entering edit mode

Thank you for the additional information!

ADD REPLY • link 5.0 years ago by Charles Warden 8.2k

1

Entering edit mode

I'm pretty sure you can only use cellranger with 10x data. This definately doesn't sound like 10x data.

ADD REPLY • link 5.0 years ago by i.sudbery 19k

0

Entering edit mode

That is a good point - 10X provides a way to do basecalling without mkfastq (with regular bcl2fastq), but I think the other functions would be more complicated to modify. If you have separate files per cell, then that wouldn't be the raw 10X data (and I think the most raw form of data should be deposited into the SRA).

However, perhaps providing the public accession to confirm would be helpful?

ADD REPLY • link 5.0 years ago by Charles Warden 8.2k