Question: Dealing with 1000's of demultiplexed fastq files
0
gravatar for volvicpellegrino
10 months ago by
volvicpellegrino0 wrote:

I am trying to process single cell data from a public dataset, containing over 6000 single-end 51 bp fastq files. Each fastq file represents a single cell and contains a 9-11 bp UMI, leaving 40 bp for mapping.

I have used UMI-tools to extract the UMI sequence from every read in every file.

Is there an efficient way to handle such a large number of files for STAR or kallisto? It feels as though this process would be faster if I had fewer, barcoded, larger fastq files?

umi-tools star zumis • 716 views
ADD COMMENTlink modified 8 months ago by Lior Pachter480 • written 10 months ago by volvicpellegrino0

Individual files you have are what size? Keeping the genome index in memory, using multiple threads you may be able to wade through these quicker than you imagine.

One could always fire up 6000 VM's on cloud and be done. That is if money is no object :-)

ADD REPLYlink modified 10 months ago • written 10 months ago by genomax78k

Thanks for everyone's help! In the end, I tried both approaches: running STAR mapping jobs in parallel (worked, very fast), however I am now going down the route of merging the fastq files together with artificial barcodes inserted, as I would like to try the zUMIs package, which only handles 4 fastq files at a time. I used the Illumina list of barcodes to insert a unique barcode into the start of every read in every file.

ADD REPLYlink written 10 months ago by volvicpellegrino0
4
gravatar for i.sudbery
10 months ago by
i.sudbery7.0k
Sheffield, UK
i.sudbery7.0k wrote:

Your sense that this would be easier to deal with if these files were in a single not demulplexed, barcoded fastq file is correct. However, given that you are where you are, really I can see only 2 ways forward:

Either map each of the 6000 files indepenently and then to read counting on each file independently....

Or manipulate the fastq files to add an artificial barcode to each one, and then cat them together, and then map and count them as a single multiplexed file.

ADD COMMENTlink written 10 months ago by i.sudbery7.0k

If you are going to go ahead and map the files separately (which I guess is least hacky thing to do), I definitely recommend mapping to the transcriptome rather than the genome. This will mean you don't have to go to the bother of assigning reads to genes. It will also speed up downstream deduping/counting operations. It has some problems, but in this case I think they are outweighed by the benefits (see list of problems here).

In this case, I probably recommend generateing as your transcriptome, the collapsed transcriptome described above. The idea is to collapse all transcripts for a gene into a single "superTranscript" (similar to described by Davidson et al) and map to that.

There are probably many ways to do this, but I do it with 'cgat-apps` :

$ wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.annotation.gtf.gz
$   cgat gtf2gtf --method=merge-exons -I gencode.v26.annotation.gtf.gz \
  | cgat gtf2gtf --method=set-transcript-to-gene -S hg38_genocode26_merged.gtf.gz

You can then convert the gtf to fasta using your favorite converter (e.g. cgat gff2fasta), index with STAR and map to that. You can then use umi_tools dedup and get the per-gene counts by running samtools idxstat on the resulting BAM file.

If you script it with a workflow manager, (e.g. snakemake or cgat-core) it probably wouldn't be the end of the world in terms of run time.

ADD REPLYlink modified 10 months ago • written 10 months ago by i.sudbery7.0k
2
gravatar for WouterDeCoster
10 months ago by
Belgium
WouterDeCoster43k wrote:

Note that with STAR you can load the index in memory and then share it by multiple alignment processes. That's definitely going to speed up your processing. It's described in the manual,

ADD COMMENTlink written 10 months ago by WouterDeCoster43k
2
gravatar for Lior Pachter
8 months ago by
Lior Pachter480
United States
Lior Pachter480 wrote:

The kallisto | bustools workflow will allow you to specify the barcode / UMI /cDNA read structure and should be helpful for your analysis.

ADD COMMENTlink written 8 months ago by Lior Pachter480
0
gravatar for Charles Warden
10 months ago by
Charles Warden7.6k
Duarte, CA
Charles Warden7.6k wrote:

I believe STAR has added some scRNA-Seq functions (STARsolo), but I have not tested them myself: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf (currently, STARsolo section is pages #18-19).

Otherwise, I would probably use cellranger for alignment and quantification (but likely use additional filtering for different rounds of analysis).

ADD COMMENTlink modified 10 months ago • written 10 months ago by Charles Warden7.6k
1

Cellranger uses STAR to align under the hood.

ADD REPLYlink modified 10 months ago • written 10 months ago by genomax78k

That's true.

I have to admit that I'm not exactly sure when you would want to use STARsolo instead of cellranger. I see that they have some parameters for UMIs: perhaps this helps avoid making some intermediate files?

ADD REPLYlink written 10 months ago by Charles Warden7.6k
1

See the alevin paper - cellranger is really slow, and uses LOTs of memory. The we found that its results were very similar to a naive pipeline using umi_tools and STAR but took longer and used more memory. Of course alevin is even quicker, uses even less memory, and was more accurate (as far as we could tell - such things are not easy to figure out).

STARsolo will also allow for some parameterisation of the UMIs - you can specify how many bases are CB and how many UMI, but I think they have to come in that order and at the start of read1.

ADD REPLYlink written 10 months ago by i.sudbery7.0k

Thank you for the additional information!

ADD REPLYlink written 10 months ago by Charles Warden7.6k
1

I'm pretty sure you can only use cellranger with 10x data. This definately doesn't sound like 10x data.

ADD REPLYlink written 10 months ago by i.sudbery7.0k

That is a good point - 10X provides a way to do basecalling without mkfastq (with regular bcl2fastq), but I think the other functions would be more complicated to modify. If you have separate files per cell, then that wouldn't be the raw 10X data (and I think the most raw form of data should be deposited into the SRA).

However, perhaps providing the public accession to confirm would be helpful?

ADD REPLYlink modified 10 months ago • written 10 months ago by Charles Warden7.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1329 users visited in the last hour