Question

scRNA-seq data processing from 10X device

6.0 years ago

nikitavlassenko ▴ 110

I know how to do bulk RNA-seq analysis: first we merge the 'fastq.gz' files, then do the alignment, etc. With scRNA-seq, however, I am completely confused. There are no 'fastq' files in the first place: I searched manually and then with Linux's 'find' command for '.fq.gz' and 'fastq.gz' files. In the 'Data' folder, where I suppose the data should be, I have many '.bcl.bgzf' files, each around 60 MB, and 4 '.locs' files of 1.1 GB each. The top folder contains the following folders: 'Data', 'Config', 'Images', 'InterOp', 'Logs', 'Queued', 'Recipe', 'RTALogs', 'thumbnail_Images'. Are there any tutorials on how to process this kind of data? I know there are workflows, e.g.:

https://f1000research.com/articles/4-1070/v2

But they do not cover the initial preparation of the fastq files that their tutorial starts from, and I suppose I need to prepare them by merging the data that I got.

Any suggestions would be greatly appreciated.

Update

Here is what the top folder looks like:

I do not have a sample sheet file, and definitely no .csv file with sample info was provided, so I do not know which file to feed to the suggested software packages (cellranger, etc.). Also, I have 4 folders with '.bcl.bgzf' and '.bcl.bgzf.bci' data files, and I do not know whether I should point to all of the folders at once or process them separately and then combine the results.

rna-seq 10x genomics scRNA-seq • 4.6k views

ADD COMMENT • link updated 6.0 years ago by GenoMax 141k • written 6.0 years ago by nikitavlassenko ▴ 110

If this is a one-time deal, you should ask your service provider to do the analysis for you. I am not sure it is worth doing this yourself, especially if you don't have IT expertise available.

That said, the relevant software and protocols are available on the 10x Genomics website.

ADD REPLY • link 6.0 years ago by GenoMax 141k

This is definitely not a one-time deal and I will need to analyze many-many samples myself. I am a computer scientist, just with next to no experience in bioinformatics.

ADD REPLY • link 6.0 years ago by nikitavlassenko ▴ 110

We are here to help, but it will take some patience. Once you get everything organized and complete one run, it becomes a matter of starting a script and waiting for it to finish.

ADD REPLY • link 6.0 years ago by GenoMax 141k

6.0 years ago

Devon Ryan 104k

The Cell Ranger pipeline from 10X is meant for dealing with bcl files. Note that these are the files produced by the sequencer and are typically handled by your sequencing facility, so you'll end up needing to have bcl2fastq installed as well as STAR and other programs. Further information is available on 10X's website.

As an aside, it's possible to demultiplex 10X data as normal and then feed it through Cell Ranger. This is covered in the Cell Ranger documentation and is, in my opinion, more compatible with how sequencing facilities tend to produce data (they then don't need to handle 10X data differently). It also allows pooling 10X and other samples on the same lanes.

ADD COMMENT • link 6.0 years ago by Devon Ryan 104k

The first index has to be saved as a separate file (and the second ignored, if it was sequenced), which is a deviation from the standard bcl2fastq protocol.

ADD REPLY • link 6.0 years ago by GenoMax 141k

Which index and how to save it?

ADD REPLY • link 6.0 years ago by nikitavlassenko ▴ 110

If you choose to use Illumina bcl2fastq to do the demultiplexing, follow the directions here. The cellranger mkfastq protocol is described on this page.

ADD REPLY • link 6.0 years ago by GenoMax 141k

cellranger's specifications are not enough. First, I do not have a sample sheet file; there is definitely no .csv file provided. Second, I have 4 folders (L001, L002, L003, L004 inside the 'BaseCalls' folder) with '.bcl.bgzf' and '.bcl.bgzf.bci' files, and I am not sure whether I should point to all of them at once or treat them separately and then combine them, and if so, how. In the top folder I have only two files that I would guess are related to sample info, 'RunInfo.xml' and 'RunParameters.xml', but they contain no gene sequences and no indexes, just a bunch of numbers.

ADD REPLY • link 6.0 years ago by nikitavlassenko ▴ 110

See my answer below for details on how to do this.

ADD REPLY • link 6.0 years ago by GenoMax 141k

It doesn't actually. You end up not needing to save either index to a file, though you might have to merge some stuff (or rather, tell cellranger to do that).

ADD REPLY • link 6.0 years ago by Devon Ryan 104k

score 3 · Accepted Answer · 2018-04-05

As long as you have received the entire run folder from your sequencing provider, you do not need to do anything with individual files or folders. Just leave the folder structure intact without moving any files. The cellranger software knows this folder structure and will work with it seamlessly.
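Before running anything, it can help to confirm that the run folder really is complete. The following is only a rough sketch in Python: the helper name `summarize_run_folder` is made up for illustration, and it assumes the usual Illumina layout in which the lane folders (L001, L002, ...) sit under `Data/Intensities/BaseCalls`. Adjust the path if your folder differs.

```python
from pathlib import Path


def summarize_run_folder(run_dir):
    """Report whether the key pieces of an Illumina run folder are present.

    Assumption: lane folders live under Data/Intensities/BaseCalls.
    """
    run_dir = Path(run_dir)
    basecalls = run_dir / "Data" / "Intensities" / "BaseCalls"
    lanes = []
    if basecalls.is_dir():
        # Lane folders are named L001, L002, ... (one per flow cell lane).
        lanes = sorted(
            p.name for p in basecalls.glob("L[0-9][0-9][0-9]") if p.is_dir()
        )
    return {
        "has_run_info": (run_dir / "RunInfo.xml").is_file(),
        "has_basecalls": basecalls.is_dir(),
        "lanes": lanes,
    }
```

Calling `summarize_run_folder("/path/to/illumina_data_folder")` returns a small dictionary. All four lane folders are listed together because the whole run folder is passed to cellranger as one unit, not lane by lane.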

There is a sample sheet generator program provided by 10x on the page I had linked. Once you use it you will end up with a file that looks like the one below (you can also make one up yourself based on this example). It needs to be in .csv format. The [Reads] section may need to be changed depending on the number of cycles run. This information is in the RunInfo.xml file you have seen in the Illumina folder. Let us know if Read 1 is not 26 cycles or Read 2 is not 98 cycles in your RunInfo.xml file.

[Header]
EMFileVersion,4

[Reads]
26
98

[Data]
Lane,Sample_ID,Sample_Name,index,Sample_Project
1,SI-GA-A1_1,Sample_1,GGTTTACT,Chromium_20180405
1,SI-GA-A1_2,Sample_1,CTAAACGG,Chromium_20180405
1,SI-GA-A1_3,Sample_1,TCGGCGTC,Chromium_20180405
1,SI-GA-A1_4,Sample_1,AACCGTAA,Chromium_20180405
1,SI-GA-B1_1,Sample_2,GTAATCTT,Chromium_20180405
1,SI-GA-B1_2,Sample_2,TCCGGAAG,Chromium_20180405
1,SI-GA-B1_3,Sample_2,AGTTCGGC,Chromium_20180405
1,SI-GA-B1_4,Sample_2,CAGCATCA,Chromium_20180405
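To check the cycle counts for the [Reads] section, the numbers can be pulled out of RunInfo.xml with a few lines of Python. This is a minimal sketch, assuming the file lists its reads as `<Read Number="..." NumCycles="..." IsIndexedRead="..."/>` elements; the `EXAMPLE` text is a hypothetical stand-in, not taken from a real run. Note that RunInfo.xml numbers the index read too, so the second insert read may appear there as Read 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical RunInfo.xml content, for illustration only.
EXAMPLE = """<RunInfo>
  <Run Id="hypothetical_run" Number="1">
    <Reads>
      <Read Number="1" NumCycles="26" IsIndexedRead="N" />
      <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="98" IsIndexedRead="N" />
    </Reads>
  </Run>
</RunInfo>"""


def read_cycles(run_info_xml):
    """Return sorted (read number, cycles, is_index) tuples from RunInfo.xml text."""
    root = ET.fromstring(run_info_xml)
    reads = []
    for read in root.iter("Read"):
        reads.append(
            (
                int(read.get("Number")),
                int(read.get("NumCycles")),
                read.get("IsIndexedRead") == "Y",
            )
        )
    return sorted(reads)


def insert_read_lengths(run_info_xml):
    """Cycle counts of the non-index reads, i.e. the values for [Reads]."""
    return [cycles for _, cycles, is_index in read_cycles(run_info_xml) if not is_index]
```

For a real run you would pass the file contents, e.g. `insert_read_lengths(open("RunInfo.xml").read())`, and compare the result with the two numbers in the [Reads] section.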

You do need to know the Sample_ID and the associated 10x index code. The code will look like this: SI-GA-B1. Internally each code is a mix of 4 index sequences, so you will see 4 entries per sample (see the example above). You also need to know which kit was used to make the 10x libraries, since more than one kit is available. You can get this information from whoever made the libraries.
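Since every sample expands into four rows, typing the [Data] section by hand gets tedious for many samples. Here is a small sketch of how the rows could be generated. The function names are hypothetical, and the `INDEX_SETS` table holds only the two index sets shown in the example above; any other set would have to be looked up from 10x.

```python
import csv
import io

# Oligo sequences copied from the example sample sheet in this answer.
# Other index sets are NOT included here and must be added by the user.
INDEX_SETS = {
    "SI-GA-A1": ["GGTTTACT", "CTAAACGG", "TCGGCGTC", "AACCGTAA"],
    "SI-GA-B1": ["GTAATCTT", "TCCGGAAG", "AGTTCGGC", "CAGCATCA"],
}


def data_rows(samples, project, lane=1):
    """Build the [Data] rows: one header row plus four rows per sample.

    samples is a list of (sample_name, index_set_code) pairs.
    """
    rows = [["Lane", "Sample_ID", "Sample_Name", "index", "Sample_Project"]]
    for sample_name, index_set in samples:
        for i, oligo in enumerate(INDEX_SETS[index_set], start=1):
            rows.append([str(lane), f"{index_set}_{i}", sample_name, oligo, project])
    return rows


def to_csv(rows):
    """Render the rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

With `data_rows([("Sample_1", "SI-GA-A1"), ("Sample_2", "SI-GA-B1")], "Chromium_20180405")` this reproduces the [Data] section of the example; the [Header] and [Reads] sections still need to be written in front of it.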

You will need to have Illumina bcl2fastq v.2.20 software installed and available in $PATH before running cellranger. Once you have the samplesheet ready use cellranger to demultiplex the data. That command would look something like this:

cellranger mkfastq --id=my_id \
    --run=/path/to/illumina_data_folder \
    --csv=samplesheet.csv

After the run there will be a folder named my_id (substitute as needed) inside the Illumina data folder, or in a different location if you specify one with the --output-dir option in addition to the options above. The demultiplexed data will be in there. If you are planning to use a cluster, there is additional configuration needed for cellranger. But more on that later.
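If you end up processing many runs, the same call can be wrapped in a script. The sketch below only assembles the command from the options mentioned in this answer and checks that the required programs are on $PATH. The function names are hypothetical and nothing is executed until you pass the list to `subprocess.run` yourself.

```python
import shutil


def missing_tools(tools=("bcl2fastq", "cellranger")):
    """Return the names of required programs that are not found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def mkfastq_command(run_id, run_dir, samplesheet, output_dir=None):
    """Assemble the cellranger mkfastq call as an argument list.

    Only the options shown in this answer are used (--id, --run, --csv,
    and optionally --output-dir).
    """
    cmd = [
        "cellranger",
        "mkfastq",
        f"--id={run_id}",
        f"--run={run_dir}",
        f"--csv={samplesheet}",
    ]
    if output_dir is not None:
        cmd.append(f"--output-dir={output_dir}")
    return cmd
```

A typical use would be to stop early if `missing_tools()` returns anything, and otherwise run `subprocess.run(mkfastq_command("my_id", "/path/to/illumina_data_folder", "samplesheet.csv"), check=True)`. Building the command as a list avoids shell quoting problems with paths that contain spaces.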