Proper fastq pre-processing for extraction of count matrix
2.1 years ago

I downloaded some fastq files from SRA for analysis and aligned them to my genome of interest using STAR, but now I'm not sure my process was correct. My goal is to get a count matrix and analyze the data in Seurat. I used STAR's quantMode to get a count matrix, but the output only has 3 columns, i.e. three cells. This is obviously not the desired outcome. Now I'm wondering whether I needed to split the fastqs into individual files for each cell/sample, or whether the cell information is retained in the alignment BAM file so that I can use featureCounts, HTSeq, or something else to create a correct count matrix. Edit: This is Fluidigm C1 data.

RNA-Seq sequence genome next-gen • 1.2k views

What single-cell protocol does it use?


They used Fluidigm C1.


Hm, I'm unfamiliar with the output of that protocol. You may want to find some papers that utilize it and see how they generate their counts.


You may need to use one of these scripts to do the analysis of Fluidigm data.


I think you're right. There is a "C1 mRNA Seq HT Demultiplex Script" available at https://www.fluidigm.com/software that I believe will do what I need.


david.f.stein : Please edit your original post and add in bold letters that this is Fluidigm data. It will have to be reprocessed properly before you can start doing counts.

2.1 years ago
newbio17 ▴ 350

Assuming the data is scRNA-Seq, you might want to check out one of the previous workshops by the Broad Institute (2019). It covers most of the basics of processing scRNA-Seq data, including some downstream analyses you might be interested in.

2.1 years ago

You can't just throw a single-cell fastq into STAR and expect it to figure out what all the cells are. It wasn't designed for that; it thinks you have bulk RNA-seq data. The three columns are not cells. They hold the counts for each gene under three strandedness assumptions: unstranded, stranded forward (read 1 on the same strand as the RNA), and stranded reverse (read 2 on the same strand as the RNA). You pick the one column that matches your library prep.
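
To make the column layout concrete, here is a minimal Python sketch that reads one ReadsPerGene.out.tab file (the output of STAR's --quantMode GeneCounts), keeps a single count column, and merges several such files into a gene-by-sample table. The function names are made up for illustration. The merge step only makes sense if each cell was aligned as its own fastq, which may or may not hold for this dataset.

```python
import csv

# Column index in ReadsPerGene.out.tab for each library type.
# Column 0 holds the gene ID.
STRAND_COLUMN = {"unstranded": 1, "forward": 2, "reverse": 3}

# The first four rows of the file are summary lines, not genes.
SUMMARY_ROWS = {"N_unmapped", "N_multimapping", "N_noFeature", "N_ambiguous"}


def read_star_gene_counts(path, strandedness="unstranded"):
    """Return {gene_id: count} from one STAR ReadsPerGene.out.tab file."""
    col = STRAND_COLUMN[strandedness]
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0] in SUMMARY_ROWS:
                continue
            counts[row[0]] = int(row[col])
    return counts


def merge_counts(per_sample):
    """Combine {sample: {gene: count}} into (genes, samples, matrix).

    matrix[i][j] is the count of genes[i] in samples[j]; genes missing
    from a sample are filled with 0.
    """
    samples = sorted(per_sample)
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    matrix = [[per_sample[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, matrix
```

If the library turns out to be unstranded, the default column is the one to use; otherwise pass "forward" or "reverse" as the second argument.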


Thanks for your feedback. As I think I intimated in the question, I suspected I was doing something wrong. Do you have any suggestions on how to do this properly?


If you use STAR, then use its STARsolo module ( https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md ), which is intended for single-cell work. I cannot help any further, though, since I have not worked with Fluidigm data so far. Don't the authors provide a count matrix at GEO? This is typically the case.


I didn't know count matrices were usually provided on GEO. This is the GEO accession https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE106269. Is "Series Matrix File(s)" the count matrix?


No, it should be one of the files listed when you click Custom next to GSE106269_RAW.tar at the very bottom of the page. This section typically contains uploaded data such as count matrices. Check the paper to see whether it says exactly what these uploaded files are.
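
If you download the whole archive instead of picking files through Custom, a quick way to see what is inside is to list its members. Below is a minimal sketch using Python's standard tarfile module. The local file name in the comment is an assumption (wherever you saved the download), and the helper name is made up for illustration.

```python
import tarfile


def list_tar_members(tar_path):
    """Return (name, size_in_bytes) for every regular file in a tar archive."""
    with tarfile.open(tar_path) as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


# Example usage, assuming the archive was saved under this name:
# for name, size in list_tar_members("GSE106269_RAW.tar"):
#     print(f"{size:>12}  {name}")
```

The file names alone often hint at whether the uploads are count tables or something else, but the paper or the GEO sample descriptions are the place to confirm it.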