Question

Confusion on downloading SRA run data

22 months ago

crx6xw ▴ 10

Hello,

Please bear with me, as I am relatively inexperienced with single-cell RNA-seq data and with downloading data from the SRA Run Selector.

I am interested in downloading the mutant EGFR and mutant KRAS patient data from this article, under BioProject accession number PRJNA591860. Ideally, I hope to download the data for these patients, convert them into Seurat objects, merge them based on condition (in this case mEGFR vs mKRAS), integrate the samples, and visualize DE between the two groups.

Now, from my understanding, because I'm not interested in the entire dataset (only in mutant EGFR/KRAS), I should download the read data individually and run it through the workflow to generate individual count matrices and so on. I found the raw reads on SRA under accession SRP238929. Based on the patient demographics the authors uploaded, each sample name in the spreadsheet should correspond to an isolate number on the SRA download page.

Each isolate has around 50 files associated with its patient. So my question is: what does this mean? Does it mean there are 50 reads associated with that patient? Do I need to download all of these files? If I'm demultiplexing with bcl2fastq2 and aligning with STAR, how would that work?

Is there also an easier way of getting this data into the workflow (i.e., downloading the count matrices themselves as opposed to remaking them)? I couldn't find them anywhere.

Thanks.

scrnaseq runselector sra • 1.1k views

ADD COMMENT • link updated 22 months ago by vanessagpds ▴ 10 • written 22 months ago by crx6xw ▴ 10

Dear crx6xw,

I have the same problem with this dataset. Additionally, when I try to process the samples with cellranger, I get a series of errors.

Have you tried to contact the authors to obtain the data already processed?

ADD REPLY • link 22 months ago by vanessagpds ▴ 10

Nope, I simply just didn't use this dataset. I think it's probably better to just download the count matrices directly instead of processing the raw reads first.

ADD REPLY • link 22 months ago by crx6xw ▴ 10

Hi, I hope this helps.

In the article (https://www.sciencedirect.com/science/article/pii/S0092867420308825?via%3Dihub), the authors state that all code used to generate the results of this study is available on GitHub at czbiohub/scell_lung_adenocarcinoma.

The repository gives these instructions: "Clone the repo. Download the Data_input folder from the link below into the repo": https://drive.google.com/drive/folders/1sDzO0WOD4rnGC7QfTKwdcQTx3L36PFwX?usp=sharing

In this drive, there is a folder with the single-cell counts and the metadata. The files are named S01_datafinal.csv (counts) and S01_metadata.csv (metadata) (link to access: https://drive.google.com/drive/folders/1VmPan5V19Hq--fMnFHOta67TIpO60srE).

Then you can keep only the samples of interest, using the information available in NCBI BioProject PRJNA591860. For example, I needed all the samples from patient TH226, so I looked up in SRA which samples belong to this patient. I then filtered the S01_datafinal.csv file down to those samples using an R script.
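The filtering step described above could be sketched as follows. This is only an illustration in Python (the original filtering was done in R, and that script is not shown here). It assumes the counts file has genes as rows and cell IDs as column headers, and that the metadata file has one row per cell. The column names `cell_id` and `patient_id` are hypothetical placeholders, so check the real headers of S01_metadata.csv before using it. The toy data at the bottom stands in for the real files.

```python
import csv
import io


def cells_for_patient(metadata_file, patient, id_col="cell_id", patient_col="patient_id"):
    """Return the set of cell IDs whose metadata row belongs to `patient`.

    `id_col` and `patient_col` are hypothetical column names; adjust them
    to match the actual metadata header.
    """
    reader = csv.DictReader(metadata_file)
    return {row[id_col] for row in reader if row[patient_col] == patient}


def filter_counts(counts_file, out_file, keep_cells):
    """Copy the counts table, keeping the first (gene) column plus every
    column whose header is in `keep_cells`. Returns the number of cells kept.

    Assumes a genes-by-cells layout with cell IDs in the header row.
    """
    reader = csv.reader(counts_file)
    writer = csv.writer(out_file, lineterminator="\n")
    header = next(reader)
    keep_idx = [0] + [i for i, name in enumerate(header) if i > 0 and name in keep_cells]
    writer.writerow([header[i] for i in keep_idx])
    # Stream row by row so a large matrix never has to fit in memory.
    for row in reader:
        writer.writerow([row[i] for i in keep_idx])
    return len(keep_idx) - 1


# Toy data standing in for S01_metadata.csv and S01_datafinal.csv.
toy_metadata = io.StringIO("cell_id,patient_id\nc1,TH226\nc2,TH100\nc3,TH226\n")
toy_counts = io.StringIO("gene,c1,c2,c3\nEGFR,5,0,2\nKRAS,1,7,0\n")
filtered = io.StringIO()

keep = cells_for_patient(toy_metadata, "TH226")
n_kept = filter_counts(toy_counts, filtered, keep)
print(n_kept, "cells kept")
print(filtered.getvalue())
```

With the real files, the same functions can be given handles from `open("S01_metadata.csv", newline="")` and `open("S01_datafinal.csv", newline="")`, plus an output file opened for writing.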

Let me know if anything is unclear.

ADD REPLY • link 22 months ago by vanessagpds ▴ 10

score 1 · Answer 1 · 2022-06-21

Lots to unpack here. I'll address the last question.

Is there also an easier way of getting this data into the workflow (ie downloading count matrices themselves as opposed to remaking them, couldn't find it anywhere)

The short answer is: if you go to the GEO accession number provided in the paper, you will see the individual Samples listed. Once you click on a single sample, you can download its *exp.txt.gz file. According to the details under "data processing", those are "expression matrix file using RNA raw count". See this entry, for example: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4453576
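If you need the files for many samples, clicking through each page gets tedious. As a rough sketch, the helper below builds the URL of a sample's supplementary-file directory on the GEO FTP site. It assumes GEO's usual layout, in which the last three digits of the accession are replaced by "nnn" to form the parent directory; verify this against one sample page before relying on it. The function name is made up for illustration, and the exact name of each *exp.txt.gz file still has to be read from the directory listing.

```python
import re

# Base of the GEO sample tree (assumed layout, see the note above).
GEO_SAMPLES_ROOT = "https://ftp.ncbi.nlm.nih.gov/geo/samples"


def geo_sample_suppl_url(gsm):
    """Return the supplementary-file directory URL for a GSM accession.

    Hypothetical helper: it only builds the URL and downloads nothing.
    """
    if not re.fullmatch(r"GSM\d+", gsm):
        raise ValueError(f"not a GSM accession: {gsm!r}")
    digits = gsm[3:]
    # Drop the last three digits and append 'nnn' to get the parent directory.
    stub = "GSM" + digits[:-3] + "nnn"
    return f"{GEO_SAMPLES_ROOT}/{stub}/{gsm}/suppl/"


print(geo_sample_suppl_url("GSM4453576"))
```

The resulting directory can then be opened in a browser or listed with any HTTP client to find the actual file names.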

The detail that's eluding me at the moment is how to map the demographic data to the GEO identifiers -- maybe reaching out to the authors might be the fastest and safest way to get that information.

Some context: SRA is the Sequence Read Archive, which focuses on storing the actual raw reads, while GEO is one of several repositories that also emphasize metadata and processed data files. If you're looking for processed data such as count matrices, you won't find it at SRA, but GEO is usually a good bet.

In fact, looking at the description of the project, it seems unlikely that SRA will have much for that project, since "Submitter states that raw data are not available for this Series due to patient privacy concerns and human genetic resources policy in China." The "raw data" in this case seems to be BAM files (aligned reads), which would be silly, too, since those contain just as much sequence information as FASTQ files (unless they've been run through a privacy-protecting tool).