Question: Downloading single cell data from NCBI
Asked 8 weeks ago by V (100), UK/London:

Hello,

I am trying to download a single-cell RNA-seq run from NCBI. The run is SRR7898910, also found here: Link

I tried to download the data the NCBI-recommended way: fetching the SRA file with prefetch and then running the following command in the terminal:

`fastq-dump --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --split-2 --clip SRR7898910`

My issue is that this creates a single 16 GB fastq file, so I assume it holds the sequences of all the cells. Does anyone know how to get two fastq files (forward and reverse) for each cell in the study? That seems like the logical first step, as I want to realign and analyse them in-house. If that is not possible, I would like a BAM file for each cell for counting.

Thanks for any recommendations you may have. Best wishes

written 8 weeks ago by V (100)

Are you sure there are two reads? The NCBI record seems to indicate there is only one read. Of course, that information has been wrong at times.

ENA seems to offer only the BAM file submitted by the submitters. It appears to be a uBAM (unaligned BAM). You can download it and convert it to fastq file(s) with samtools fastq. Note: this BAM file is ~33G.

Edit: It looks like the BAM file has read groups (RG). There is more than one lane of data. Not sure if there are paired-end reads.

K00135:310:HW73WBBXX:1:1127:22698:20111 16      1       3201566 255     115M    *       0       0       CTTTGTGTCTGTGTCTTTCATTTGCCTATGAAAAGAATGTTAGTTGGCTGTAACCATAAAATTGGCAGTTGTATTTACAAATAATCACATATATGATGCTTACAGAATGATGGGC        AJFFA<7-77JJJFJFFJJJJAFJAJAJJJJJJJJJJFJJAAJA-JFJJJJJFFJFAF<<AF<JJJJJJFFFFF-<JFFJJ<FJFJJJFJFF7JAAJJFJ<JJJJJJJJJFFFAA     NH:i:1  HI:i:1  AS:i:113        nM:i:0  RE:A:I     BC:Z:CTCGCGTA   QT:Z:AAFAFJJJ   CR:Z:TGAAAGATCCATGAAC   CY:Z:AAFFFJJJJJJJJJJJ   CB:Z:TGAAAGATCCATGAAC-1 UR:Z:CCGTCCTATG UY:Z:JJJJJJJJJJ UB:Z:CCGTCCTATG RG:Z:7wo_B6N_males_Pool2_mm10v7b:MissingLibrary:1:HW73WBBXX:1
K00135:310:HW73WBBXX:2:2110:8907:7257   16      1       3201567 255     115M    *       0       0       TTTGTGTCTGTGTCTTTCATTTGCCTATGAAAAGAATGTTAGTTGGCTGTAACCATAAAATTGGCAGTTGTATTTACAAATAATCACATATATGATGCTTACAGAATGATGGGCT        AA7F<A77-A-FFFA-F<F7--AA<AJ<<FJJA<JF-FF<JF7JAF<FFAA-7F<FF-FAJ<<J<FFA<AJF<JJJJJJJJJA<<FF<<AFA-FAAF<-7FAJJJF7777FF<<A     NH:i:1  HI:i:1  AS:i:113        nM:i:0  RE:A:I     BC:Z:AAATGTGC   QT:Z:AAAA-F-F   CR:Z:TGCTACCAGTACGATA   CY:Z:AAFFFJJJJFJJJFJF   CB:Z:TGCTACCAGTACGATA-1 UR:Z:GACCCTCAGG UY:Z:FFJJJJJJFJ UB:Z:GACCCTCAGG RG:Z:7wo_B6N_males_Pool2_mm10v7b:MissingLibrary:1:HW73WBBXX:2
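
One way to approach the paired-end question from the records themselves is to decode the FLAG column (the second field). The following is only a minimal sketch using the standard SAM flag bits; the function name `decode_flag` is made up for illustration.

```python
# Standard SAM FLAG bits (values from the SAM specification).
PAIRED = 0x1      # template has multiple segments, i.e. paired-end
UNMAPPED = 0x4    # segment is unmapped
REVERSE = 0x10    # SEQ is stored reverse-complemented
READ1 = 0x40      # first segment in the template
READ2 = 0x80      # last segment in the template


def decode_flag(flag):
    """Return a dict of booleans for selected bits of an integer SAM flag."""
    return {
        "paired": bool(flag & PAIRED),
        "unmapped": bool(flag & UNMAPPED),
        "reverse": bool(flag & REVERSE),
        "read1": bool(flag & READ1),
        "read2": bool(flag & READ2),
    }


# Both records above carry FLAG 16.
print(decode_flag(16))
```

For FLAG 16 only the reverse-strand bit is set, so these two records are stored as mapped, unpaired reads on the reverse strand.
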
written 8 weeks ago by genomax (62k), modified 6 weeks ago

10x Genomics data means that one read of each pair carries only the cell barcode and UMI. The BAM you excerpted above has that information encoded in its tags (CR/CB for the cell barcode, UR/UB for the UMI; BC is the sample index). The BAM above is also aligned: it has chromosome and position entries.
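
To make the tag layout concrete, here is a rough sketch of pulling the optional fields out of one SAM text line (for example, a line printed by samtools view). It assumes tab-separated fields as in a real SAM file; `parse_tags` is a hypothetical helper name, and the record is a trimmed copy of the first one quoted above with SEQ/QUAL shortened.

```python
def parse_tags(sam_line):
    """Return the optional TAG:TYPE:VALUE fields of one SAM record as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:                         # columns 1-11 are mandatory
        tag, value_type, value = field.split(":", 2)  # the value may contain ':'
        tags[tag] = int(value) if value_type == "i" else value
    return tags


# Trimmed copy of the first record above (SEQ, QUAL and CIGAR shortened).
record = "\t".join([
    "K00135:310:HW73WBBXX:1:1127:22698:20111", "16", "1", "3201566", "255",
    "4M", "*", "0", "0", "CTTT", "AJFF",
    "NH:i:1", "BC:Z:CTCGCGTA", "CR:Z:TGAAAGATCCATGAAC",
    "CB:Z:TGAAAGATCCATGAAC-1", "UR:Z:CCGTCCTATG", "UB:Z:CCGTCCTATG",
    "RG:Z:7wo_B6N_males_Pool2_mm10v7b:MissingLibrary:1:HW73WBBXX:1",
])

tags = parse_tags(record)
print(tags["CB"], tags["UB"])   # cell barcode and UMI for this read
```
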

written 8 weeks ago by swbarnes2 (4.8k), modified 6 weeks ago

Thank you for your help. I will look at downloading the uBAM file and see what samtools fastq does with it.

written 8 weeks ago by V (100)

Hello,

For some reason I don't seem to have the fastq command in my samtools, even though I'm on the latest version. I've run samtools bam2fq on the BAM file and it only generated one fastq file. Shouldn't I have two per cell?

written 6 weeks ago by V (100)

bam2fq is not going to be smart enough to understand that the information in the tags should be combined to make a fastq. You are going to have to parse the fastq yourself if you really want to separate that information out. Are you totally sure that you want to go back to the original fastqs?
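
As a sketch of what "combining the tags" could look like: the function below rebuilds a barcode+UMI read and a cDNA read from one SAM text line. It assumes the tag layout seen in the excerpt above (CR/CY for the raw cell barcode and its qualities, UR/UY for the raw UMI) and that the original first read was barcode followed by UMI; both are assumptions, not something confirmed for this dataset. `rebuild_pair` is a hypothetical name, the example record is a toy, and records missing any of those tags would raise a KeyError. 10x Genomics also distributes a bamtofastq tool intended for this job, which may be the more practical route.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rebuild_pair(sam_line):
    """Rebuild (barcode_read, cdna_read) FASTQ records from one SAM line."""
    f = sam_line.rstrip("\n").split("\t")
    name, flag, seq, qual = f[0], int(f[1]), f[9], f[10]
    tags = {t[:2]: t[5:] for t in f[11:]}      # "CR:Z:ACGT" -> {"CR": "ACGT"}
    if flag & 0x10:
        # Reverse-strand alignments store SEQ reverse-complemented, so undo it.
        seq, qual = reverse_complement(seq), qual[::-1]
    r1 = f"@{name}\n{tags['CR']}{tags['UR']}\n+\n{tags['CY']}{tags['UY']}\n"
    r2 = f"@{name}\n{seq}\n+\n{qual}\n"
    return r1, r2


# Toy record: tag values copied from the excerpt above, SEQ/QUAL invented.
example = "\t".join([
    "read1", "16", "1", "3201566", "255", "4M", "*", "0", "0", "AACG", "ABCD",
    "CR:Z:TGAAAGATCCATGAAC", "CY:Z:AAFFFJJJJJJJJJJJ",
    "UR:Z:CCGTCCTATG", "UY:Z:JJJJJJJJJJ",
])
barcode_read, cdna_read = rebuild_pair(example)
print(barcode_read + cdna_read, end="")
```
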

written 6 weeks ago by swbarnes2 (4.8k)

Realistically I don't mind either way. What I want to do is realign the data to mm10 and generate a count file that covers every cell in the experiment, so I can pass it into Seurat etc.

Would you happen to know of a way around this? My biggest question is why all the cells are smushed into one file. It looks like the most unhelpful way you can deposit data!

written 6 weeks ago by V (100)

You can try contacting the authors to see if they would be willing to give you the data in a different format. Otherwise you are going to have to parse the cell-barcode tags (CB/CR) to split the BAM file. Are you sure the aligned data is not already using mm10?
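
A minimal sketch of splitting by barcode, working on SAM text lines and grouping them by the CB tag. It keeps everything in memory, so it only illustrates the idea on a small slice and is not suitable as-is for a ~33G file; `group_by_cell` and the toy records are hypothetical.

```python
from collections import defaultdict


def group_by_cell(sam_lines):
    """Group SAM record lines by their CB (cell barcode) tag value."""
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        cb = next((t[5:] for t in fields[11:] if t.startswith("CB:Z:")), None)
        if cb is not None:                # reads without a CB tag are dropped
            groups[cb].append(line)
    return dict(groups)


def toy_record(name, cb=None):
    """Build a tiny SAM line for the demo; not real data."""
    fields = [name, "0", "1", "100", "255", "4M", "*", "0", "0", "ACGT", "JJJJ"]
    if cb is not None:
        fields.append("CB:Z:" + cb)
    return "\t".join(fields)


lines = [
    "@HD\tVN:1.4",
    toy_record("r1", "TGAAAGATCCATGAAC-1"),
    toy_record("r2", "TGCTACCAGTACGATA-1"),
    toy_record("r3", "TGAAAGATCCATGAAC-1"),
    toy_record("r4"),                     # no barcode assigned
]
cells = group_by_cell(lines)
for barcode, records in cells.items():
    print(barcode, len(records))
```
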

written 6 weeks ago by genomax (62k)

You have cell barcodes. You have SAM flags, and there should be BAM tags indicating gene IDs. You can parse the file you have.
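
For illustration only, counting from the tags might look roughly like this: collect the distinct UMIs seen for each (cell barcode, gene) pair. It assumes the BAM follows the Cell Ranger convention of a GX tag for the gene ID, which is an assumption here since the two records excerpted earlier carry no such tag. `count_umis`, the toy records and the gene IDs are all hypothetical.

```python
from collections import defaultdict


def count_umis(sam_lines):
    """Count distinct UB values per (CB, GX) pair across SAM record lines."""
    seen = defaultdict(set)
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        tags = {t[:2]: t[5:] for t in fields[11:]}
        if "CB" in tags and "UB" in tags and "GX" in tags:
            seen[(tags["CB"], tags["GX"])].add(tags["UB"])
    return {key: len(umis) for key, umis in seen.items()}


def toy_record(cb, ub, gx=None):
    """Build a tiny SAM line for the demo; not real data."""
    fields = ["r", "0", "1", "100", "255", "4M", "*", "0", "0", "ACGT", "JJJJ",
              "CB:Z:" + cb, "UB:Z:" + ub]
    if gx is not None:
        fields.append("GX:Z:" + gx)
    return "\t".join(fields)


records = [
    toy_record("CELL1-1", "AAAA", "geneA"),
    toy_record("CELL1-1", "AAAA", "geneA"),   # duplicate UMI, counted once
    toy_record("CELL1-1", "CCCC", "geneA"),
    toy_record("CELL2-1", "GGGG", "geneB"),
    toy_record("CELL2-1", "TTTT"),            # no gene tag, ignored
]
counts = count_umis(records)
for (cell, gene), n in sorted(counts.items()):
    print(cell, gene, n)
```

The resulting (cell, gene) counts are the kind of table that could then be reshaped into a matrix for downstream tools.
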

written 6 weeks ago by swbarnes2 (4.8k)