Why does this 10x scRNA-seq data set have one read instead of two reads?

Hello,

It was my understanding that reads from 10x scRNA-seq experiments were always paired-end. I have been downloading a lot of such data and so far I have always found this to be the case. But I found something confusing in this sample:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3534656

It is a 10x sample, and yet there is only one read, no R2 at all. Am I looking at this incorrectly? If I am not, how could one analyze this sample with the missing read?

Thanks so much!!

10x scRNA-seq reads SRA
3.5 years ago
GenoMax 141k

Have you checked the BAM file? We have been through this in your last thread. There is clearly a discrepancy in what NCBI seems to be doing with the sequence files for 10x, but the BAM files should allow you to get the right set of files.

Edit: You could try dumping the reads using --split-files, following what @ATpoint said, but at this point my preference would be to get the original BAM. Use bamtofastq from 10x and make sure you get the right data back.
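
A minimal sketch, assuming pysam and a Cell Ranger-style BAM that carries the usual barcode/UMI tags (CR/CY and UR/UY for the raw reads, CB/UB for the corrected ones), of checking that those tags are actually present before running bamtofastq; the file name is a placeholder:

import pysam  # assumption: pysam is installed (pip install pysam)

bam_path = "GSM3534656_possorted_genome_bam.bam"  # placeholder path to the downloaded BAM

with pysam.AlignmentFile(bam_path, "rb", check_sq=False) as bam:
    for i, read in enumerate(bam):
        if i >= 5:  # a handful of records is enough for a quick look
            break
        # CR/UR hold the raw cell barcode / UMI; CB/UB are the error-corrected versions.
        tags = {t: read.get_tag(t) for t in ("CR", "UR", "CB", "UB") if read.has_tag(t)}
        print(read.query_name, len(read.query_sequence or ""), tags)

If the barcode and UMI tags show up on the first few records, bamtofastq should be able to give you back a proper R1/R2 pair.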

Oh wow, I totally missed that BAM file; I am sorry about that. Yes, I will use that, thanks a lot! Sorry for repeating myself. I am re-analyzing a large number of 10x datasets (over 200 so far), and I had never seen these discrepancies in any of them until I found 3-4 like this in a row, which got me confused. If you don't mind me asking, what do you think is actually happening? Do you think this is the authors' fault? I have never uploaded raw data to GEO, so I don't know how much flexibility (or how many chances to mess things up) you have when you do that. Thanks again!!

The BAM file for the sample looks OK:

@NB501047:189:HLHT3BGX5:3:21411:1860:4184 1:N:0:0
ACTTACTCAATCGGTTACATCACCAAN
+
AAAAAEEEEEEEEEEEEEEEEEEEEE#

@NB501047:189:HLHT3BGX5:3:21411:1860:4184 3:N:0:0
CAAAGGCCCGGTGGAAAGGACACGGGAAGTCTGGGCTAAGAGACAGCAAATACACATGAACAGAAAGAAGAGGTCAAAGAAAAGGCTGACGGCAAGTTAACGAAAAGAAAAATGGTGAATG
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEE//EEEAE6EEEEEEEA6EEEAEEEE/EE/AEEEEEE<AEEEAA</AEEEEAAEEAAE<
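
For what it's worth, the short read above matches the v2 layout of the technical read (16 bp cell barcode followed by a 10 bp UMI; the "3" in the second header is presumably just the cDNA read's position after the index read). A tiny sketch, assuming that v2 layout, of splitting it into its parts:

r1_seq = "ACTTACTCAATCGGTTACATCACCAAN"  # the technical read dumped above

CB_LEN, UMI_LEN = 16, 10  # 10x v2 lengths (v3 would be 16 + 12)
cell_barcode = r1_seq[:CB_LEN]
umi = r1_seq[CB_LEN:CB_LEN + UMI_LEN]

print("cell barcode:", cell_barcode)              # ACTTACTCAATCGGTT
print("UMI:         ", umi)                       # ACATCACCAA
print("leftover:    ", r1_seq[CB_LEN + UMI_LEN:]) # trailing base, ignored by most tools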

I was able to analyse it correctly, thanks so much! I certainly missed the BAM file, and that was my bad. My question was more about understanding what is going on. I am new to this field, and I don't know exactly why the discrepancy you mention can even occur, especially given that it doesn't occur with most datasets. Do you have any notion of who might have made a mistake here? Was it the nature of the data? The authors? NCBI?

Do you think this is the authors' fault?

Who knows. 10x data is not standard Illumina (in a sense), so the problem may lie either with the submitters or with NCBI. At least NCBI is making the original BAM available. You could email the NCBI help desk with the accession numbers of the records you found to be missing proper data and ask them.

Thanks so much genomax, this is so useful. I appreciate you taking the time to teach newbies like me!!!

3.5 years ago
ATpoint 81k

The first read (R1) is a technical read that contains the cell barcode (CB) and the unique molecular identifier (UMI); R2 then reads the actual cDNA. Still, for the sake of running most tools you need R1, and it should have been uploaded to GEO. The 120 bp here is an uncommon read length ( https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8377703 ); maybe they merged R1 and R2 and uploaded that to SRA, but I am not sure. That would roughly equal the 28 bp for R1 plus the 91 bp for R2, which is the 10x recommendation for v3 libraries.
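
If that guess is right (i.e. each deposited read is simply R1 concatenated in front of R2, with the 28 bp CB+UMI coming first), a rough sketch of splitting the merged FASTQ back into R1/R2 could look like the following; the file names are placeholders, and getting the original BAM as suggested above is the safer route:

import gzip

R1_LEN = 28  # 16 bp cell barcode + 12 bp UMI for 10x v3 chemistry

with gzip.open("SRR8377703.fastq.gz", "rt") as merged, \
     gzip.open("split_R1.fastq.gz", "wt") as r1_out, \
     gzip.open("split_R2.fastq.gz", "wt") as r2_out:
    while True:
        header = merged.readline().rstrip()
        if not header:          # end of file
            break
        seq = merged.readline().rstrip()
        merged.readline()       # '+' separator line
        qual = merged.readline().rstrip()
        # First 28 bases/qualities become the technical read, the remainder the cDNA read.
        r1_out.write(f"{header}\n{seq[:R1_LEN]}\n+\n{qual[:R1_LEN]}\n")
        r2_out.write(f"{header}\n{seq[R1_LEN:]}\n+\n{qual[R1_LEN:]}\n")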

That makes sense, thanks so much! But jeez, why would they do this???
