23 months ago

Hello,

It was my understanding that reads from 10x scRNA-seq experiments were always paired-end. I have been downloading a lot of such data and so far I have always found this to be the case. But I found something confusing in this sample:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3534656

It is a 10x sample, and yet there is only one read, no R2 at all. Am I looking at this incorrectly? If I am not, how could one analyze this sample with the missing read?

Thanks so much!!

10x scRNA-seq reads SRA • 988 views
23 months ago
GenoMax 121k

Have you checked the BAM file? We have been through this in your last thread. There is clearly a discrepancy in how NCBI seems to be handling the sequence files for 10x data, but the BAM files should allow you to recover the right set of files.

Edit: You could try dumping the reads using --split-files, following what @ATPoint said, but at this point my preference would be to get the original BAM. Use bamtofastq from 10x and make sure that you get the right data back.
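
For readers who want to see the two routes side by side, here is a minimal sketch that only builds and prints the command lines. It assumes `fastq-dump` (SRA Toolkit) and 10x Genomics' `bamtofastq` are installed; the helper names, the BAM file name, and the output directory are hypothetical placeholders, and SRR8377703 is the run linked elsewhere in this thread.

```python
import shlex


def fastq_dump_cmd(run_accession):
    """Build a fastq-dump call that writes each read of a spot to its own file."""
    return ["fastq-dump", "--split-files", run_accession]


def bamtofastq_cmd(bam_path, out_dir):
    """Build a 10x bamtofastq call: input BAM first, then the output directory."""
    return ["bamtofastq", bam_path, out_dir]


# The BAM name and output directory are made-up placeholders.
commands = [
    fastq_dump_cmd("SRR8377703"),
    bamtofastq_cmd("original_10x.bam", "fastq_from_bam"),
]

for cmd in commands:
    # Print rather than execute, so the sketch runs without the tools installed.
    print(shlex.join(cmd))
```

To actually run one of them, each list could be passed to `subprocess.run(cmd, check=True)`.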


Oh wow, I totally missed that BAM file. I am sorry about that. Yes, I will use that, thanks a lot! I am sorry I am repeating myself. I am re-analyzing a large number of 10x datasets, and I have done over 200 so far. I had never seen these discrepancies in any of them until I found 3-4 like this in a row, which got me so confused. If you don't mind me asking, what do you think is actually happening? Do you think this is the authors' fault? I have never uploaded raw data to GEO, so I don't know how much flexibility (or how many chances to mess up) you have when you do that. Thanks again!!


BAM file for the sample looks OK.

@NB501047:189:HLHT3BGX5:3:21411:1860:4184 1:N:0:0
ACTTACTCAATCGGTTACATCACCAAN
+
AAAAAEEEEEEEEEEEEEEEEEEEEE#

@NB501047:189:HLHT3BGX5:3:21411:1860:4184 3:N:0:0
CAAAGGCCCGGTGGAAAGGACACGGGAAGTCTGGGCTAAGAGACAGCAAATACACATGAACAGAAAGAAGAGGTCAAAGAAAAGGCTGACGGCAAGTTAACGAAAAGAAAAATGGTGAATG
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEE//EEEAE6EEEEEEEA6EEEAEEEE/EE/AEEEEEE<AEEEAA</AEEEEAAEEAAE<
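
A quick way to check which reads a FASTQ file actually contains is to look at read lengths. The sketch below parses 4-line FASTQ records and flags short reads as likely barcode/UMI reads. The parser, the helper name `looks_like_barcode_read`, and the 30 bp cutoff are illustrative assumptions, not part of any 10x tool; the example record is the first one shown above.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().strip()
        yield header[1:], seq, qual


def looks_like_barcode_read(seq, max_len=30):
    """Assumed heuristic: 10x barcode+UMI reads are short (under ~30 bp)."""
    return len(seq) <= max_len


example = StringIO(
    "@NB501047:189:HLHT3BGX5:3:21411:1860:4184 1:N:0:0\n"
    "ACTTACTCAATCGGTTACATCACCAAN\n"
    "+\n"
    "AAAAAEEEEEEEEEEEEEEEEEEEEE#\n"
)

records = list(read_fastq(example))
for name, seq, qual in records:
    kind = "barcode/UMI-like" if looks_like_barcode_read(seq) else "cDNA-like"
    print(name, len(seq), kind)
```

If every read in a downloaded file is long and none is short, the barcode read is probably missing or has been folded into the other read.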


I was able to analyse it correctly, thanks so much! I certainly missed the BAM file and that was my bad. My question was more about understanding what is going on. I am new to this field, and I don't know exactly why the discrepancy you mention can even occur, especially given that I know it doesn't occur with most datasets. Do you have some notion of who could have made a mistake here? Was it the nature of the data? The authors? NCBI?


Do you think this is the authors' fault?

Who knows. 10x data is not standard Illumina (in a sense), so the problem may be either with the submitters or with NCBI. At least NCBI is making the original BAM available. You could email the NCBI help desk with the accession numbers of the records you found to lack proper data and ask them.


Thanks so much genomax, this is so useful. I appreciate you taking the time to teach newbies like me!!!

23 months ago
ATpoint 65k

The first read (R1) is a technical read that contains the cell barcode (CB) and the unique molecular identifier (UMI); R2 then reads the actual cDNA. Still, most tools need R1 to run, so it should have been uploaded to GEO. The 120 bp here is an uncommon read length ( https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8377703 ). Maybe they merged R1 and R2 and uploaded the result to SRA; I am not sure. That would roughly equal 28 bp for R1 plus 91 bp for R2, which is the 10x recommendation for v3 libraries.
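
If the reads really were concatenated as R1 followed by R2 (an unconfirmed guess, as noted above), they could be split back apart along these lines. This is a minimal sketch assuming the v3 layout of a 16 bp cell barcode plus a 12 bp UMI in the first 28 bp; the function name and the synthetic test read are made up for illustration.

```python
def split_merged_read(seq, qual, r1_len=28, cb_len=16):
    """Split a read assumed to be R1 (barcode + UMI) followed directly by R2 (cDNA).

    r1_len=28 and cb_len=16 follow the 10x v3 layout (16 bp barcode, 12 bp UMI);
    whether a given SRA run was actually merged this way is an assumption.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if len(seq) <= r1_len:
        raise ValueError("read is too short to contain both R1 and R2")
    r1_seq, r2_seq = seq[:r1_len], seq[r1_len:]
    r1_qual, r2_qual = qual[:r1_len], qual[r1_len:]
    return {
        "cell_barcode": r1_seq[:cb_len],
        "umi": r1_seq[cb_len:],
        "r1": (r1_seq, r1_qual),
        "r2": (r2_seq, r2_qual),
    }


# Synthetic 119 bp read: 16 bp barcode, 12 bp UMI, 91 bp cDNA.
merged_seq = "A" * 16 + "C" * 12 + "G" * 91
merged_qual = "E" * len(merged_seq)
parts = split_merged_read(merged_seq, merged_qual)
print(len(parts["cell_barcode"]), len(parts["umi"]), len(parts["r2"][0]))
```

Recovering the reads from the original BAM remains the safer route, since it does not depend on guessing how the upload was assembled.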


That makes sense, thanks so much! But jeez, why would they do this???