10x 3' library creates R1 and R2 fastq files with the same read length
1
0
5 weeks ago
tomas4482 ▴ 280

Let me show you an example: https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR16093385&display=metadata

This dataset contains two reads, R1 and R2. R1 and R2 have the same read length, 150 bp.

However, this experiment was performed following the 10x 3' library protocol. The methods section describes it as follows:

The scRNA-seq libraries were generated using the 10x Genomics Chromium Controller Instrument and Chromium Single Cell 3' V3 Reagent Kits (10x Genomics). Briefly, cells were concentrated to 1,000 cells/mL and approximately 8,000–10,000 cells were loaded into each channel to generate single-cell gel bead-in-emulsions (GEM), which resulted in the expected mRNA barcoding of 3,000–8,000 single cells for each sample. After the reverse transcription step, GEMs were broken and barcoded cDNA was purified and amplified. The amplified barcoded cDNA was fragmented, A-tailed, ligated with adaptors and index PCR amplified. The final libraries were quantified using a Qubit High Sensitivity DNA assay (Thermo Fisher Scientific) and the size distribution of these libraries was determined by a High Sensitivity DNA chip on a Bioanalyzer 2200 (Agilent). All libraries were then sequenced by an Illumina sequencer (Illumina) on a 150 bp paired-end run.

Generally, fastq files from a 10x 3' library should be I1, R1 and R2. R1 contains only the UMI and barcode, hence R1 is far shorter than R2. According to this paper, they generated double-stranded cDNA, in which both strands have the UMI and barcode (I think?). It seems reasonable that this would generate two fastq files with equal read length, like paired-end sequencing data.

When downloading such files from either SRA or ENA, I always get these two fastq files. I think the index, UMI and barcode should be in the reads, but I don't know how to extract them and split the SRA or fastq file into the default format of 10x scRNA-seq fastq files.

When I look up the original data stored on AWS, the filenames do not follow the normal naming format for 10x 3' library fastq files: s3://sra-pub-src-10/SRR16093385/OC17-1_BKDL192531646-1a-AK1647_1.fq.gz.1 and s3://sra-pub-src-9/SRR16093385/OC17-1_BKDL192531646-1a-AK1647_2.fq.gz.1

BTW, the example I provided here is not the only case. I have found the same issue in another dataset. It's strange and confusing.

10x fastq scRNA-seq • 322 views
2
5 weeks ago
ATpoint 64k

Don't worry, this is very common. Most 10x libraries are sequenced on a NovaSeq, where the standard read length is 2x150 bp no matter what you sequence. It is true that for 10x you only need a fraction of the full R1, but as said, if the run was 2x150 bp (almost all of our 10x data are sequenced that way) then R1 is 150 bp, and the only harm is that it occupies unnecessary disk space. CellRanger and other tools like salmon-alevin will use only the first few bases. I1 is only necessary for demultiplexing, so once the fastq files have been created you never need it again; it is usually not uploaded to NCBI because it is useless. R1 has the CB/UMI. Yes, both strands have the UMI, as DNA is always double-stranded after the library preparation PCR, but none of this is anything to worry about. If you feed this into standard preprocessing tools it will be handled correctly. Does that make sense to you?
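To make the "first few bases" point concrete, here is a minimal Python sketch. It assumes the 10x 3' v3 layout, in which R1 starts with a 16 bp cell barcode followed by a 12 bp UMI (28 bp in total; v2 uses a 10 bp UMI, so 26 bp). The function name `split_r1` and the example read are made up for illustration; this is not CellRanger's code, only the slicing idea.

```python
# Assumed 10x 3' v3 layout of R1: 16 bp cell barcode, then 12 bp UMI.
CB_LEN = 16
UMI_LEN = 12


def split_r1(seq, cb_len=CB_LEN, umi_len=UMI_LEN):
    """Return (cell_barcode, umi) taken from the start of an R1 sequence."""
    if len(seq) < cb_len + umi_len:
        raise ValueError("R1 is shorter than barcode + UMI")
    # Everything after the first cb_len + umi_len bases is simply ignored.
    return seq[:cb_len], seq[cb_len:cb_len + umi_len]


# Made-up 150 bp read: 16 bp barcode + 12 bp UMI + 122 bp that are not needed.
r1 = "AAACCCAAGAAACACT" + "GGTTAACCGGTT" + "T" * 122
cb, umi = split_r1(r1)
print(cb, umi)
```

The extra 122 bases change nothing about the result, which is why a 150 bp R1 is harmless for downstream tools.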

0

Very clear. Thank you very much. May I ask one more question? If I want to extract the UMI and barcode from R1 and keep the R2 insert reads in a separate fastq, is there any tool that can do this? Is it possible to like "re-create" the R1 and R2 files? I know UMI-tools has a function to extract UMI and barcode sequences, but it seems to expect the standard 10x R1.fastq as input.

0

Is it possible to like "re-create" the R1 and R2 files?

The links you provided above are for the original fastq data submitted to NCBI (normally found under the Data Access tab, in the Original format section). In some instances people also submit cellranger BAM files. You can then use the bamtofastq utility provided by 10x to recreate the original fastq files. It is now included in the cellranger package.

0

You're correct. In some instances the BAM file is an option, and data stored in the AWS cloud is sometimes publicly accessible. But in many other cases, only _R1.fq and _R2.fq are provided. That was why I was confused. ATpoint has explained why I always get two 150 bp reads. But I still don't know if there is any method to recreate those original fastq files. I don't use CellRanger because I'm trying to use these raw data for another special purpose rather than quantifying gene expression. I've read the CellRanger manual, and it does not seem to mention any function for splitting reads archived by SRA in this format into I1, R1 and R2.

Would you mind telling me whether there is any tool that can achieve this task? Thank you.

0

But I still don't know if there is any method to recreate those original fastq files.

The files you linked to are the original reads for one sample. I1 files, if they exist, simply contain the Illumina index sequences. The sequence in that file is identical for every read in one sample.

0

The I1 fastq is actually not my concern; many datasets in SRA do not provide it either. What I need are the R1 fastq (which contains only the UMI and barcode, so its length should be 26 or 28 bp) and the R2 fastq (which contains only the insert sequence, ranging from 98 to 150 bp). The sequences for these two fastq files should already be included in the files downloaded from SRA (in which both files have a read length of 150 bp).

0

That is what ATpoint explained above. Tools like cellranger will automatically use the parts of each read that they need, e.g. the first 26-28 bp of Read 1 to get the UMI/cell barcodes.

These submitters sequenced both Read 1 and Read 2 much longer than recommended or necessary, and submitted the sequences as is. You can manually trim the reads down if you want them at the recommended length.
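If you do want to trim manually, a minimal Python sketch could look like the one below. It assumes gzipped files with plain 4-line FASTQ records, and the function name `trim_fastq` is hypothetical. Keeping the first 28 bases of Read 1 matches the v3 barcode + UMI length (use 26 for v2).

```python
import gzip


def trim_fastq(in_path, out_path, keep):
    """Copy a gzipped FASTQ, cutting every read to its first `keep` bases.

    Assumes strict 4-line records (header, sequence, '+', quality).
    Returns the number of records written.
    """
    n = 0
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        while True:
            header = fin.readline()
            if not header:
                break  # end of file
            seq = fin.readline().rstrip("\n")
            plus = fin.readline()
            qual = fin.readline().rstrip("\n")
            fout.write(header)
            fout.write(seq[:keep] + "\n")
            fout.write(plus)
            # Quality string must be trimmed to the same length as the sequence.
            fout.write(qual[:keep] + "\n")
            n += 1
    return n
```

For example, `trim_fastq("sample_1.fq.gz", "sample_R1_28bp.fq.gz", 28)` (the file names are placeholders). Trimming does not drop or reorder reads, so Read 1 and Read 2 stay paired as long as each file is processed in full.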