Let me show you an example: https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR16093385&display=metadata
This dataset contains two reads, R1 and R2. R1 and R2 have the same read length, 150 bp.
However, this experiment was performed following the 10x 3' library protocol. The methods section describes it as follows:
The scRNA-seq libraries were generated using the 10x Genomics Chromium Controller Instrument and Chromium Single Cell 3' V3 Reagent Kits (10x Genomics). Briefly, cells were concentrated to 1,000 cells/mL and approximately 8,000–10,000 cells were loaded into each channel to generate single-cell gel bead-in-emulsions (GEM), which resulted in the expected mRNA barcoding of 3,000–8,000 single cells for each sample. After the reverse transcription step, GEMs were broken and barcoded cDNA was purified and amplified. The amplified barcoded cDNA was fragmented, A-tailed, ligated with adaptors and index PCR amplified. The final libraries were quantified using a Qubit High Sensitivity DNA assay (Thermo Fisher Scientific) and the size distribution of these libraries was determined by a High Sensitivity DNA chip on a Bioanalyzer 2200 (Agilent). All libraries were then sequenced by an Illumina sequencer (Illumina) on a 150 bp paired-end run.
Generally, the fastq files from a 10x 3' library should be I1, R1 and R2. R1 contains only the UMI and barcode, hence R1 is far shorter than R2. According to this paper, they generated double-stranded cDNA in which both strands have the UMI and barcode (I think?). So it seems reasonable to get two fastq files with equal read length, as in ordinary paired-end sequencing data.
When downloading such a file from either SRA or ENA, I always get these two fastq files. I think the index, UMI and barcode should be in the reads, but I don't know how to extract them and split the SRA or fastq file into the default format of 10x scRNA-seq fastq files.
When I look up the original data stored on AWS, the file names do not follow the normal format for 10x 3' library fastq files.
BTW, the example I provided here is not the only case. I have found this issue in another dataset. It's strange and confusing.
Very clear. Thank you very much. May I ask one more question? If I want to extract the UMI and barcode from R1, and keep the R2 insert reads in a separate fastq, is there any tool that can do this? Is it possible to "re-create" the R1 and R2 files? I know UMI_tools has a function to extract UMI and barcode sequences, but it seems to expect the standard 10x R1.fastq as input.
Links you provided above are for the original fastq data submitted to NCBI (which is normally under the Data Access tab in the Original format section). In some instances people also submit cellranger BAM files. You can then use the bamtofastq utility provided by 10x to recreate the original fastq files. It is now included in
You're correct. In some instances the BAM file is an option, and data stored in the AWS cloud is sometimes publicly accessible. But in many other cases, only _R1.fq and _R2.fq were provided. That is why I was confused. ATpoint has explained why I always get two 150 bp reads, but I still don't know whether there is any method to recreate the original fastq files. I don't use CellRanger because I'm trying to use these raw data for another purpose, not for quantifying gene expression. I've read the CellRanger manual, and it does not seem to mention any function to split reads archived by SRA in this format into I1, R1 and R2.
Would you mind telling me whether there is any tool that can achieve this? Thank you.
The files you linked to are the original reads for one sample. I1 files, if they exist, are simply Illumina index sequences. The sequence in that file is identical for every read in one sample.
The I1 fastq is actually not my concern. Many datasets in SRA do not provide an I1 fastq either. What I need are the R1 (which only has the UMI and barcode, so its length should be 26/28 bp) and R2 (which only has the insert sequence, varying from 98 to 150 bp) fastq files. The sequences of these two fastq files should be included in the files downloaded from SRA (in which both files have a read length of 150 bp).
That is what ATpoint explained above. Tools like cellranger will automatically use the parts of the read that they need, e.g. the first 26-28 bp of Read 1 to get the UMI/cell barcodes.
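To make that concrete, here is a minimal Python sketch of what "using only the first 26-28 bp" amounts to. It assumes the 10x 3' v3 layout (16 bp cell barcode followed by a 12 bp UMI at the start of Read 1; v2 uses a 10 bp UMI). The function name and the example read are made up for illustration and are not part of cellranger or any other tool.

```python
from typing import NamedTuple


class BarcodeUmi(NamedTuple):
    cell_barcode: str
    umi: str


def split_barcode_umi(read1_seq: str, cb_len: int = 16, umi_len: int = 12) -> BarcodeUmi:
    """Slice the cell barcode and UMI from the 5' end of a Read 1 sequence.

    Defaults assume 10x 3' v3 chemistry (16 bp barcode + 12 bp UMI);
    pass umi_len=10 for v2. Bases after cb_len + umi_len are ignored.
    """
    needed = cb_len + umi_len
    if len(read1_seq) < needed:
        raise ValueError(f"Read 1 is {len(read1_seq)} bp, need at least {needed} bp")
    return BarcodeUmi(read1_seq[:cb_len], read1_seq[cb_len:needed])


# Made-up 150 bp Read 1: 16 bp barcode, 12 bp UMI, then 122 bp that are ignored
example = "ACGTACGTACGTACGT" + "TTTTGGGGCCCC" + "T" * 122
print(split_barcode_umi(example))
```

The remaining 122 bp of such a Read 1 are simply never looked at, which is why the over-long reads are harmless for barcode/UMI-aware pipelines.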
These submitters sequenced the samples much longer than recommended/necessary both for Read 1 and Read 2 and submitted the sequences as is. You can manually trim the reads down if you want to make them recommended length.
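If you do want to trim manually, a rough sketch in Python could look like the following. It assumes plain 4-line FASTQ records and gzipped input, and the function names are hypothetical. Read names and read order are left untouched, so R1 and R2 stay paired.

```python
import gzip
from typing import Iterable, Iterator, Optional


def trim_fastq(lines: Iterable[str], keep: Optional[int]) -> Iterator[str]:
    """Yield FASTQ lines with sequence and quality cut to the first `keep` bases.

    Assumes 4-line records (no wrapped sequences). keep=None leaves reads as is.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Lines 2 and 4 of each record (index 1 and 3) hold sequence and quality
        if keep is not None and i % 4 in (1, 3):
            line = line[:keep]
        yield line + "\n"


def trim_fastq_file(in_path: str, out_path: str, keep: Optional[int]) -> None:
    """Trim a gzipped FASTQ file; both paths are chosen by the caller."""
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        fout.writelines(trim_fastq(fin, keep))


# Tiny in-memory demo with a made-up 40 bp Read 1 record
record = ["@read1\n", "ACGT" * 10 + "\n", "+\n", "F" * 40 + "\n"]
trimmed = list(trim_fastq(record, keep=28))
print(trimmed[1].strip(), len(trimmed[1].strip()))
```

For a v3 library you would call something like trim_fastq_file("in_R1.fastq.gz", "out_R1.fastq.gz", 28) on Read 1 (placeholder file names) and either leave Read 2 alone or cut it to the length you want. Note that any "length=150" annotation in SRA-style header lines is not updated by this sketch.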
Hello, I have a question, if you can help me please.
No, you do not need trimmomatic since you will be following the alevin scRNA seq pipeline etc.
It's fine to be new to a field, but what is not fine is to be resistant to advice. Two experienced users have already told you that you don't need trimmomatic trimming for your scRNA seq data, so what is the point of insisting on it? Use CellRanger with your 10x data as advised in the CellRanger manual and be done with it.
Sorry, I got it: no trimmomatic. Thank you.
Hi! Just so that I understand this correctly: for 10x v3 libraries sequenced on a NovaSeq, does it mean that if R1 is 150 bp and looks like:
Then we can effectively ignore everything from bp 29 onwards? I am asking this in relation to this biostars post on STARsolo. Thank you!