10x 3' library creates R1 and R2 fastq files with the same read length
21 months ago
tomas4482 ▴ 390

Let me show you an example: https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR16093385&display=metadata

This run contains two reads, R1 and R2, and both have the same read length of 150 bp.

However, the experiment was performed following the 10x 3' library protocol. The methods section describes it as follows:

The scRNA-seq libraries were generated using the 10x Genomics Chromium Controller Instrument and Chromium Single Cell 3' V3 Reagent Kits (10x Genomics). Briefly, cells were concentrated to 1,000 cells/mL and approximately 8,000–10,000 cells were loaded into each channel to generate single-cell gel bead-in-emulsions (GEM), which resulted in the expected mRNA barcoding of 3,000–8,000 single cells for each sample. After the reverse transcription step, GEMs were broken and barcoded cDNA was purified and amplified. The amplified barcoded cDNA was fragmented, A-tailed, ligated with adaptors and index PCR amplified. The final libraries were quantified using a Qubit High Sensitivity DNA assay (Thermo Fisher Scientific) and the size distribution of these libraries was determined by a High Sensitivity DNA chip on a Bioanalyzer 2200 (Agilent). All libraries were then sequenced by an Illumina sequencer (Illumina) on a 150 bp paired-end run.

Generally, the fastq files from a 10x 3' library should be I1, R1 and R2. R1 contains only the UMI and barcode, so it is much shorter than R2. According to this paper, they generated double-stranded cDNA in which both strands carry the UMI and barcode (I think?). If so, it seems reasonable to get two fastq files of equal read length, as with ordinary paired-end sequencing data.

Whenever I download such a run from either SRA or ENA, I always get these two fastq files. I think the index, UMI and barcode must be somewhere in the reads, but I don't know how to extract them and split the SRA or fastq file into the default format of 10x scRNA-seq fastq files.

When I look up the original data stored on AWS, the filenames do not follow the usual format for 10x 3' library fastq files: s3://sra-pub-src-10/SRR16093385/OC17-1_BKDL192531646-1a-AK1647_1.fq.gz.1 and s3://sra-pub-src-9/SRR16093385/OC17-1_BKDL192531646-1a-AK1647_2.fq.gz.1

BTW, the example I provided here is not the only case; I have found the same issue in another dataset. It is strange and confusing.

10x fastq scRNA-seq • 6.1k views
21 months ago
ATpoint 81k

Don't worry, this is very common. Most 10x libraries are sequenced on a NovaSeq, and the standard read length for it is 2x150 bp, no matter what you sequence. It is true that for 10x you only need a fraction of the full R1 (the cell barcode and UMI at its start), but as said, if the run was 2x150 bp (almost all of our 10x data are sequenced that way) then R1 is 150 bp, and there is no harm other than that it occupies unnecessary disk space. CellRanger and other tools like salmon-alevin will use only the first few bases. I1 is only necessary for demultiplexing, so once the fastq files have been created you never need it again; it is usually not uploaded to NCBI because it is useless. R1 has the CB/UMI. Yes, both strands have the UMI, as DNA is always double-stranded after the library preparation PCR, but this is nothing to worry about. If you feed this into standard preprocessing tools it will be handled correctly. Does that make sense to you?
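
To make the "first few bases only" point concrete, here is a minimal Python sketch of where the cell barcode and UMI sit in R1. It assumes the standard 10x 3' layouts (a 16 bp cell barcode followed by a 10 bp UMI for v2 chemistry, or a 12 bp UMI for v3). The function name `split_r1` and the example read are made up for illustration; this is not how CellRanger is implemented internally.

```python
# Assumed R1 layout for 10x 3' libraries: cell barcode first, then UMI.
# Values are (barcode length, UMI length) per chemistry.
R1_LAYOUT = {"v2": (16, 10), "v3": (16, 12)}


def split_r1(seq, chemistry="v3"):
    """Return (cell_barcode, umi, ignored_tail) for one R1 sequence."""
    cb_len, umi_len = R1_LAYOUT[chemistry]
    if len(seq) < cb_len + umi_len:
        raise ValueError("R1 is shorter than barcode + UMI")
    cb = seq[:cb_len]
    umi = seq[cb_len:cb_len + umi_len]
    tail = seq[cb_len + umi_len:]  # extra cycles; not needed for CB/UMI
    return cb, umi, tail


# Hypothetical 150 bp R1: 16 bp barcode + 12 bp UMI + 122 extra bases.
r1 = "ACGTACGTACGTACGT" + "TTGCAATGCCAA" + "T" * 122
cb, umi, tail = split_r1(r1, "v3")
print(cb, umi, len(tail))
```

Whatever follows the barcode and UMI in a longer R1 is simply left unused, which is why a 150 bp R1 gives the same barcodes as a 28 bp one.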

Very clear. Thank you very much. May I ask one more question? If I want to extract the UMI and barcode from R1 and keep the R2 insert reads in a separate fastq, is there any tool that can do this? Is it possible to like "re-create" the R1 and R2 files? I know UMI-tools has a function to extract the UMI and barcode sequence, but it seems to expect the standard 10x R1 fastq as input.

Is it possible to like "re-create" the R1 and R2 files?

The links you provided above are for the original fastq data submitted to NCBI (normally found under the Data Access tab, in the Original format section). In some instances people also submit CellRanger BAM files. You can then use the bamtofastq utility provided by 10x to recreate the original fastq files. It is now included in the CellRanger package.

You're correct. In some instances the BAM file is an option, and data stored in the AWS cloud is sometimes publicly accessible. But in many other cases only _R1.fq and _R2.fq are provided, which is why I was confused. ATpoint has explained why I always get two 150 bp reads. But I still don't know if there is any method to recreate those original fastq files. I don't use CellRanger because I'm trying to use these raw data for a special purpose other than quantifying gene expression. I've read the CellRanger manual, and it does not seem to mention any function to split reads in the format archived by SRA into I1, R1 and R2.

Would you mind telling me if there is any tool that can do this? Thank you.

But I still don't know if there is any method to recreate those original fastq files.

The files you linked to are the original reads for one sample. I1 files, if they exist, simply contain the Illumina index sequences. The sequence in that file is identical for every read in one sample.

The I1 fastq is actually not my concern; many datasets in SRA do not provide it either. What I need are the R1 fastq (which has only the UMI and barcode, so its length should be 26/28 bp) and the R2 fastq (which has only the insert sequence, varying from 98 to 150 bp). The sequences of these two files should already be contained in the files downloaded from SRA (in which both reads are 150 bp long).

That is what ATpoint explained above. Tools like CellRanger will automatically use the parts of each read that they need, e.g. the first 26-28 bp of Read 1 to get the UMI/cell barcodes.

These submitters sequenced both Read 1 and Read 2 much longer than recommended/necessary and submitted the sequences as is. You can manually trim the reads down if you want to bring them to the recommended length.
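
If trimmed files are really needed, for example for a tool that does not clip the reads itself, the manual trimming can be sketched in a few lines of Python. This assumes plain four-line FASTQ records in gzip-compressed files. The function name `trim_fastq`, the demo file names and the 28 bp length used in the demo are illustrative choices, not values taken from the dataset discussed here.

```python
import gzip
import os
import tempfile


def trim_fastq(in_path, out_path, keep):
    """Keep only the first `keep` bases (and qualities) of every read."""
    n_reads = 0
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        while True:
            header = fin.readline()
            if not header:
                break  # end of file
            seq = fin.readline().rstrip("\n")
            plus = fin.readline()
            qual = fin.readline().rstrip("\n")
            fout.write(header)
            fout.write(seq[:keep] + "\n")
            fout.write(plus)
            fout.write(qual[:keep] + "\n")
            n_reads += 1
    return n_reads


# Tiny self-made demo (not real data): a single 150 bp read.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "demo_1.fq.gz")
dst = os.path.join(tmp_dir, "demo_1.trimmed.fq.gz")
with gzip.open(src, "wt") as fh:
    fh.write("@read1\n" + "A" * 150 + "\n+\n" + "F" * 150 + "\n")

n = trim_fastq(src, dst, keep=28)  # e.g. 28 for a v3 R1
print(n)
```

Because no reads are dropped and each file is processed independently, R1 and R2 stay in the same order after trimming, so the pair can still be used together.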

Hello, I have a couple of questions, if you can help me please:

  • Are the adaptors added to R1 or R2?
  • I know that CellRanger performs trimming, but can I trim my reads using Trimmomatic?

Thank you,

No, you do not need Trimmomatic, since you will be following a CellRanger, alevin or similar scRNA-seq pipeline.

It's fine to be new to a field, but what is not fine is to be resistant to advice. Two experienced users have already told you that you don't need Trimmomatic, or any trimming, for your scRNA-seq data, so what is the point of insisting on it? Use CellRanger with your 10x data as advised in the CellRanger manual and be done with it.

Sorry, I got it: no Trimmomatic. Thank you.

Dear GenoMax,

Hello. I have carefully read your and ATpoint's answers. I am working on a project similar to tomas4482's, where I use SAHMI to annotate microbial information from single-cell sequencing data. SAHMI requires kraken2 to compute k-mer values from the sequencing reads. As you suggested, I can use bamtofastq to obtain the official 10x fastq files, where R1 contains only the barcode and UMI and R2 contains only the insert sequence. For the R1 and R2 fastq files that I downloaded with a read length of 150 bp, I want to extract the same information. As you mentioned, I need to do it manually. The barcode and UMI are the first 26 or 28 bases of the Fastq1 file. For the Fastq2 file, how can I locate the 91 or 98bp in the Fastq2 file? I can only extract them if I know their positions. I would be very grateful if you could help me with this. Thank you!

For the Fastq2 file, how can I locate the 91 or 98bp in the Fastq2 file?

This is the structure of the 10x libraries: https://kb.10xgenomics.com/hc/en-us/articles/360035999892-What-is-the-structure-of-the-final-Visium-for-fresh-frozen-library-

If your software expects 91 or 98 bp, then you can take the first 91/98 bp from the fastq2 file. You can trim the data with bbduk.sh or any other trimming program to keep that many bases.

Thank you! I have written a script to keep the first 91/98 bp. I then ran CellRanger both before and after trimming to compare the results, and they were almost the same, with some minor differences. I followed ATpoint's suggestion and posted a new question here: Extract the true single-cell RNA sequencing reads for running SAHMI. Thank you very much for your reply!

Please open a new question for this one.

Hi! Just so that I understand this correctly: for 10x v3 libraries sequenced on a NovaSeq, does it mean that if R1 is 150 bp and looks like:

GNAACATGTTATAGCCTGGAATATCAGATTATTGTATATCATAAGTAGTCTCTATTTTTTTTTTTTTAAATATTTATGCTGTGTTTTCCCCGGGTGTAGTACAAATGTGCGAGATCGTCGAACCACCACCACCCCCACCTCGCGAGACTC

Then, we can just effectively ignore everything from bp 29 onwards? I am asking this in relation to this biostars post on STARsolo. Thank you!

Then, we can just effectively ignore everything from bp 29 onwards?

Yes.
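
Applied to the read posted in the question, and assuming the v3 layout of a 16 bp cell barcode followed by a 12 bp UMI, the split looks like this (a small illustrative Python snippet; the variable names are arbitrary):

```python
# R1 read copied from the question above.
r1 = "GNAACATGTTATAGCCTGGAATATCAGATTATTGTATATCATAAGTAGTCTCTATTTTTTTTTTTTTAAATATTTATGCTGTGTTTTCCCCGGGTGTAGTACAAATGTGCGAGATCGTCGAACCACCACCACCCCCACCTCGCGAGACTC"

cell_barcode = r1[:16]   # bp 1-16
umi = r1[16:28]          # bp 17-28
ignored = r1[28:]        # bp 29 onwards, not used for CB/UMI

print(cell_barcode)  # GNAACATGTTATAGCC
print(umi)           # TGGAATATCAGA
```

The N at the second position is an uncalled base; whether a barcode like this is rescued depends on how the tool in use matches barcodes against its whitelist.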
