Question

Extract the true single-cell RNA sequencing reads for running SAHMI

0

8 months ago

573704669 • 0

Hello everyone! I’m currently using the SAHMI pipeline to annotate microbiome information from the single-cell data. However, I encountered two potential problems when applying SAHMI to 10X scRNA data:

The first is that SAHMI passes both paired reads to kraken2, which computes k-mers from the sequencing reads it is given. However, only R2 contains the true sequence, while R1 contains only the barcode and UMI, so I don’t think it is appropriate to compute k-mers from both reads. I therefore modified the script to use only R2 for the k-mer calculation, while still using R1 to extract the barcode and UMI.
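
A minimal sketch of that idea (not SAHMI's actual code): read R1 and R2 in lockstep, take the barcode and UMI from R1, and attach them to the R2 header so that only the R2 sequence goes to classification. The 16 bp barcode and 12 bp UMI match the 28 bp R1 of 3’ v3 (the UMI is 10 bp for v2); the function names and the `CB:`/`UMI:` header tags are hypothetical choices for illustration.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def read_name(header):
    """Return the read name without the description or a trailing /1 or /2."""
    name = header.split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def tag_r2_with_barcode(r1, r2, barcode_len=16, umi_len=12):
    """Yield R2 records whose header carries the barcode and UMI from R1.

    Assumes R1 and R2 list the same reads in the same order. zip() stops at
    the shorter file, so truncated inputs are not detected here.
    """
    for (h1, s1, _), (h2, s2, q2) in zip(read_fastq(r1), read_fastq(r2)):
        if read_name(h1) != read_name(h2):
            raise ValueError(f"R1/R2 out of sync: {h1!r} vs {h2!r}")
        barcode = s1[:barcode_len]
        umi = s1[barcode_len:barcode_len + umi_len]
        # Hypothetical tag format; adapt it to whatever the downstream step expects.
        yield f"{read_name(h2)} CB:{barcode} UMI:{umi}", s2, q2


# Tiny in-memory example with one read pair.
demo_r1 = io.StringIO("@read1\n" + "A" * 16 + "C" * 12 + "\n+\n" + "I" * 28 + "\n")
demo_r2 = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n")
tagged = list(tag_r2_with_barcode(demo_r1, demo_r2))
```

With real data the two handles would be the opened R1 and R2 FASTQ files, and the tagged R2 records would be written out as the single-end input for classification.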

The second is that, according to the 10X documentation, 3’ v3 libraries have 28 bp of barcode and UMI in R1 and 91 bp of true sequence in R2, while 3’ v2 libraries have 26 bp in R1 and 98 bp in R2. However, most of the R1 and R2 FASTQ files I obtained from SRA are 150 bp long, so they also include other bases, such as index information, in addition to the true reads. I therefore want to extract the true sequencing reads to make the microbiome annotation more accurate.
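
To see what a given download actually contains, the read lengths can be tallied directly from the FASTQ. This is only a sketch, assuming an uncompressed 4-line-per-record FASTQ; `read_length_counts` and the `max_reads` cap are names I made up for the example.

```python
import io
from collections import Counter


def read_length_counts(handle, max_reads=100_000):
    """Count sequence lengths over the first `max_reads` FASTQ records."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i // 4 >= max_reads:
            break
        if i % 4 == 1:  # the second line of each record is the sequence
            counts[len(line.rstrip("\n"))] += 1
    return counts


# In-memory example: one 28 bp read and one 150 bp read.
demo = io.StringIO(
    "@a\n" + "A" * 28 + "\n+\n" + "I" * 28 + "\n"
    "@b\n" + "C" * 150 + "\n+\n" + "I" * 150 + "\n"
)
length_counts = read_length_counts(demo)
```

Running this on R1 and R2 separately shows whether R1 is 26/28 bp or longer, and whether R2 was sequenced to 91/98 bp or to 150 bp.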

I wrote a script to extract the first 91/98 bp from the R2 FASTQ files. Here is an example from SRR19159061, which is small and easy to process.
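
The trimming step itself can be sketched as follows. This is an illustration rather than my exact script: `trim_fastq` is a hypothetical helper that works on open text handles and keeps the first `keep` bases and quality values of every record, and the file names in the closing comment are placeholders.

```python
import io


def trim_fastq(fin, fout, keep=91):
    """Copy FASTQ records from fin to fout, keeping the first `keep` bases.

    Returns the number of records written. Headers are copied unchanged, so a
    'length=150' annotation from SRA would no longer match the trimmed read.
    """
    n_records = 0
    while True:
        header = fin.readline()
        if not header:
            break
        seq = fin.readline().rstrip("\n")
        fin.readline()  # skip the '+' separator line
        qual = fin.readline().rstrip("\n")
        fout.write(f"{header.rstrip()}\n{seq[:keep]}\n+\n{qual[:keep]}\n")
        n_records += 1
    return n_records


# In-memory example: a 120 bp read trimmed to 91 bp.
demo_in = io.StringIO("@r1\n" + "ACGT" * 30 + "\n+\n" + "I" * 120 + "\n")
demo_out = io.StringIO()
n_written = trim_fastq(demo_in, demo_out, keep=91)

# For gzipped files on disk (placeholder file names):
# import gzip
# with gzip.open("R2.fastq.gz", "rt") as fin, gzip.open("R2.trim91.fastq.gz", "wt") as fout:
#     trim_fastq(fin, fout, keep=91)
```

Use `keep=98` for 3’ v2 data. Reads already shorter than `keep` pass through unchanged.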

SRR19159061 has 28 bp in the R1 FASTQ, so I extracted only the first 91 bp from the R2 FASTQ. I then ran cellranger v7.1.0 to compare the results before and after trimming. The results are almost the same, but there are still some differences. Since this run has four FASTQ files (I1, I2, R1, R2), I replaced only R2 with the trimmed version before running cellranger again. I would like to ask whether this discrepancy is normal and whether I can ignore it. After all, I only need the true sequences to annotate the microbiome, but I still need to give cellranger the original FASTQ files to get gene expression. I sincerely appreciate any answers.

Thank you!

The first image is before trimming, and the second is after trimming.

scRNA • 951 views

ADD COMMENT • link 8 months ago by 573704669 • 0

score 1 · Answer 1 · 2023-08-14

1

8 months ago

GenoMax 141k

This looks fine. This submission appears to use a dual-index 10x kit. You may even be able to use the full 150 bp, as long as you are not seeing poly-A stretches in your reads; those would indicate that the reads are running into the 10x adapter at the 3' end.
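
A rough way to check for this is to count how many R2 reads contain a long run of A's. The sketch below assumes an uncompressed 4-line-per-record FASTQ; the function name and the 20-base threshold are arbitrary choices for illustration, not a 10x recommendation.

```python
import io
import re


def polya_fraction(handle, min_run=20):
    """Return the fraction of FASTQ reads containing >= min_run consecutive A's."""
    pattern = re.compile("A{%d,}" % min_run)
    total = 0
    hits = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            total += 1
            if pattern.search(line):
                hits += 1
    return hits / total if total else 0.0


# In-memory example: the first read ends in a 25 bp poly-A run, the second has none.
demo = io.StringIO(
    "@a\n" + "ACGT" * 10 + "A" * 25 + "\n+\n" + "I" * 65 + "\n"
    "@b\n" + "ACGT" * 10 + "\n+\n" + "I" * 40 + "\n"
)
fraction_with_polya = polya_fraction(demo)
```

A high fraction in the 150 bp R2 reads would suggest that many reads run past the end of the insert, in which case trimming to the recommended length is the safer choice.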

ADD COMMENT • link 8 months ago by GenoMax 141k

0

Excuse me, could you please explain in more detail? Does that mean that if I don't find poly-A in the R2 reads, I can treat the entire 150 bp as true sequence? And what did you mean by 'reads are reading into the 10x adapter on the 3' end'? Thank you!

ADD REPLY • link 8 months ago by 573704669 • 0

0

If you look at the structure of 10x libraries, you will see that R2 (which is on the bottom strand) will at some point start running into the adapter (labeled PolydT(VN)). Ref: https://kb.10xgenomics.com/hc/en-us/articles/360035999892-What-is-the-structure-of-the-final-Visium-for-fresh-frozen-library-

ADD REPLY • link 8 months ago by GenoMax 141k

0

Thank you! I have looked up the relevant documentation, and this is what I understood: the role of Poly(dT)VN is to link the captured RNA fragments to the primers of the sequencing platform, so that the platform can recognize them. This way, scRNA sequencing can capture both all types and specific types of RNA in cells. Poly(dT)VN may therefore help cellranger distinguish UMIs from different cells and different genes more accurately. However, the actual sequencing data is still the first 91/98 bp. I think this also explains why R2 was 130 bp long when I used bamtofastq to convert 10X BAM files to FASTQ: those bases still contained the Poly(dT)VN sequence. So it makes sense to give cellranger the R2 that contains Poly(dT)VN, and if I want the real sequenced bases, manually extracting the first 91/98 bp is enough. Is my understanding correct?

ADD REPLY • link 8 months ago by 573704669 • 0

0

I am not sure why these submitters sequenced 150 bp. Generally 10x recommends only 90 or 98 bp. Go with 91 or 98 bp. It should not change the end result of your analysis.

ADD REPLY • link 8 months ago by GenoMax 141k

0

Hi, I understand now. In fact, the R2 reads consist entirely of true sequence. A library has to be constructed before sequencing, and during reverse transcription of the mRNA, regardless of which reverse transcriptase is used, the resulting cDNA should be much longer than 150 bp. 10X suggests sequencing 91 or 98 bases, but in many cases the sequencing provider runs 2x150 on the Illumina platform. Therefore, all the bases in R2 are true cDNA sequence. I also realize that I need to learn more basic molecular biology. Of course, I will follow your suggestion and use only the first 91/98 bp recommended by 10X. Thank you very much for your reply!

ADD REPLY • link 8 months ago by 573704669 • 0