UMI deduplication and output

Hello,

I have a single-cell protocol that uses UMIs (6 nt long). I have an individual FASTQ file for each cell, sequenced single-end at 150 bp using the Nextera library prep. I am trying to extract the UMI information but have several questions.

1) When fragmentation occurs after PCR amplification, how is it possible to deduplicate the reads, given that some reads will no longer be attached to a UMI even though they originally came from a molecule that carried one? Are the reads with no UMI completely discarded from the analysis?

2) Using a regex I was able to extract the UMIs. I then trimmed, mapped, counted genes with featureCounts, and ran umi_tools dedup. From there it is not clear to me what the output actually means. Out of 4509 UMIs, 2182 needed to be deduplicated? What does this tell me about my library?

Output from umi_tools extract:

INFO Input Reads: 31466543
INFO regex does not match read1: 29855051
INFO regex matches read1: 1611492
INFO Reads output: 1611492

Output from featureCounts:

Assigned    3868
Unassigned_Unmapped 221558

Output from umi_tools dedup:

INFO total_umis 4509
INFO #umis 1068
INFO Reads: Input Reads: 4509
INFO Number of reads out: 3212
INFO Total number of positions deduplicated: 2182
INFO Mean number of unique UMIs per position: 1.71
INFO Max. number of unique UMIs per position: 51
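
For context, the general shape of the pipeline was along these lines (a sketch only; the regex, aligner, and file names below are placeholders, not the exact commands):

# sketch: extract the 6 nt UMI into the read name (placeholder pattern)
umi_tools extract --extract-method=regex \
    --bc-pattern='(?P<umi_1>.{6})' \
    -I cell1.fastq.gz -S cell1.umi.fastq.gz
# adapter/quality trimming with the tool of choice, then mapping, e.g.:
bwa mem genome.fa cell1.umi.trimmed.fastq.gz | samtools sort -o cell1.bam
samtools index cell1.bam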

Thank you for your help

EDIT:

I am using the MATQ-Seq protocol. Sadly, the paper is not very clear about what is done with the UMIs in this protocol, nor are the scripts available.

I am also working on single-cell bacteria, for which a low mapping rate is fairly normal; I usually get about 10% of my total reads mapping to the genome.


Are you sure your UMI is still part of the read? I ask because you mentioned that you have 150 bp single-end reads, and your data is already split into cell-specific FASTQs.

Several single-cell protocols I know of (e.g. the CEL-seq2 protocol) use paired-end sequencing, where one read of the pair contains exclusively the cell barcode and UMI while the second read comprises the actual cDNA sequence. So, to me, it seems likely that your data has already been demultiplexed, and perhaps already deduplicated?

Are you aware of this dedicated umi-tools guide for single-cell data?


For the first part of your question:

Fragmentation after PCR is common in 3' end tagging protocols.

Here, a special primer containing an adaptor sequence, a UMI, a cell barcode, and then an oligo(dT) stretch is used during reverse transcription. This means that the cDNA always contains the 3' end of the transcript, and every transcript that is reverse transcribed has a UMI and cell barcode attached. PCR is then undertaken, followed by fragmentation and ligation of an adaptor to the 5' end of the fragment. Libraries are then prepared using the adaptor at the 3' end of the transcript (which is followed by the UMI) and the 5' adaptor. Paired-end sequencing is undertaken, where one read contains the UMI plus mostly oligo(dT) sequence, and the other contains the gene-specific sequence.

Note that a unique 5' end position is not useful information here: because fragmentation happened after PCR, reads starting at different coordinates can still have arisen from the same RNA molecule. The UMI, however, should be diagnostic of which RNA a read pair came from.

Note, this does not use the Nextera library prep.

The only single-cell library preps I've seen that use Nextera do not use a UMI.

For the second part:

It's a little worrying that you start with 31 million reads but only find your UMI in 1.6 million of them (5% - but see above: you are right that if the UMI is attached at RT but the library is made with Nextera, most reads won't have UMIs).

Then it appears that after trimming and mapping you only have ~225,000 reads, of which only ~4,000 are assigned to genes (~2%).

The output from dedup shows that you had ~4,500 reads in total, and after deduplication you had ~3,200 reads. As ~2,200 reads had to be deduplicated, that means you had ~2,300 that didn't need deduplicating, while the remaining ~2,200 had between 2 and 51 unique UMIs per position and were collapsed into ~900 reads.

This is just logging output, though; the real output is the BAM file written by dedup. I don't see the commands in your post, but if you are doing fragmentation after PCR, then you need to use the --per-gene switch with dedup.
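
For example (a sketch, not your actual commands; it assumes featureCounts is run with -R BAM so that reads carry an XT gene tag, and all file names are placeholders):

# write a BAM where each read is tagged with its assigned gene
featureCounts -a genes.gtf -o counts.txt -R BAM cell1.bam
samtools sort cell1.bam.featureCounts.bam -o cell1.assigned.bam
samtools index cell1.assigned.bam
# deduplicate on UMI + gene rather than UMI + mapping position
umi_tools dedup --per-gene --gene-tag=XT -I cell1.assigned.bam -S cell1.dedup.bam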

But the real problem here seems to be that you start with 31 million reads and only 4,509 of them make it into the start of the dedup process.


Thank you for your reply. It helps a lot.

I am using the MATQ-Seq protocol, coupled to the Nextera library prep, to look at single-cell bacteria. In bacterial single-cell RNA-seq it is fairly normal to get a low mapping rate with the current technology; I usually get a maximum of 20% of my reads mapping (without UMI deduplication), so I wasn't expecting a very high mapping rate if, on top of that, I only look at UMI-containing reads.

From the protocol, the UMI is at the 5' end of the read, just after the template-switching oligo (similar to Smart-Seq): ---adaptor--TSO--UMI--GGG--read-----
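
For a structure like that, the extraction pattern would be along these lines (a sketch; the TSO sequence shown here is the generic SMART TSO, which may not match MATQ-Seq exactly):

# hypothetical: discard everything up to and including the TSO,
# capture a 6 nt UMI, then discard the GGG spacer
umi_tools extract --extract-method=regex \
    --bc-pattern='^(?P<discard_1>.*AAGCAGTGGTATCAACGCAGAGT)(?P<umi_1>.{6})(?P<discard_2>GGG)' \
    -I cell1.fastq.gz -S cell1.umi.fastq.gz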

Theoretically I understand why paired-end sequencing would be necessary to collect the UMI. But in the library preparation after amplification, doesn't the UMI end up in close proximity to either the p5 or the p7 adaptor depending on the fragment orientation, and so get sequenced whenever it sits next to the p5 adaptor, suggesting that single-end sequencing would be sufficient?

5'-p5-TSO-UMI-----------p7-3' (UMI at the beginning of the read in the sequencing data)

5'-p7-OST-IMU-----------p5-3' (UMI potentially within the read, if there are enough cycles; otherwise paired-end is necessary)


Hi,

I'm not entirely sure I follow the metric 'total number of positions deduplicated'. For instance, I have a case where the read numbers don't seem to add up as nicely as in the example explained above, and I can't wrap my head around the reason.

INFO total_umis 5684016
INFO #umis 127091
INFO Searching for mates for 0 unmatched alignments
INFO 0 mates never found
INFO Reads: Input Reads: 5684016, Read pairs: 5684016, Read 2 unmapped: 291
INFO Number of reads out: 209472
INFO Total number of positions deduplicated: 135787
INFO Mean number of unique UMIs per position: 1.68
INFO Max. number of unique UMIs per position: 55

If I were to apply the logic above: there are 5,684,016 input reads, and after dedup the total is 209,472. The total number that needed to be deduplicated is 135,787, meaning 5,684,016 - 135,787 ~ 5,550,000 didn't have to be deduplicated. What happened to these five and a half million other reads, given that we only have 209,472 reads after deduplication? What am I missing or misinterpreting?


Sorry, I think my reply above is a bit confusing.

135,787 isn't the number of reads that needed deduplicating; it's the total number of positions, and a position is expected to contain more than one read.

In your case there were 135,787 positions, and between them they had roughly 228,122 (135,787 x 1.68) UMI x position combinations, which were error-corrected down to 209,472 final output reads.

Given that you had 5,684,016 reads with a UMI, that suggests you had, on average, about 27 copies of each read.
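
A quick way to check that arithmetic (numbers copied from the log above):

awk 'BEGIN {
    positions = 135787; mean_umis = 1.68
    reads_in = 5684016; reads_out = 209472
    # UMI x position combinations implied by the log
    printf "UMI x position combinations: %.0f\n", positions * mean_umis
    # average duplication: input reads per output read
    printf "average copies per output read: %.1f\n", reads_in / reads_out
}'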
