Question

umi_tools count omitting majority of reads

0


4.2 years ago

paulranum11 ▴ 80

I am trying to use UMI-tools to create a genes x counts matrix for a single-cell RNA-Seq dataset. This can be done using the umi_tools count command.

Below are two (SAM-formatted) lines from my input file. Read assignment status is stored in the XS tag and the gene ID in the XT tag.

A00303:172:HKJVWDRXX:2:2265:15555:10614_AGCATTCGAGATCGCAAATCCGTCATCCAAGATCGCAGTGGCC_CGATCGGGAA  16  1   4912891 3   109M42S *   0   0   TGACTGTCCTGGAACTCACTCTGTAGACCAGGCTGGCCTCGAATTCAGAAATCCACCTGCCTCTGCCTCCCAAGTGCTGGGATTAAAGGCATGTGCCACCACTGTCCGGTGAAACTGGGAGTTTTAACCAACTCCACTTGCTCTACTGGGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:2  HI:i:2  AS:i:99 nM:i:4  XS:Z:Assigned   XN:i:1  XT:Z:Rgs20
A00303:172:HKJVWDRXX:1:2110:16034:12633_AGCATTCGTATCAGCAGGAGAACAATCCAACAGCAGAGTGGCC_CACAATTGGC  272 1   5267364 0   28S121M *   0   0   GCCAGAGCATTCGTATCAGCATTTTTTTTTTTTTGTGTTTAGGAAATTGTATCTTAGATCTTGGGTATCTTAGGTTTTGGGCTAATATCCACTTATCAGTGAGTACATATTGTGTGAGTTCCTTTGTGAATGTGTTACCTCACTCAGGA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFF:FFFFFFFF,FFFF,FFFFFFFFFFFFFFFFFFFFFFF:F,FFFFF:FFF:FFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF:   NH:i:7  HI:i:4  AS:i:113    nM:i:3  XS:Z:Unassigned_NoFeatures

I know the dataset has the following numbers of assigned and unassigned reads:

Total reads: 8449032
Assigned reads: 7446773
Unassigned reads: 619281

To count this dataset I am using the following command:

umi_tools count --wide-format-cell-counts --per-gene --gene-tag=XT --assigned-status-tag=XS --per-cell -I assigned_sorted.bam -S counts.tsv.gz

However, I get the following output, with only ~30,000 reads tallied.

INFO Input Reads: 8449032

INFO Read skipped, no tag: 1002259

INFO Number of reads counted: 30971

Does anyone know how I can improve the percentage of assigned reads that are tallied using umi_tools count?

RNA-Seq sequencing • 1.7k views


0


When I add the --ignore-umi option, I get a similarly low number of reads (22,727). I will try to extract the number of counts and unique UMIs to confirm that this is not the issue.
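One way to run that check is sketched below: a small Python function that tallies assigned reads, distinct (cell, gene) pairs, and distinct UMIs per (cell, gene) from SAM text. It assumes the read-name layout seen in the two example lines (`<read id>_<cell barcode>_<UMI>`), tab-delimited SAM fields, and that only reads whose XS tag is exactly `Assigned` should be counted. The function name `tally_sam` is made up for illustration, and the sketch does not reproduce the UMI collapsing that umi_tools performs.

```python
from collections import defaultdict


def tally_sam(lines, gene_tag="XT", status_tag="XS"):
    """Tally assigned reads, (cell, gene) pairs, and unique UMIs from SAM text lines."""
    reads = defaultdict(int)   # (cell, gene) -> number of assigned reads
    umis = defaultdict(set)    # (cell, gene) -> distinct UMI sequences seen
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        # Optional tags start at column 12 and look like TAG:TYPE:VALUE.
        tags = {f[:2]: f.split(":", 2)[2] for f in fields[11:] if f.count(":") >= 2}
        # Assumption: only the literal status "Assigned" marks a countable read.
        if tags.get(status_tag) != "Assigned" or gene_tag not in tags:
            continue
        # Assumption: read names end in _<cell barcode>_<UMI>.
        _, cell, umi = fields[0].rsplit("_", 2)
        key = (cell, tags[gene_tag])
        reads[key] += 1
        umis[key].add(umi)
    return {
        "assigned_reads": sum(reads.values()),
        "cell_gene_pairs": len(reads),
        "unique_umis": sum(len(s) for s in umis.values()),
    }
```

Calling `tally_sam(sys.stdin)` from a script fed by `samtools view assigned_sorted.bam` would give the three totals. If `unique_umis` is close to the number of reads umi_tools counts, the low figure reflects deduplication rather than lost reads.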

Comment • 4.2 years ago by paulranum11 ▴ 80

0


When you use --ignore-umi, only one read is kept per position. Because your "positions" here are genes rather than coordinates, only one read per gene is kept.

Looking at the number of unique UMIs will help, but will not completely solve your problem, because UMI-tools collapses UMIs that are different if two criteria are met: 1) the two UMIs differ at no more than (by default) 1 position, and 2) the number of reads with UMI1 is roughly twice or more that of UMI2.
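As a rough illustration of those two criteria, here is a minimal pairwise check in Python. The helper names `hamming` and `can_absorb` are hypothetical, and the exact inequality `count1 >= 2 * count2 - 1` is my reading of the "directional" method rather than something stated in this thread. The real tool also builds a network of UMIs and merges them transitively, which this sketch does not attempt.

```python
def hamming(a, b):
    """Number of positions at which two equal-length UMIs differ."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def can_absorb(umi1, count1, umi2, count2, threshold=1):
    """True if umi2 could be collapsed into umi1 under the two criteria.

    Criterion 1: the UMIs differ at no more than `threshold` positions.
    Criterion 2 (assumed form): count1 >= 2 * count2 - 1.
    """
    return hamming(umi1, umi2) <= threshold and count1 >= 2 * count2 - 1


# One mismatch, 10 reads vs 2 reads: the rarer UMI looks like a sequencing error.
print(can_absorb("CGATCGGGAA", 10, "CGATCGGGAT", 2))
# One mismatch, but similar read counts: both are kept as separate molecules.
print(can_absorb("CGATCGGGAA", 10, "CGATCGGGAT", 8))
```

Because of this merging, the count reported by umi_tools can be somewhat lower than a plain tally of distinct UMIs.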

The assigned-reads part looks right to me: 8,449,032 total reads, of which 7,446,773 are assigned, leaves 1,002,259 (approx. 1 million) unassigned, which is exactly what UMI-tools reports as "Read skipped, no tag".

Reply • 4.2 years ago by i.sudbery 19k

score 1 · Answer 1 · 2020-02-18

1


4.2 years ago

swbarnes2 14k

umi_tools is supposed to remove reads based on their UMI. Are you sure that the program is not doing exactly what it should be doing?