How can I count the number of UMIs and reads at each position after mapping with a reference sequence?
1
0
Entering edit mode
3.8 years ago
naeem40thju ▴ 10

Hi, I have collected my HTS data (single-end) of E.coli ribosome (full) using the Illumina platform. I found UMI-tools is very interesting and useful. I have used 18nt random barcode at 5'-end for avoiding the read duplication. I want to count the number of UMIs and reads at each position after mapping with a reference sequence. I have read the manual of UMI-tools, but couldn't figure out the solution: can you please suggest me how can I proceed. I'm providing an example showing what is my aim and how much I have understood:

Say, I have extracted the random barcode (18nt) from the 5'- end of each reads at the head ('_' seperated) like below using UMI-tools. Then I'll do mapping with the reference sequence using bowtie -2 . Now, I want to count the number of reads at each position of the reference and the barcodes which were unique to those reads from the SAM/BAM file. That means, I want to get the number of molecules at each position and their UMIs. For example, if I get 100 reads at 15th position and those 100 reads contained 75 types of unique barcodes, e.g., I want to get the number of reads (100) and unique barcodes (75) at each position (here 15th).

@ST-E00205:943:HCF3YCCX2:4:1101:11495:1678_CCAGCCCAAAGCCACCCG 1:N:0:NCCACGCG+NGATCTCG ACCGGATGGTAGACCTGGAGGAGGGGAAAGCCGAGGTGGTGACGGGAGCGGCTGGGGGGGGAGTCCGGGATGGTAGGCGGAGCGGGCAGAGCACAGCAGCTCGTGTAGAAATGG
+ 
7-<--7--7-7F-----77----7---7-------------------7----77-7-----7------7---------7-7------7--7----77----------77-7---
next-gen sequencing • 2.0k views
ADD COMMENT
1
Entering edit mode
3.8 years ago

After alignment, you should be able to deduplicate your BAM file using umi_tools dedup.

Can should then be able to use something like bedtools genomeCoverage -5 to calculate the mapping depth (and therefore the number of unique UMIs) at each position.

ADD COMMENT
0
Entering edit mode

Sorry, you need a -bga on that command as well.

(so bedtools genomecov -5 -bga) The output will then be:

  1. Chromsome (as in E. coli I'm guess the chromosome labeled "genome" is the main E. coli circular genome"
  2. Start Coordinate
  3. End Coordinate (always start coordinate + 1)
  4. Number of reads that start at this position.

because we have deduplicated the bam, so that only one copy of each read starting at a given location and with a particular UMI is kept, the column 4 - the number of reads that start at that position IS the number of UMIs at that position.

ADD REPLY
0
Entering edit mode

Thank you very much. I have understood. [Sorry, I have lost my password of naeem40thju and even couldn't retrieve it].

ADD REPLY
0
Entering edit mode

Hi, after deduplicating I used

bedtools genomecov -5 -bga -ibam xxxx_dedupe.bam

command and I found the output as follows: 1. Chromosome; 2. Start coordinate 3. End coordinate 4. Number of reads/UMIs.

That means at 39th position (BB) there is no UMIs and at 40th position, there are 6 UMIs and so on, isn't it? I am a little confused: In case of all chr (BB, AA, DA), in the beginning, up to 39/38th position there is no aligned reads/UMIs, how is it possible? I have checked the dedupe.bam file, there are aligned reads in 1-37/38th positions. What's wrong here actually? Thanks.

BB  0   39  0
BB  39  40  6
BB  40  41  3
BB  41  42  11
...........................
...........................
AA  0   38  0
AA  38  39  16
AA  39  40  10
AA  40  41  44
............................
...........................
DA  0   38  0
DA  38  39  29
DA  39  40  3
DA  40  41  8
............................
............................
ADD REPLY
0
Entering edit mode

Thank you very much for your reply. I had taken 25K reads as a sample run. The bedtools genomecov -5 command gives the output as follows: According to bedtools manual:

1st column is chromosome (in case of me, BB: 5S, AA: 16S, DA: 23 ribosomal subunits). I am not sure what does genome means at the bottom. 2n column: depth of coverage (why 0?) 3rd column: number of bases on chromosome

Actually, I wanted to get the number of total UMIs and aligned reads at each position. I apologies, if my enquiry is very ordinary- I'm totally new in analysing high-throughput sequencing data.

BB      0       117     120     0.975
BB      1       2       120     0.0166667
BB      3       1       120     0.00833333
AA      0       1356    1538    0.881665
AA      1       138     1538    0.0897269
AA      2       30      1538    0.0195059
AA      3       7       1538    0.00455137
AA      4       1       1538    0.000650195
AA      5       2       1538    0.00130039
AA      6       1       1538    0.000650195
AA      12      1       1538    0.000650195
AA      15      1       1538    0.000650195
AA      234     1       1538    0.000650195
DA      0       2605    2914    0.89396
DA      1       226     2914    0.0775566
DA      2       61      2914    0.0209334
DA      3       12      2914    0.00411805
DA      4       3       2914    0.00102951
DA      5       1       2914    0.000343171
DA      6       1       2914    0.000343171
DA      7       1       2914    0.000343171
DA      29      1       2914    0.000343171
DA      50      1       2914    0.000343171
DA      56      1       2914    0.000343171
DA      63      1       2914    0.000343171
genome  0       4078    4572    0.891951
genome  1       366     4572    0.0800525
genome  2       91      4572    0.0199038
genome  3       20      4572    0.00437445
genome  4       4       4572    0.000874891
genome  5       3       4572    0.000656168
genome  6       2       4572    0.000437445
genome  7       1       4572    0.000218723
genome  12      1       4572    0.000218723
ADD REPLY

Login before adding your answer.

Traffic: 2442 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6