from BAM to BED passing cell barcodes
Entering edit mode
2.3 years ago
BrunoGiotti ▴ 110

Hello all,

It seems that different flavors of this question have been already addressed yet i couldn't find this one in particular. I have been trying to understand how to covert BAM files (from 10x scRNA), which have been subset for a genomic locus of interest, to BED files with rows giving the position of each read and adding cell barcodes and UMI as additional columns (and possibly others like CIGAR codes). The reason to do that is to make it easier to subset / combine the BED files using multiple vectors of cell barcodes for downstream analyses. I figured that barcodes and UMI are attached to each BAM entry with tags but i can't understand how to pass them in the BED conversion with tools like bedtools bamtobed. It seems one way to go is to move them to the BAM headers but then i would still have to split them in the BED?

Any enlightenment is very much appreciated,


BAM BED scRNA barcode UMI • 1.7k views
Entering edit mode

I don't know if this is a good approach but you could try bamtobed from bedtools.

Entering edit mode
2.3 years ago

You can try quick and dirty approach, like using awk.

samtools view outs/possorted_bam.bam |head -2

D00733:486:CE5HCANXX:8:2109:16462:81597 1123    chr1    9998    0       50M     =       10029   81      CCATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC      3:3<AE;=@@GGB>0E@11E>F1EB0C1:@1::CDG1EGGGG1<<F>1<:      NM:i:1  MD:Z:1G48
MC:Z:50M        AS:i:48 XS:i:50 CR:Z:AAACGAACACACATGT   CY:Z:<33:@;/;/00//00@   CB:Z:AAACGAACACACATGT-1 BC:Z:AGGCTACC   QT:Z:3A30A1?F   GP:i:9997       MP:i:10078      MQ:i:0  RG:Z:High:MissingLibrary:1:CE5HCANXX:8
D00733:486:CE5HCANXX:8:1316:13949:92544 163     chr1    9998    0       50M     =       10016   68      CCATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC      BCBBBCGGGGBGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGEFGGGGB      NM:i:1  MD:Z:1G48
MC:Z:50M        AS:i:48 XS:i:50 CR:Z:CTCTACGGTTAGGAGC   CY:Z:<?BB>G=BCG<G>DB0   CB:Z:CTCTACGGTTAGGAGC-1 BC:Z:CTAGCTGT   QT:Z:BBC??FF@   GP:i:9997       MP:i:10065      MQ:i:0  RG:Z:High:MissingLibrary:1:CE5HCANXX:8-46FD2F59

So you can try to get the desired part (Assuming the CB tag stores the barcode information):

samtools view outs/possorted_bam.bam | head | awk -v OFS="\t" -F" " '{ print $3,$4,$4,$1";"$19}'

chr1    9998    9998    D00733:486:CE5HCANXX:8:2109:16462:81597;CB:Z:AAACGAACACACATGT-1
chr1    9998    9998    D00733:486:CE5HCANXX:8:1316:13949:92544;CB:Z:CTCTACGGTTAGGAGC-1
chr1    9998    9998    D00733:486:CE5HCANXX:8:1315:2269:7307;CB:Z:ACAGGCCTCGTCCCTA-1
chr1    9998    9998    D00733:486:CE5HCANXX:8:2206:20350:83100;CB:Z:GAAACAAGTACGGATG-1
chr1    9998    9998    D00733:486:CE5HCANXX:8:2104:3289:54151;CB:Z:GTAGACTTCAGAGTGG-1
chr1    9999    9999    D00733:486:CE5HCANXX:8:2211:2487:73034;CB:Z:TCCATCGCATACCCGG-1
chr1    9999    9999    D00733:486:CE5HCANXX:8:2201:15714:3416;CB:Z:TGCCTCAAGTCGAAAT-1
chr1    9999    9999    D00733:486:CE5HCANXX:8:2201:11407:27253;CB:Z:CCTGCTAGTAGCTGTT-1
chr1    9999    9999    D00733:486:CE5HCANXX:8:1105:15836:96636;CB:Z:AGCCAGCCACATTGCA-1
chr1    9999    9999    D00733:486:CE5HCANXX:8:2302:12853:92500;CB:Z:CGTTCCATCTGCGTCT-1

Last coulmn is readname;barcode.

A better approach is to look for CB:Z: tag using perl ( look for CB:Z: tag and print all the following characters until you hit a tab).

samtools view outs/possorted_bam.bam | head | 
perl -nle '@reads=split(/\t/,$_); if (m/CB:Z:([^\t\n]+)\t/) { print "$reads[2]\t","$reads[3]\t","$reads[3]\t","$reads[0]_",$1; }

chr1    9998    9998    D00733:486:CE5HCANXX:8:2109:16462:81597_AAACGAACACACATGT-1
chr1    9998    9998    D00733:486:CE5HCANXX:8:1316:13949:92544_CTCTACGGTTAGGAGC-1
chr1    9998    9998    D00733:486:CE5HCANXX:8:1315:2269:7307_ACAGGCCTCGTCCCTA-1
chr1    9998    9998    D00733:486:CE5HCANXX:8:2206:20350:83100_GAAACAAGTACGGATG-1
chr1    9998    9998    D00733:486:CE5HCANXX:8:2104:3289:54151_GTAGACTTCAGAGTGG-1
chr1    9999    9999    D00733:486:CE5HCANXX:8:2211:2487:73034_TCCATCGCATACCCGG-1
chr1    9999    9999    D00733:486:CE5HCANXX:8:2201:15714:3416_TGCCTCAAGTCGAAAT-1
chr1    9999    9999    D00733:486:CE5HCANXX:8:2201:11407:27253_CCTGCTAGTAGCTGTT-1
chr1    9999    9999    D00733:486:CE5HCANXX:8:1105:15836:96636_AGCCAGCCACATTGCA-1
chr1    9999    9999    D00733:486:CE5HCANXX:8:2302:12853:92500_CGTTCCATCTGCGTCT-1

If you want to have more control like extending read position based on strand or selecting reads with any specific features, you have to use pysam or something similar so you can print the desired output.

Entering edit mode

That's awesome! I'll look into pysam to try to add also CIGAR and strand then. Btw what is the awk function doing there?

Entering edit mode

If you want to add them for all reads just print additional appropriate fields ($6 for CIGAR).

awk is splitting the SAM record into individual fields and then printing some of them.

Entering edit mode

In facts, it seems that in some instances it maps some other tags, probably because there are missing entries. Is there an easy way out for this or i will have to look into pysam?

samtools view 2B1_ACE2.bam | awk -v OFS="\t" -F" " '{ print $3,$4,$4,$1";"$19}'
X   15351304    15351304    A00198:47:H5L7HDMXX:1:1177:27570:15483;CR:Z:TACAGTGCAGTGGGAT
X   15354190    15354190    A00198:47:H5L7HDMXX:2:2354:3242:22232;CR:Z:TTTATGCGTCAATACC
X   15390922    15390922    A00198:47:H5L7HDMXX:2:1260:15980:21778;CR:Z:CCTAAAGTCGGCCGAT
X   15420061    15420061    A00198:47:H5L7HDMXX:2:2230:12328:23594;CR:Z:TCGGTAATCTGCCAGG
X   15470781    15470781    A00198:47:H5L7HDMXX:2:1205:4915:5588;CR:Z:GAGTCCGTCTCAAACG
X   15473713    15473713    A00198:47:H5L7HDMXX:1:2129:27145:29590;CR:Z:TTGAACGTCCCTTGCA
X   15495326    15495326    A00198:47:H5L7HDMXX:2:2217:32434:7654;CR:Z:ACGCCAGGTCGAACAG
X   15495326    15495326    A00198:47:H5L7HDMXX:1:1463:5864:7545;CR:Z:CTGGTCTCAGGTCGTC
X   15495326    15495326    A00198:47:H5L7HDMXX:1:2358:23194:33771;CR:Z:GATGAAAGTCGAACAG
X   15495326    15495326    A00198:47:H5L7HDMXX:1:1319:4074:31532;CR:Z:TACAGTGAGCCGCCTA
X   15498847    15498847    A00198:47:H5L7HDMXX:1:2183:21920:10708;CR:Z:ACTTGTTCGCTTGTTG
X   15548279    15548279    A00198:47:H5L7HDMXX:1:2244:30626:11193;CR:Z:AGAGCTTGTTTGGCGC
X   15555411    15555411    A00198:47:H5L7HDMXX:2:1148:4390:31861;CR:Z:GTATTCTAGAGCTGGT
X   15558836    15558836    A00198:47:H5L7HDMXX:1:2175:26549:22044;CR:Z:CGGGTCATCTAACCGA
X   15561932    15561932    A00198:47:H5L7HDMXX:2:1232:17155:5932;RE:A:E
X   15564032    15564032    A00198:47:H5L7HDMXX:2:2210:7762:24799;RE:A:E
Entering edit mode

yes. Thats why you need pysam. let me show you another simpler solution. wait

Entering edit mode

I updated with another approach

Entering edit mode

Wait...why start and end position are identical? Aren't column 2 and 3 supposed to be start and end?

Entering edit mode

@geek_y is printing the same field $4 twice that is why the number is same. See the original answer for updated code. Try with your own files.

Entering edit mode

The new approach enter in a prompt without outputting anything. I'm not sure i have perl installed, Im working on a cluster..

Entering edit mode
2.3 years ago
BrunoGiotti ▴ 110

I found a way in converting BAM to BED-like format using GenomicAlignments package in R, which does exactly what I was looking for:

library (GenomicAlignments)
bam = BamFile ('test.bam') 
tag <- c("UB","CB") # get UMI and CELL BARCODE
param <- ScanBamParam (tag=tag)
bam_df = GenomicAlignments::readGAlignments ('test.bam', use.names=TRUE, param=param)
bam [1:3,]
GAlignments object with 3 alignments and 3 metadata columns:
                                         seqnames strand           cigar
                                            <Rle>  <Rle>     <character>
   A00198:47:H5L7HDMXX:1:2144:20853:7952        X      +   66M483416N24M
  A00198:47:H5L7HDMXX:1:2473:29649:13135        X      + 24M428249N64M2S
  A00198:47:H5L7HDMXX:2:2332:22571:31688        X      - 1S51M524470N38M
                                            qwidth     start       end
                                         <integer> <integer> <integer>
   A00198:47:H5L7HDMXX:1:2144:20853:7952        90  15148987  15632492
  A00198:47:H5L7HDMXX:1:2473:29649:13135        90  15196679  15625015
  A00198:47:H5L7HDMXX:2:2332:22571:31688        90  15209732  15734290
                                             width     njunc |    rname
                                         <integer> <integer> | <factor>
   A00198:47:H5L7HDMXX:1:2144:20853:7952    483506         1 |        X
  A00198:47:H5L7HDMXX:1:2473:29649:13135    428337         1 |        X
  A00198:47:H5L7HDMXX:2:2332:22571:31688    524559         1 |        X
                                                  UB                 CB
                                         <character>        <character>
   A00198:47:H5L7HDMXX:1:2144:20853:7952  CGTTGTACTG CTAACTTGTGCATCTA-1
  A00198:47:H5L7HDMXX:1:2473:29649:13135  ATGGTAATTG TTGACTTTCCGCAAGC-1
  A00198:47:H5L7HDMXX:2:2332:22571:31688  GTGTACGCCA AAGGAGCCATGCCTAA-1

Thanks for your feedbacks!


Login before adding your answer.

Traffic: 1388 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6