Question

Splitting BED file by technical replicate or extract all unique barcodes

4.0 years ago

summer_researcher • 0

First post to the forum, sorry I'm new at this:

I inherited a BED file whose barcodes are 32 characters long:

1) The first 8 characters are part of the cell's unique code

2) The next 8 match an entry from one of three sets (7a, 7b or 7c)

3) The next 8 characters are also part of the cell's unique code

4) The last 8 characters match one of two sets (5a or 5b)

We have six technical replicates that are defined as some combination of a 7x and a 5x code.
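
Since the four segments sit at fixed positions, the 7x and 5x parts can be read straight off each barcode by character offset. A minimal sketch, assuming a tab-delimited file with the barcode in column 4; `barcode_segments` is a made-up helper name, not part of any tool:

```shell
# barcode_segments FILE
# For every distinct 32-character barcode in column 4, print the barcode,
# its 7x segment (characters 9-16) and its 5x segment (characters 25-32),
# tab-separated. Assumes FILE is tab-delimited.
barcode_segments() {
  awk -F'\t' -v OFS='\t' '
    length($4) == 32 && !seen[$4]++ {
      print $4, substr($4, 9, 8), substr($4, 25, 8)
    }
  ' "$1"
}
```

Piping the result through `cut -f2,3 | sort | uniq -c` would tally how many barcodes carry each 7x/5x pair, which is a quick check that only the expected combinations occur.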

I did a lot of analysis with this data in Signac/Seurat this summer; I was given a filtered Seurat object to work with. Now, I'm trying to re-do the QC (most importantly, re-do the doublet analysis), and I should input the data as a separate BED file for each of the six technical replicates. This is my first time having access to the original BED file.

I have a lot of the barcodes in my Seurat object but because of filtering it doesn't have all of them. I have a simple script that sorts the barcodes into one of the six technical replicates, but I don't have a list of all the barcodes in the BED file (I'm starting to realize this is usually given separately, but I didn't inherit it).

So I guess my question is whether there is an easy way to do this splitting entirely in the terminal, and/or an easy way to extract all the unique barcodes from a BAM or BED file. (After extracting the barcodes, I know how to make six whitelists (.csv) and then split using sinto.)

My data looks like this:

chr1    10126   10175   ATTCAGAAAAGAGGCAACGTCCTGTCTTACGC    1
chr1    10150   10180   ATTACTCGAACCAGGTAGGATAGGCTCCTTAC    1
chr1    10151   10192   GAGATTCCCGTATAGAAGGATAGGACTCTAGG    1
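
One way this could be done entirely in the terminal is a single awk pass that routes every row to a per-combination file, skipping the whitelist step. This is only a sketch: it assumes a tab-delimited BED with the barcode in column 4, and two lookup tables that map each 8-mer to its set label. The function name, the lookup-table format and the output file names are all assumptions for illustration.

```shell
# split_bed_by_replicate BED SET7_TSV SET5_TSV OUTDIR
# SET7_TSV and SET5_TSV are hypothetical two-column, tab-delimited lookup
# tables: an 8-mer, then its set label (e.g. "NNNNNNNN<TAB>7a").
split_bed_by_replicate() {
  mkdir -p "$4"
  awk -F'\t' -v outdir="$4" '
    FILENAME == ARGV[1] { set7[$1] = $2; next }   # 8-mer -> 7a / 7b / 7c
    FILENAME == ARGV[2] { set5[$1] = $2; next }   # 8-mer -> 5a / 5b
    {
      a = substr($4, 9, 8)    # 7x segment of the barcode
      b = substr($4, 25, 8)   # 5x segment of the barcode
      if ((a in set7) && (b in set5)) name = set7[a] "_" set5[b]
      else name = "unassigned"
      print > (outdir "/" name ".bed")
    }
  ' "$2" "$3" "$1"
}
```

Called as `split_bed_by_replicate your_file.bed set7.tsv set5.tsv split`, this would write files such as `split/7a_5a.bed`, plus `split/unassigned.bed` for rows whose segments are missing from either table. Rows keep their original order, so a sorted input stays sorted.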

ADD COMMENT • link updated 16 months ago by Ram 44k • written 4.0 years ago by summer_researcher • 0

Oh, is it space delimited rather than tab delimited? I also forgot to throw a sort in there. Can you try:

cut -f4 -d" " your_file.bed | sort | uniq > output.txt

Using uniq -c will also give the counts for each barcode if that's of interest.

ADD REPLY • link 4.0 years ago by jared.andrews07 ★ 17k
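
The `sort` matters because `uniq` only collapses adjacent duplicate lines. A small sketch of both variants mentioned above, assuming a tab-delimited file with the barcode in column 4 (the function names are just for illustration):

```shell
# unique_barcodes FILE
# Print each distinct column-4 value once, in sorted order.
unique_barcodes() {
  cut -f4 "$1" | sort -u
}

# barcode_counts FILE
# Print the number of rows per barcode, most frequent first.
# Output is in `uniq -c` format: count, then barcode.
barcode_counts() {
  cut -f4 "$1" | sort | uniq -c | sort -k1,1nr
}
```

For example, `unique_barcodes your_file.bed > barcodes.txt` would produce the full barcode list, one per line.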

Thanks again. I think it's tab delimited, because your command reproduced the same BED file. Should I run:

cut -f4 -d"\t" your_file.bed | sort | uniq > output.txt

to indicate that it is tab delimited?

ADD REPLY • link 4.0 years ago by summer_researcher • 0

cut uses tab as its delimiter by default, so no -d option is needed, and your example above works for me when it is tab delimited. I'd check the file to figure out what's going on.

ADD REPLY • link 4.0 years ago by jared.andrews07 ★ 17k
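
One way to check the file is to count the tab-separated fields on each line. A sketch, with `field_count_summary` as a made-up name:

```shell
# field_count_summary FILE
# Tally how many tab-separated fields each line has.
# Each output line is in `uniq -c` format: number of lines, then field count.
# A tab-delimited five-column BED should give one line ending in 5.
field_count_summary() {
  awk -F'\t' '{ print NF }' "$1" | sort -n | uniq -c
}
```

If the field count comes back as 1 for every line, the columns are separated by something other than a tab. Running `head -n 2 your_file.bed | od -c` prints tabs as `\t`, so the actual separator becomes visible.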

Worked perfectly! Thanks-

ADD REPLY • link 4.0 years ago by summer_researcher • 0

Shouldn't all of these barcodes be unique then?

cut -f4 your_file.bed | uniq > output.bed will get all the unique barcodes if there truly are repeats, but there really shouldn't be unless I'm missing something.

ADD REPLY • link 4.0 years ago by jared.andrews07 ★ 17k

Thanks, I'll give that a try. I think the BED file has a row for each barcode-chromosome position pair. So if I have 10,000 cells and each has an average of 20,000 unique features, doesn't the BED file have 10,000 × 20,000 rows but only 10,000 unique barcodes?

ADD REPLY • link 4.0 years ago by summer_researcher • 0

This code made 90 million rows of output but there are only 40,000 cells. Is there a way to tell it that I want unique barcodes (f4) and not unique rows? Thanks again -

ADD REPLY • link 4.0 years ago by summer_researcher • 0
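
To get unique barcodes rather than unique rows, the deduplication has to be keyed on column 4 alone. A sketch with awk, assuming a tab-delimited file; memory use grows with the number of distinct barcodes (tens of thousands here), not with the 90 million rows. The function names are made up for illustration:

```shell
# first_seen_barcodes FILE
# Print each distinct column-4 barcode once, in order of first appearance.
# No sort is needed because awk remembers which barcodes it has printed.
first_seen_barcodes() {
  awk -F'\t' '!seen[$4]++ { print $4 }' "$1"
}

# bed_barcode_summary FILE
# One pass over the file: print the total number of rows and the number
# of distinct column-4 barcodes, tab-separated.
bed_barcode_summary() {
  awk -F'\t' '
    !seen[$4]++ { n_unique++ }
    END { printf "%d\t%d\n", NR, n_unique }
  ' "$1"
}
```

`bed_barcode_summary your_file.bed` would show both numbers side by side, which makes it easy to confirm that the row count is far larger than the cell count.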