First post to the forum, sorry I'm new at this:
I inherited a BED file which has 32-length barcodes.
1) The first 8 digits are the cell's unique code
2) The next 8 match an entry from one of three sets (either 7a, 7b or 7c)
3) The next 8 digits are also a part of the cell's unique code
4) The last 8 digits match one of two sets (5a or 5b)
We have six technical replicates that are defined as some combination of a 7x and a 5x code.
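To make the layout above concrete, here is a minimal sketch that slices one barcode into its four 8-character segments. The barcode is taken from the sample rows in this post; the variable names are only illustrative, and the sketch assumes the segments sit at fixed positions 1-8, 9-16, 17-24 and 25-32 as described.

```shell
# Example barcode copied from the sample data in this post.
barcode="ATTCAGAAAAGAGGCAACGTCCTGTCTTACGC"

# cut -c uses 1-based, inclusive character positions.
cell_part1=$(printf '%s' "$barcode" | cut -c1-8)    # first half of the cell code
set7_code=$(printf '%s' "$barcode" | cut -c9-16)    # matches an entry in 7a, 7b or 7c
cell_part2=$(printf '%s' "$barcode" | cut -c17-24)  # second half of the cell code
set5_code=$(printf '%s' "$barcode" | cut -c25-32)   # matches an entry in 5a or 5b

# The (7x, 5x) pair is what identifies the technical replicate.
echo "$cell_part1 $set7_code $cell_part2 $set5_code"
# prints: ATTCAGAA AAGAGGCA ACGTCCTG TCTTACGC
```

Looking up set7_code and set5_code in the 7a/7b/7c and 5a/5b lists would then give the replicate for that barcode.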
I did a lot of analysis with this data in Signac/Seurat this summer; I was given a filtered Seurat object to work with. Now, I'm trying to re-do the QC (most importantly, re-do the doublet analysis), and I should input the data as a separate BED file for each of the six technical replicates. This is my first time having access to the original BED file.
I have a lot of the barcodes in my Seurat object but because of filtering it doesn't have all of them. I have a simple script that sorts the barcodes into one of the six technical replicates, but I don't have a list of all the barcodes in the BED file (I'm starting to realize this is usually given separately, but I didn't inherit it).
So I guess my question is whether there is an easy way to do this filtering entirely in the terminal and/or an easy way to extract all the unique barcodes from a BAM or BED file. [After extracting the barcodes, I know how to make 6 whitelists (.csv) and then split using sinto].
My data looks like this:
chr1 10126 10175 ATTCAGAAAAGAGGCAACGTCCTGTCTTACGC 1
chr1 10150 10180 ATTACTCGAACCAGGTAGGATAGGCTCCTTAC 1
chr1 10151 10192 GAGATTCCCGTATAGAAGGATAGGACTCTAGG 1
Oh, is it space delimited rather than tab delimited? I also forgot to throw a sort in there. Can you try:
cut -f4 -d" " your_file.bed | sort | uniq > output.txt
uniq -c will also give the counts for each barcode if that's of interest.
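For reference, a small sketch of the counting variant on a made-up, tab-delimited file. The short barcodes AAAA and CCCC and the file names are placeholders, not real data.

```shell
tmpdir=$(mktemp -d)

# Three toy rows; barcode AAAA appears twice, CCCC once.
printf 'chr1\t10126\t10175\tAAAA\t1\n'  > "$tmpdir/example.bed"
printf 'chr1\t10150\t10180\tCCCC\t1\n' >> "$tmpdir/example.bed"
printf 'chr1\t10151\t10192\tAAAA\t1\n' >> "$tmpdir/example.bed"

# sort groups identical barcodes together so uniq -c can count them;
# the final sort puts the most frequent barcode first.
cut -f4 "$tmpdir/example.bed" | sort | uniq -c | sort -k1,1nr > "$tmpdir/barcode_counts.txt"

cat "$tmpdir/barcode_counts.txt"
```

Each output line is a count followed by the barcode, which works as a rough fragments-per-barcode tally.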
Thanks again, I think it's tab delimited because your command reproduced the same BED file. Should I run:
to indicate tab delimited?
cut expects tab-delimited input by default, and your example above works for me when tab delimited. I'd check the file to try to figure out what's going on.
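One way to check the delimiter, sketched on two made-up one-line files (the file names and the AAAA barcode are placeholders): count the fields per line when splitting on tabs. It also shows why a space delimiter on a tab-delimited file gives the whole file back, since cut passes through any line that does not contain the delimiter unless -s is given.

```shell
tmpdir=$(mktemp -d)
printf 'chr1\t10126\t10175\tAAAA\t1\n' > "$tmpdir/tab.bed"
printf 'chr1 10126 10175 AAAA 1\n'     > "$tmpdir/space.bed"

# Number of tab-separated fields per line: 5 for a tab-delimited
# BED row like this one, 1 if the line contains no tabs at all.
tab_fields=$(awk -F'\t' '{ print NF }' "$tmpdir/tab.bed" | sort -u)
space_fields=$(awk -F'\t' '{ print NF }' "$tmpdir/space.bed" | sort -u)
echo "tab.bed: $tab_fields fields, space.bed: $space_fields field"

# With -d" " on the tab-delimited file, no line contains the
# delimiter, so cut prints every line unchanged.
cut -d" " -f4 "$tmpdir/tab.bed"
```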
Worked perfectly! Thanks-
Shouldn't all of these barcodes be unique then?
cut -f4 your_file.bed | uniq > output.bed will get all the unique barcodes if there truly are repeats, but there really shouldn't be unless I'm missing something.
Thanks, I'll give that a try. I think the BED file has a row for each barcode-chromosome position pair. So if I have 10,000 cells and each has an average of 20,000 unique features then doesn't the bed file have 10,000 times 20,000 rows but only 10,000 unique barcodes?
This code made 90 million rows of output but there are only 40,000 cells. Is there a way to tell it that I want unique barcodes (f4) and not unique rows? Thanks again -
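A likely explanation is that uniq only collapses duplicates that are adjacent, so on an unsorted column nearly every row survives. Here is a minimal sketch of two ways around that, using a made-up tab-delimited file (the AAAA and CCCC barcodes and the file names are placeholders):

```shell
tmpdir=$(mktemp -d)

# Toy file: two distinct barcodes, with the repeats not adjacent.
printf 'chr1\t10126\t10175\tAAAA\t1\n'  > "$tmpdir/example.bed"
printf 'chr1\t10150\t10180\tCCCC\t1\n' >> "$tmpdir/example.bed"
printf 'chr1\t10151\t10192\tAAAA\t1\n' >> "$tmpdir/example.bed"

# uniq alone keeps all three rows because no two equal lines touch.
n_uniq_only=$(cut -f4 "$tmpdir/example.bed" | uniq | wc -l | tr -d ' ')

# Option 1: sort first; sort -u sorts and de-duplicates in one step.
cut -f4 "$tmpdir/example.bed" | sort -u > "$tmpdir/unique_barcodes.txt"
n_sorted=$(wc -l < "$tmpdir/unique_barcodes.txt" | tr -d ' ')

# Option 2: awk prints a barcode only the first time it is seen,
# so nothing needs sorting; memory grows with the number of
# distinct barcodes, not the number of rows.
awk -F'\t' '!seen[$4]++ { print $4 }' "$tmpdir/example.bed" > "$tmpdir/unique_barcodes_awk.txt"
n_awk=$(wc -l < "$tmpdir/unique_barcodes_awk.txt" | tr -d ' ')

echo "uniq only: $n_uniq_only, sort -u: $n_sorted, awk: $n_awk"
```

On the real file, replacing example.bed with your BED file should give one line per barcode, so roughly the number of cells plus any background barcodes that were filtered out of the Seurat object.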