Question

Combining demultiplexed files based on identical basename and different "paired" barcodes

9 months ago

snowpin • 0

I have over 800,000 fastq.gz files after demultiplexing and am trying to combine them based on barcodes (BCs) and basenames. Below is an example of my data. Each file has a basename (sample#) and a BC1 identifier (BC1_#).

sample1_BC1_1_R1.fastq.gz
sample1_BC1_49_R1.fastq.gz
sample1_BC1_2_R1.fastq.gz
sample1_BC1_50_R1.fastq.gz

sample2_BC1_1_R1.fastq.gz
sample2_BC1_49_R1.fastq.gz
sample2_BC1_2_R1.fastq.gz
sample2_BC1_50_R1.fastq.gz

I want to combine files that share a basename and belong to a specific pair of BC1 identifiers. In other words, each sample received two different BC1s, and the following BC1 identifiers should be combined:

BC1_1 and BC1_49
BC1_2 and BC1_50
BC1_3 and BC1_51
...
BC1_48 and BC1_96
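
The list above appears to follow one rule: barcode N is combined with barcode N + 48, for N from 1 to 48. A minimal Python sketch of that rule, assuming the elided rows follow the same pattern (the function name `barcode_pairs` and its parameters are illustrative, not part of the original post):

```python
def barcode_pairs(n_pairs=48, offset=48):
    """Return (low, high) BC1 numbers, assuming barcode N pairs with N + offset."""
    return [(n, n + offset) for n in range(1, n_pairs + 1)]


pairs = barcode_pairs()
# First two pairs and the last one
print(pairs[0], pairs[1], pairs[-1])  # (1, 49) (2, 50) (48, 96)
```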

For the example above with 8 files, my output would be 4 files...

sample1_BC1_1-49_R1.fastq.gz
sample1_BC1_2-50_R1.fastq.gz
sample2_BC1_1-49_R1.fastq.gz
sample2_BC1_2-50_R1.fastq.gz
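
One way to arrive at these output names is to parse each filename and look up its partner barcode. The sketch below only plans the merges and touches no files. It assumes every file is named `<sample>_BC1_<number>_<read>.fastq.gz` and that barcode N pairs with N + 48; the names `NAME_RE` and `plan_merges` are hypothetical, and the `R2` handling is an assumption, since only `R1` files are shown here.

```python
import re

# Assumed filename layout: <sample>_BC1_<number>_<R1|R2>.fastq.gz
NAME_RE = re.compile(r"^(?P<sample>.+)_BC1_(?P<bc>\d+)_(?P<read>R[12])\.fastq\.gz$")


def plan_merges(filenames, offset=48):
    """Map each output filename to the two input files that should be combined."""
    by_key = {}
    for fn in filenames:
        m = NAME_RE.match(fn)
        if m:
            by_key[(m["sample"], int(m["bc"]), m["read"])] = fn

    plan = {}
    for (sample, bc, read), low_file in sorted(by_key.items()):
        high_file = by_key.get((sample, bc + offset, read))
        # Only the lower barcode of each pair starts a merge.
        if bc <= offset and high_file is not None:
            out = f"{sample}_BC1_{bc}-{bc + offset}_{read}.fastq.gz"
            plan[out] = [low_file, high_file]
    return plan


example = [
    "sample1_BC1_1_R1.fastq.gz", "sample1_BC1_49_R1.fastq.gz",
    "sample1_BC1_2_R1.fastq.gz", "sample1_BC1_50_R1.fastq.gz",
    "sample2_BC1_1_R1.fastq.gz", "sample2_BC1_49_R1.fastq.gz",
    "sample2_BC1_2_R1.fastq.gz", "sample2_BC1_50_R1.fastq.gz",
]
for out, inputs in plan_merges(example).items():
    print(out, "<-", inputs)
```

Files whose partner barcode is missing are simply left out of the plan, which makes gaps in an 800,000-file directory easy to spot by comparing counts.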

How can I do this in Linux or Python, or even R? Thank you in advance! I am not yet very proficient with Linux or Python, so any help is welcome.

I have tried looping through the files to identify those with the same basename, but I am having trouble concatenating only the files that carry the right pair of BC1 identifiers.

barcode demultiplex • 580 views

ADD COMMENT • link updated 9 months ago by Ram 43k • written 9 months ago by snowpin • 0

Curious about how the data ended up in this format? Is this some kind of custom single cell data design? If it is one of the standard single-cell platforms then this may have been made more complicated than necessary.

ADD REPLY • link 9 months ago by GenoMax 142k

Hi,

Yes, this is a custom single-cell protocol where I had to demultiplex the data myself. And yes, BC1 is given to all samples and is combined with the numbers I mentioned above, so the pairs should be...

BC1_1 and BC1_49
BC1_2 and BC1_50
... and so forth.

I edited my original post. Hopefully that helps provide more insight into how I can solve this issue!

ADD REPLY • link 9 months ago by snowpin • 0

score 0 · Answer 1 · 2023-07-23

9 months ago

GenoMax 142k

Does this look right for one set of numbers? So BC1 does not change for all 800K files?

$ for i in *_BC1_1_R1.fastq.gz; do name=$(basename "${i}" _BC1_1_R1.fastq.gz); echo "cat ${name}_BC1_1_R1.fastq.gz ${name}_BC1_49_R1.fastq.gz > ${name}_BC1_1-49_R1.fastq.gz"; done
cat sample1_BC1_1_R1.fastq.gz sample1_BC1_49_R1.fastq.gz > sample1_BC1_1-49_R1.fastq.gz
cat sample2_BC1_1_R1.fastq.gz sample2_BC1_49_R1.fastq.gz > sample2_BC1_1-49_R1.fastq.gz

ADD COMMENT • link 9 months ago by GenoMax 142k

Yes, this is perfect and exactly what I've been trying to do for hours! Many thanks!

Yes, the "BC1" prefix (those three characters only) is constant for all ~800k files, while the number following it varies depending on which mix of barcodes the sample received. The barcodes are part of a primer set, and since the primers do not have high capture efficiency in isolation, I am trying a combination of the two.

ADD REPLY • link 9 months ago by snowpin • 0