Question: Split fastq according to barcodes
0
gravatar for Hughie
4 months ago by
Hughie10
Hughie10 wrote:

Hello, everyone:

I'm recently analyze my scRNA-seq data, the first step is to splitting fastq files according to my barcode file which looks like this:

sc1 AACGTGAT
sc2 AAACATCG
sc3 ATGCCTAA
sc4 AGTGGTCA
sc5 ACCACTGT
sc6 ACATTGGC
sc7 CAGATCTG
sc8 CATCAAGT
sc9 CGCTGATC
sc10    ACAAGCTA
sc11    CTGTAGCC
sc12    AACGCTTA

My data is pair end sequenced and the R1, R2 are like these (I trimmed some):
R2:

@ST-E00493:75:H33JKALXX:1:1101:10987:2206 2:N:0:ATACACAT    
AACGCTTAAGGGTAATTTTTTGTGTTATGTATTTTTTTTTTAGGGGAAAAGGCATTTTTGGT...
+
AAFFFFJJ<A7JF<JF----AA--A--7----AAFJ-F<-FF-<<F-<-AFFA-7A7A-A-<...

R1:

@ST-E00493:75:H33JKALXX:1:1101:10987:2206 1:N:0:ATACACAT
GTTGTGAAGGGGAGGCTGGAGAGGCTTCGTCTGCTAAGAGCATTGGCCGTTCTTCCACTGTT...
+
AAAFFFJ-<JJJJJJJJFJJJF7JFFJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJFFJJJ...

The barcode information is in the first 8bp of R2 (Here is AACGCTTA), so, I want to split the fastq file according to the barcode informations and pair the read_1 to read_2 by header info. But after I searched many programes or scripts I can't find a suitable solution:
fastq-multx

fastq-multx -B barcode_sequence -b -m 0 R2.fastq.gz R1.fastq.gz -o %_R1.fq -o %_R2.fq
The result is absolutely not what I want which only 7 lines head with its barcode.

fastx_barcode_splitter.pl It seems don't spport PE reads.

BBmap

I also wrote a python script, but it runs so slow.... , So, I wonder if somebody have good suggestions. Thanks in advance!

ADD COMMENTlink modified 4 months ago by finswimmer11k • written 4 months ago by Hughie10
1

Isn't the barcode the last part of the header, in your example ATACACAC and ATACACAT?

If so demuxbyname.sh should help.

fin swimmer

ADD REPLYlink written 4 months ago by finswimmer11k

No, it's in the head 8bp of read_2

ADD REPLYlink written 4 months ago by Hughie10
1

Index sequences are in both R1/R2 headers (see example you posted above). As suggested by @finswimmer demuxbyname.sh should indeed work in this case.

ADD REPLYlink modified 4 months ago • written 4 months ago by genomax65k

Thanks for your reply, genomax , but this protocol is modified, so the barcode is in the head 8bp of Read_2. At the very beginning when I recieved the data, I also thought the barcode is in the header of both reads, so weird a protocol :(

ADD REPLYlink written 4 months ago by Hughie10
1

use fastx tool kit: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_barcode_splitter_usage

ADD REPLYlink modified 4 months ago • written 4 months ago by cpad011211k

Thanks cpad0112 for your reply, I have tried this method before but it can only handle SE data, and I have to get a script for pairing read1 and read2.

ADD REPLYlink written 4 months ago by Hughie10

you may want to look at this python library ( I haven't used it): https://pypi.org/project/barcode-splitter/

ADD REPLYlink written 4 months ago by cpad011211k

I also wrote a python script, but it runs so slow

show us the code

ADD REPLYlink written 4 months ago by Pierre Lindenbaum118k

The first version I wrote is too slow, and I found a python script paired-end_barcode_splitter from github. It uses a perl script named "fastx_barcode_splitter.pl" from FASTX-toolkit to split my read_2 data according to barcode information and after that, fastq header was used to match the paired reads. But It's only litter faster and it takes about 10h to run my 40GB PE gzipped data. so, I wonder if anyone has a better solution, thanks a lot.

ADD REPLYlink modified 4 months ago • written 4 months ago by Hughie10
3
gravatar for finswimmer
4 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

Here a unix commandline only version:

$ paste <(zcat input_1.fastq.gz|paste - - - -) <(zcat input_2.fastq.gz|paste - - - - )|awk -v FS="\t" -v OFS="\n" 'FNR==NR {samples[$2]=$1; next} {barcode = substr($6,0,8); if(samples[barcode]) { print $1,$2,$3,$4>>samples[barcode]"_1.fastq"; print $5,$6,$7,$8>>samples[barcode]"_2.fastq"}}' samples.txt -
  • First we put the 4 lines, that belong to one read into one line
  • Then we put these 4 columns per read from the read1 file side-by-side with the 4 columns of the read2 file. The result is, that we have 8 columns per read pairs which are piped to awk
  • awk first reads in the file with the barcodes and names
  • After that it have a look at the first 8 characters of column 6 (which is the sequence of R2). If this is a barcode of a sample, we print the first 4 column to a file for read1 and the last 4 columns for read2. The delimiter is now converted back to new line.

You may want to compress the fastq files afterwards:

$ parallel 'bgzip {}' ::: *.fastq

fin swimmer

ADD COMMENTlink written 4 months ago by finswimmer11k

Thanks a lot for your reply fin swimmer! This solution is so amazing, and it's what I'm looking for. I have tested it on my server and I will report the running time later. By the way, linux command is so powerful and also complicated which I have a long way to go...

ADD REPLYlink written 4 months ago by Hughie10

It's 10 times faster than my previous scripts and it only toke ~1h to split my 40G gzipped PE data! So, amazing! Thank you again :)

ADD REPLYlink written 4 months ago by Hughie10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1125 users visited in the last hour