Question: Splitting a fasta file on the basis of header barcodes
3.0 years ago, eoin wrote:

Hi folks,

I'm having a bit of a brain fart, and I'm sure there's a very simple solution to this. I have a FASTA file containing reads from 48 different samples, with a barcode in each header line:

>10_13 M01383:135:000000000-A7LW3:1:1101:16875:1408 1:N:0:1 orig_bc=GTACATACCGGT new_bc=GTACATACCGGT bc_diffs=0

I'm trying to split this into three separate files based on this particular experiment, let's say f1.fa, f2.fa, and f3.fa. I have a list of all the barcodes and the sample each relates to.

I've been playing with awk, but to no avail. Is there a bit of code for this, or a useful tool?



rna-seq dna-seq fasta • 1.9k views

BBMap almost does that; it may work if you modify your header.

written 3.0 years ago by h.mon (modified 3.0 years ago)

It would need a modified header to work in suffix mode (where it automatically creates one file per barcode without being given a list of barcodes, which is convenient when you don't know the barcodes). But if you do know the barcodes, you can list all 48 of them and run it in substring mode, like this: in=samples.fa out=out_%.fa substringmode names=GTACATACCGGT,AAAAAAAAAAA

For reference: if the reads have standard Illumina headers that end with the barcode, you can generate one output file per barcode like this: in=all.fq.gz delimiter=: suffixmode out=%.fq.gz

That works for reads named like this:

@A00178:23:H2Y3GDMXX:1:1101:1344:1000 1:N:0:CGTACTAG+CTAAGCCT

...and would create a file named "CGTACTAG+CTAAGCCT.fq.gz".
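
To make the naming rule above concrete, here is a small Python sketch of the idea as described in this comment: take whatever follows the last delimiter in the header and substitute it for % in the output pattern. This only illustrates the rule and is not BBMap's actual implementation; the function name suffix_output_name is made up for the example.

```python
def suffix_output_name(header, delimiter=":", pattern="%.fq.gz"):
    """Return the output file name implied by the text after the last delimiter."""
    # Everything after the final delimiter is treated as the barcode.
    suffix = header.rsplit(delimiter, 1)[-1]
    # '%' in the pattern is the placeholder for that barcode.
    return pattern.replace("%", suffix)


illumina = "@A00178:23:H2Y3GDMXX:1:1101:1344:1000 1:N:0:CGTACTAG+CTAAGCCT"
question = (">10_13 M01383:135:000000000-A7LW3:1:1101:16875:1408 1:N:0:1 "
            "orig_bc=GTACATACCGGT new_bc=GTACATACCGGT bc_diffs=0")

print(suffix_output_name(illumina))
# The header from the question does not end with the barcode, so the
# "suffix" is not a usable barcode without modifying the header first.
print(suffix_output_name(question, pattern="%.fa"))
```

The second call shows why the header in the question would need modifying before suffix mode could work: the text after its last colon is a mix of several fields, not a barcode.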

written 3.0 years ago by Brian Bushnell (modified 2.9 years ago)

Rolling something in Biopython wouldn't be too painful. You should have a look at the tutorial and cookbook. Feel free to ask for help if you get stuck (but show us the code and what goes wrong). For sure there are also multiple other solutions.

written 3.0 years ago by WouterDeCoster
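
For anyone who wants a starting point, here is a minimal sketch in plain Python (standard library only, so not Biopython itself) of splitting by barcode into a few sample files. It assumes every header carries a new_bc=... field as in the question; the function name split_fasta_by_barcode and the barcode_to_file mapping are hypothetical and would be built from your own barcode list.

```python
import os
import re

# Matches the barcode in a header such as "... new_bc=GTACATACCGGT bc_diffs=0".
NEW_BC = re.compile(r"\bnew_bc=(\S+)")


def split_fasta_by_barcode(fasta_path, barcode_to_file, out_dir="."):
    """Write each FASTA record to the file mapped to its new_bc barcode.

    Returns a dict of output file name -> number of records written.
    Records whose barcode is not in barcode_to_file are skipped.
    Existing output files are overwritten.
    """
    handles = {}
    counts = {}
    current = None  # output handle for the record being read, or None
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    match = NEW_BC.search(line)
                    name = barcode_to_file.get(match.group(1)) if match else None
                    if name is None:
                        current = None  # unknown barcode: skip this record
                    else:
                        if name not in handles:
                            path = os.path.join(out_dir, name)
                            handles[name] = open(path, "w")
                        current = handles[name]
                        counts[name] = counts.get(name, 0) + 1
                if current is not None:
                    current.write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

You would call it with something like split_fasta_by_barcode("input.fa", {"GTACATACCGGT": "f1.fa"}), listing all 48 barcodes in the mapping so that several barcodes can point at the same output file.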
3.0 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Assuming you want the new_bc barcode and it's always in the same place:

awk -F '[ =]' '/^>/{f=sprintf("%s.fa",$7);} { print $0 >> f;}' input.fa
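
To see why field 7 is the one that holds the new_bc value, here is a small Python sketch that mimics the same split on the example header from the question. The helper name awk_field is made up for this illustration; it only reproduces the splitting, not the file writing.

```python
import re

header = (">10_13 M01383:135:000000000-A7LW3:1:1101:16875:1408 1:N:0:1 "
          "orig_bc=GTACATACCGGT new_bc=GTACATACCGGT bc_diffs=0")


def awk_field(line, n, separator=r"[ =]"):
    """Return field n (1-based, like awk's $n) after splitting on the separator regex."""
    return re.split(separator, line)[n - 1]


# -F '[ =]' splits on every single space or '=' character,
# so "new_bc=GTACATACCGGT" becomes fields 6 and 7.
print(awk_field(header, 6))
print(awk_field(header, 7) + ".fa")  # the file name the awk command builds
```

Note that the position only holds while every header has the same layout; an extra space-separated field earlier in the line would shift the barcode to a different field number.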

3.0 years ago, glihm wrote:

FASTX Barcode Splitter sounds like your solution! ;)

Documentation and Download.
