Question: Splitting a fasta file on the basis of header barcodes
eoin wrote, 20 months ago:

Hi folks,

Having a bit of a brain fart; I'm sure there's a very simple solution to this. I have a FASTA file containing reads from 48 different samples, with a barcode in each header line:

>10_13 M01383:135:000000000-A7LW3:1:1101:16875:1408 1:N:0:1 orig_bc=GTACATACCGGT new_bc=GTACATACCGGT bc_diffs=0

I'm trying to split this into three separate files based on this particular experiment, let's say f1.fa, f2.fa, and f3.fa. I have a list of all the barcodes and the sample each relates to.

I've been playing with awk but to no avail. Is there either a bit of code for this or a useful tool?



rna-seq dna-seq fasta • 1.1k views

BBMap almost does that; it may work if you modify your header.

modified 20 months ago • written 20 months ago by h.mon

It would need a modified header to work in suffix mode (where it automatically creates one file per barcode without being given a list of barcodes, which is convenient when you don't know the barcodes). But if you do know the barcodes, you can list all 48 of them and run it in substring mode, like this:

in=samples.fa out=out_%.fa substringmode names=GTACATACCGGT,AAAAAAAAAAA

For reference: if the reads have standard Illumina headers that end with the barcode, you can generate one output file per barcode like this:

in=all.fq.gz delimiter=: suffixmode out=%.fq.gz

That works for reads named like this:

@A00178:23:H2Y3GDMXX:1:1101:1344:1000 1:N:0:CGTACTAG+CTAAGCCT

...and would create a file named "CGTACTAG+CTAAGCCT.fq.gz".

modified 19 months ago • written 20 months ago by Brian Bushnell

Rolling something in Biopython wouldn't be too painful. You should have a look at the tutorial and cookbook. Feel free to ask for help if you get stuck (but show us the code and what goes wrong). For sure there are also multiple other solutions.

written 20 months ago by WouterDeCoster
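
As a starting point for the Python route, here is a minimal sketch using only the standard library (Biopython's SeqIO could replace the hand-rolled parsing). It assumes the barcode of interest is the new_bc=... field shown in the question, and that you have already turned your barcode list into a dictionary mapping each barcode to an output file. The function name split_fasta_by_barcode and the file names are made up for illustration.

```python
import re
from contextlib import ExitStack

# Matches the new_bc=... field seen in the example header from the question.
NEW_BC = re.compile(r"\bnew_bc=(\S+)")


def split_fasta_by_barcode(fasta_path, barcode_to_file):
    """Write each record to the file its barcode maps to.

    barcode_to_file is a hypothetical dict such as
    {"GTACATACCGGT": "f1.fa", ...}; several barcodes may share one file.
    Returns the number of records written per output file.
    """
    counts = {}
    with ExitStack() as stack:
        handles = {}   # output path -> open file handle
        out = None     # handle for the record currently being read
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    match = NEW_BC.search(line)
                    path = barcode_to_file.get(match.group(1)) if match else None
                    if path is None:
                        out = None  # unknown or missing barcode: skip record
                    else:
                        if path not in handles:
                            handles[path] = stack.enter_context(open(path, "w"))
                        out = handles[path]
                        counts[path] = counts.get(path, 0) + 1
                if out is not None:
                    out.write(line)
    return counts
```

Calling split_fasta_by_barcode("input.fa", mapping) with all 48 barcodes pointing at f1.fa, f2.fa, or f3.fa would give the three files asked for. Records whose barcode is not in the mapping are skipped silently here; you may prefer to collect them in a separate file.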
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 20 months ago:

Assuming you want the new_bc value and it's always in the same place:

awk -F '[ =]' '/^>/{f=sprintf("%s.fa",$7);} { print $0 >> f;}' input.fa
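
The one-liner above relies on new_bc being the seventh field once the header is split on spaces and "=". If the field order might vary between files, looking the value up by name is more robust. A minimal Python sketch of that idea follows; the helper name header_fields is hypothetical.

```python
def header_fields(header):
    """Return the key=value pairs of a FASTA header line as a dict.

    Tokens without '=' (read ID, Illumina coordinates) are ignored.
    """
    fields = {}
    for token in header.lstrip(">").split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# Header copied from the question.
header = (">10_13 M01383:135:000000000-A7LW3:1:1101:16875:1408 1:N:0:1 "
          "orig_bc=GTACATACCGGT new_bc=GTACATACCGGT bc_diffs=0")
print(header_fields(header)["new_bc"])  # GTACATACCGGT
```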
glihm wrote, 20 months ago:

FASTX Barcode Splitter sounds like your solution! ;)

Documentation and Download.
