Question

Adding barcodes to forward and reverse reads after sequencing

2

Entering edit mode

9.2 years ago

cmogren ▴ 20

I am using the obitools pipeline from metabarcoding.org to do dietary analysis on Illumina pair-end sequenced samples using the trnL gene. This is my first time doing a sequencing project and I hadn't given thought to the bioinformatics aspect at the time of sequencing.

My data came back demultiplexed, with each of 384 samples having the barcode in the label for forward and reverse reads. In order to use the obitools program, however, I need to have the barcode still in the sequence.

E.g. my samples look like this in the fastq file:

@HWI-M03023:75:000000000-D0ENP:1:1101:15759:1851 1:N:0:CTCTCTACACTGCATA
GGGCAATCCTGAGCCAAATCCCTTTTTTGAAAAACAAGTGGTTCTCAAACTAGAACCCAAAGGAAAAGGATAGGTGCAGAGACTCAATGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCTCTCTACATCTCGTATGCCGTCTTCTGCTTGATAAAAAAAACCCCCCCCCCTTCTCTCTTCTTTTCTTCCTCTTCTTTCCCTCCTTTCTTTTTTCCTTCTCTCCCACCTTTTTTTCATTTTTCCT
+

But I need them to look like this:

@HWI-M03023:75:000000000-D0ENP:1:1101:15759:1851
CTCTCTACGGGCAATCCTGAGCCAAATCCCTTTTTTGAAAAACAAGTGGTTCTCAAACTAGAACCCAAAGGAAAAGGATAGGTGCAGAGACTCAATGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCTCTCTACATCTCGTATGCCGTCTTCTGCTTGATAAAAAAAACCCCCCCCCCTTCTCTCTTCTTTTCTTCCTCTTCTTTCCCTCCTTTCTTTTTTCCTTCTCTCCCACCTTTTTTTCATTTTTCCTACTGCATA
+

Is there a program or script that can do this? Thanks!

obitools barcoding next-gen-sequencing • 4.7k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 9.2 years ago by cmogren ▴ 20

0

Entering edit mode

I'm surprised that they don't have a tool to process demultiplexed files. Even the non-demultiplexed ones will look like what you got rather than having the barcode where you apparently need it (it's sequenced after read1 and before read2).

ADD REPLY • link 9.2 years ago by Devon Ryan 104k

0

Entering edit mode

It is entirely possible that they do, but being new to the program and new to bioinformatics, I'm just following the example they provide for applying the pipeline (http://metabarcoding.org/obitools/doc/wolves.html). And there are a couple places where my files unfortunately aren't formatted exactly like theirs are. I've worked through other issues, but this one is a head scratcher. Specifically I'm trying to apply the ngsfilter script, and that's when I noticed the barcoding issue. Could a sed command be used for this (though I'm scared to ask)?

ADD REPLY • link 9.2 years ago by cmogren ▴ 20

0

Entering edit mode

I'd use bioawk to:

1. Pick substr() of the $name to form new header

2. cut the last part of the $name and concat each part to the $seq

ADD REPLY • link 9.2 years ago by Ram 44k

0

Entering edit mode

That's probably the simplest method. It should be pointed out that an equivalent length of fake quality score should be added as well. The quality score for the barcodes do exist, though they'd be in a separate file (that was likely never produced, it'd be really unusual to do so).

ADD REPLY • link 9.2 years ago by Devon Ryan 104k

0

Entering edit mode

Maybe use a 8-mer of random quality values twice? The substr() is crazy too, I'm thinking something like substr($name,len($name)-17,16) to pick the barcode and substr($name,0,len($name)-24) to be the new header.

And of course, the first string needs to be split into two. That escalated quickly. Where's the person that said bash was not much use again?

ADD REPLY • link 17 months ago by Ram 44k

0

Entering edit mode

I actually do have quality scores that are part of the fastq file - I just omitted them here to avoid clutter in the post! So the good news is no additional characters need to be added. Just need to delete the 1:N:0:barcode from the first line, and add the appropriate barcode to the beginning of the read 1 and read 2 files.

ADD REPLY • link 9.2 years ago by cmogren ▴ 20

0

Entering edit mode

Hi there. How do I replace the 4th line of my quality scores in the fastq to fake quality scores? Many thanks!

ADD REPLY • link 4.7 years ago by zach ▴ 10

0

Entering edit mode

You can use plain awk or bioawk. With plain awk, you can replace all lines where NR % 4 == 0 with the fake qual score value. It would be simple to use the same fake qual score value across all reads. If you wish to use different fake qual scores, things get a little more complicated.

ADD REPLY • link 4.6 years ago by Ram 44k

0

Entering edit mode

~~the answer is cut https://en.wikipedia.org/wiki/Cut_%28Unix%29~~

ADD REPLY • link 9.2 years ago by Pierre Lindenbaum 163k

0

Entering edit mode

that will need 2 variables to hold the two halves after a cut -d ":" and cut -c 8 on the result, and this has to happen within a loop. A quick script is better off, no?