Question: Adding barcodes to forward and reverse reads after sequencing
2
gravatar for cmogren
4.6 years ago by
cmogren20
United States
cmogren20 wrote:

I am using the obitools pipeline from metabarcoding.org to do dietary analysis on Illumina pair-end sequenced samples using the trnL gene. This is my first time doing a sequencing project and I hadn't given thought to the bioinformatics aspect at the time of sequencing.

My data came back demultiplexed, with each of 384 samples having the barcode in the label for forward and reverse reads. In order to use the obitools program, however, I need to have the barcode still in the sequence.

E.g. my samples look like this in the fastq file:

@HWI-M03023:75:000000000-D0ENP:1:1101:15759:1851 1:N:0:CTCTCTACACTGCATA
GGGCAATCCTGAGCCAAATCCCTTTTTTGAAAAACAAGTGGTTCTCAAACTAGAACCCAAAGGAAAAGGATAGGTGCAGAGACTCAATGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCTCTCTACATCTCGTATGCCGTCTTCTGCTTGATAAAAAAAACCCCCCCCCCTTCTCTCTTCTTTTCTTCCTCTTCTTTCCCTCCTTTCTTTTTTCCTTCTCTCCCACCTTTTTTTCATTTTTCCT
+

But I need them to look like this:

@HWI-M03023:75:000000000-D0ENP:1:1101:15759:1851
CTCTCTACGGGCAATCCTGAGCCAAATCCCTTTTTTGAAAAACAAGTGGTTCTCAAACTAGAACCCAAAGGAAAAGGATAGGTGCAGAGACTCAATGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCTCTCTACATCTCGTATGCCGTCTTCTGCTTGATAAAAAAAACCCCCCCCCCTTCTCTCTTCTTTTCTTCCTCTTCTTTCCCTCCTTTCTTTTTTCCTTCTCTCCCACCTTTTTTTCATTTTTCCTACTGCATA
+

Is there a program or script that can do this? Thanks!

ADD COMMENTlink written 4.6 years ago by cmogren20

I'm surprised that they don't have a tool to process demultiplexed files. Even the non-demultiplexed ones will look like what you got rather than having the barcode where you apparently need it (it's sequenced after read1 and before read2).

ADD REPLYlink written 4.6 years ago by Devon Ryan93k

It is entirely possible that they do, but being new to the program and new to bioinformatics, I'm just following the example they provide for applying the pipeline (http://metabarcoding.org/obitools/doc/wolves.html). And there are a couple places where my files unfortunately aren't formatted exactly like theirs are. I've worked through other issues, but this one is a head scratcher. Specifically I'm trying to apply the ngsfilter script, and that's when I noticed the barcoding issue. Could a sed command be used for this (though I'm scared to ask)?

ADD REPLYlink written 4.6 years ago by cmogren20

I'd use bioawk to:

1. Pick substr() of the $name to form new header

2. cut the last part of the $name and concat each part to the $seq

ADD REPLYlink written 4.6 years ago by RamRS25k

That's probably the simplest method. It should be pointed out that an equivalent length of fake quality score should be added as well. The quality score for the barcodes do exist, though they'd be in a separate file (that was likely never produced, it'd be really unusual to do so).

ADD REPLYlink written 4.6 years ago by Devon Ryan93k

Maybe use a 8-mer of random quality values twice? The substr() is crazy too, I'm thinking something like substr($name,len($name)-17,16) to pick the barcode and substr($name,0,len($name)-24) to be the new header. 

And of course, the first string needs to be split into two. That escalated quickly. Where's the person that said bash was not much use again?

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by RamRS25k

I actually do have quality scores that are part of the fastq file - I just omitted them here to avoid clutter in the post! So the good news is no additional characters need to be added. Just need to delete the 1:N:0:barcode from the first line, and add the appropriate barcode to the beginning of the read 1 and read 2 files.

ADD REPLYlink written 4.6 years ago by cmogren20

Hi there. How do I replace the 4th line of my quality scores in the fastq to fake quality scores? Many thanks!

ADD REPLYlink written 5 days ago by zach0

You can use plain awk or bioawk. With plain awk, you can replace all lines where NR % 4 == 0 with the fake qual score value. It would be simple to use the same fake qual score value across all reads. If you wish to use different fake qual scores, things get a little more complicated.

ADD REPLYlink written 5 days ago by RamRS25k
the answer is cut https://en.wikipedia.org/wiki/Cut_%28Unix%29
ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Pierre Lindenbaum124k

that will need 2 variables to hold the two halves after a cut -d ":" and cut -c 8 on the result, and this has to happen within a loop. A quick script is better off, no?

ADD REPLYlink written 4.6 years ago by RamRS25k

Ah, I was wrong , I read the question too quickly, sorry.

ADD REPLYlink written 4.6 years ago by Pierre Lindenbaum124k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1149 users visited in the last hour