I am using the obitools pipeline from metabarcoding.org to do dietary analysis on Illumina pair-end sequenced samples using the trnL gene. This is my first time doing a sequencing project and I hadn't given thought to the bioinformatics aspect at the time of sequencing.
My data came back demultiplexed, with each of 384 samples having the barcode in the label for forward and reverse reads. In order to use the obitools program, however, I need to have the barcode still in the sequence.
E.g. my samples look like this in the fastq file:
@HWI-M03023:75:000000000-D0ENP:1:1101:15759:1851 1:N:0:CTCTCTACACTGCATA
GGGCAATCCTGAGCCAAATCCCTTTTTTGAAAAACAAGTGGTTCTCAAACTAGAACCCAAAGGAAAAGGATAGGTGCAGAGACTCAATGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCTCTCTACATCTCGTATGCCGTCTTCTGCTTGATAAAAAAAACCCCCCCCCCTTCTCTCTTCTTTTCTTCCTCTTCTTTCCCTCCTTTCTTTTTTCCTTCTCTCCCACCTTTTTTTCATTTTTCCT
+
But I need them to look like this:
@HWI-M03023:75:000000000-D0ENP:1:1101:15759:1851
CTCTCTACGGGCAATCCTGAGCCAAATCCCTTTTTTGAAAAACAAGTGGTTCTCAAACTAGAACCCAAAGGAAAAGGATAGGTGCAGAGACTCAATGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCTCTCTACATCTCGTATGCCGTCTTCTGCTTGATAAAAAAAACCCCCCCCCCTTCTCTCTTCTTTTCTTCCTCTTCTTTCCCTCCTTTCTTTTTTCCTTCTCTCCCACCTTTTTTTCATTTTTCCTACTGCATA
+
Is there a program or script that can do this? Thanks!
I'm surprised that they don't have a tool to process demultiplexed files. Even the non-demultiplexed ones will look like what you got rather than having the barcode where you apparently need it (it's sequenced after read1 and before read2).
It is entirely possible that they do, but being new to the program and new to bioinformatics, I'm just following the example they provide for applying the pipeline (http://metabarcoding.org/obitools/doc/wolves.html). And there are a couple places where my files unfortunately aren't formatted exactly like theirs are. I've worked through other issues, but this one is a head scratcher. Specifically I'm trying to apply the ngsfilter script, and that's when I noticed the barcoding issue. Could a sed command be used for this (though I'm scared to ask)?
I'd use bioawk to:
1. Pick substr() of the $name to form new header
2. cut the last part of the $name and concat each part to the $seq
That's probably the simplest method. It should be pointed out that an equivalent length of fake quality score should be added as well. The quality score for the barcodes do exist, though they'd be in a separate file (that was likely never produced, it'd be really unusual to do so).
Maybe use a 8-mer of random quality values twice? The
substr()
is crazy too, I'm thinking something likesubstr($name,len($name)-17,16)
to pick the barcode andsubstr($name,0,len($name)-24)
to be the new header.And of course, the first string needs to be split into two. That escalated quickly. Where's the person that said bash was not much use again?
I actually do have quality scores that are part of the fastq file - I just omitted them here to avoid clutter in the post! So the good news is no additional characters need to be added. Just need to delete the 1:N:0:barcode from the first line, and add the appropriate barcode to the beginning of the read 1 and read 2 files.
Hi there. How do I replace the 4th line of my quality scores in the fastq to fake quality scores? Many thanks!
You can use plain awk or bioawk. With plain awk, you can replace all lines where
NR % 4 == 0
with the fake qual score value. It would be simple to use the same fake qual score value across all reads. If you wish to use different fake qual scores, things get a little more complicated.the answer is cut https://en.wikipedia.org/wiki/Cut_%28Unix%29that will need 2 variables to hold the two halves after a
cut -d ":"
andcut -c 8
on the result, and this has to happen within a loop. A quick script is better off, no?Ah, I was wrong , I read the question too quickly, sorry.