Simple FASTQ/A manipulation... how to add a single adapter sequence to 5' of all reads?
3.7 years ago
quickquark • 0

Hi everyone, and thanks in advance! I'm used to doing lots of trimming, substituting, etc. on large FASTQ/A files, but now I need to add an arbitrary sequence at the beginning of all reads and I'm coming up short! I've been searching for a couple of hours for a method via a toolkit (fastx_toolkit, BBmap, etc.) or a simple command (sed, awk, etc.).

So I'm looking to go from something like this:

>header
GTCTCAGATCGGAAGAGCACACGT
CCGGTCCTGGTTGCAGATCGGAAG
GTATCTCCTAAGATATAACAGGTTG
AGGTACAGGTTGGATGATAAGTCC


to this:

>header
AAAAAAGTCTCAGATCGGAAGAGCACACGT
AAAAAACCGGTCCTGGTTGCAGATCGGAAG
AAAAAAGTATCTCCTAAGATATAACAGGTTG
AAAAAAAGGTACAGGTTGGATGATAAGTCC


Alternatively, I can do the same with FASTQ files (also extending the quality lines to match), if there's already a tool out there for that. I'm not interested in quality at this point, as I've already merged paired-end reads with PandaSeq and filtered out anything but the highest-quality reads.

FASTA FASTQ • 1.8k views

While you have been given possible solutions below, you would be breaking the FASTQ format if you do not add corresponding scores on the quality line. The example you showed above is neither valid FASTA nor valid FASTQ.
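To keep a complete four-line FASTQ record valid, the quality string has to grow by the same number of characters as the sequence. The following is a minimal sketch with awk, not a tested tool: it assumes strict four-line records, the two input records are made up for illustration, and the placeholder quality character `I` (Phred 40 in Phred+33 encoding) is an arbitrary choice for the added bases.

```shell
adapter=AAAAAA
# Placeholder quality: one 'I' per adapter base (an assumed value, not a measured quality).
qual=$(printf '%s\n' "$adapter" | sed 's/./I/g')

# Two made-up FASTQ records are piped in; for a real file, pass its name to awk instead.
out=$(printf '@r1\nGTCTCAG\n+\nIIIIHHH\n@r2\nCCGGTC\n+\nFFFFFF\n' |
  awk -v a="$adapter" -v q="$qual" '
    NR % 4 == 2 { print a $0; next }   # sequence line: prepend the adapter
    NR % 4 == 0 { print q $0; next }   # quality line: prepend matching placeholder scores
                { print }              # header and + lines pass through unchanged
  ')
printf '%s\n' "$out"
```

Selecting lines by position (`NR % 4`) rather than by their first character avoids mistaking a quality line that happens to start with `@` for a header.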


Ah yes, sorry, I should have been more accurate with that in case others come across this. I'll edit it to look like a real FASTA.


quickquark : Please test @Pierre's solution. It should work and if it does you should accept that too. You can accept more than one answer if they work.

3.7 years ago

sed will do that:

$ sed 's|^\([^@>].*\)|AAAAAA\1|g' fastq.fq
@header
AAAAAAGTCTCAGATCGGAAGAGCACACGT
@header
AAAAAACCGGTCCTGGTTGCAGATCGGAAG
@header
AAAAAAGTATCTCCTAAGATATAACAGGTTG
@header
AAAAAAAGGTACAGGTTGGATGATAAGTCC

$ sed 's|^\([^@>].*\)|AAAAAA\1|g' fasta.fa
AAAAAAGTCTCAGATCGGAAGAGCACACGT
AAAAAACCGGTCCTGGTTGCAGATCGGAAG
AAAAAAGTATCTCCTAAGATATAACAGGTTG
AAAAAAAGGTACAGGTTGGATGATAAGTCC


The first part between the separators (|^\([^@>].*\)|) matches any line that does not start with @ or > and captures the whole line in a group (the escaped parentheses). The second part is the replacement: AAAAAA followed by group 1, which is the captured line. The character class has to sit inside the group; otherwise the first base of each read would be matched but not captured, and so dropped from the output.

Update: Added > to the negated character class so it also works for FASTA files. See also the comment below about FASTQ and multi-line FASTA files.


manuel.belmadani : You should update your solution to reflect the change OP made to the original question when you have a chance.


I added the > to the character class. Just be careful that your FASTA file doesn't wrap reads over multiple lines, or it will break: AAAAAA gets added at the beginning of every non-header line, even when several lines belong to the same contiguous read. That use case is a bit more complicated than the input provided in the original question. The same applies to a complete FASTQ file (i.e. with quality scores): there you'd have to avoid editing the quality header (the + line) and the quality line. Something like what Pierre suggested would work to edit only the second line of each four-line record: sed '2~4 s/^/AAAAAA/' fastq.fq
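For the multi-line FASTA case, one option is to prepend the adapter only to the first sequence line after each header. This is a rough sketch with awk, assuming no blank lines between records; the two input records are made up for illustration.

```shell
adapter=AAAAAA
# Two made-up records; the first one is wrapped over two lines.
out=$(printf '>read1\nGTCTCAGATCGG\nAAGAGCACACGT\n>read2\nCCGGTCCTGG\n' |
  awk -v a="$adapter" '
    /^>/  { print; first = 1; next }        # header: the next line starts a new sequence
    first { print a $0; first = 0; next }   # first sequence line: prepend the adapter
          { print }                         # wrapped continuation lines stay unchanged
  ')
printf '%s\n' "$out"
```

The first line of each record ends up longer than the wrapped lines that follow it. Most parsers tolerate that, but the file may need re-wrapping if a downstream tool expects a fixed line width.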

3.7 years ago
sed '2~2 s/^/AAAAAA/' input.txt
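
The first~step address (2~2: start at line 2, then take every 2nd line) is a GNU sed extension. Where GNU sed is not available, the same line selection could be sketched with awk's NR counter, assuming a single-line FASTA in which every header is followed by exactly one sequence line (the input records here are made up):

```shell
# Even-numbered lines are the sequence lines in a strict header/sequence layout.
out=$(printf '>read1\nGTCTCAGATCGG\n>read2\nCCGGTCCTGG\n' |
  awk -v a=AAAAAA 'NR % 2 == 0 { $0 = a $0 } { print }')
printf '%s\n' "$out"
```

For a four-line FASTQ file, the condition NR % 4 == 2 selects the sequence lines in the same way that 2~4 does in GNU sed.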