Question: Modifying barcode sequence in fq files
4.3 years ago by
AP100
AP100 wrote:

Hello,

I have several .fq files with a 5 bp inline barcode at the beginning of each read, for example (barcodes are marked between *):

@gi|110640213|ref|NC_008253.1|_418_952_1:0:0_1:0:0_0/1
*CCAGG*CAGTGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCATCTGGTAGCGATGAT
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gi|110640213|ref|NC_008253.1|_31_476_0:0:0_0:0:0_1/1
*CAGAT*GGTTGGTGATTTTGGCGGGGGCAGAGAGGACGGTGGCCACCTGCCCCTGCCTGGCATTGCTTTCC
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gi|110640213|ref|NC_008253.1|_210_743_2:0:0_1:1:0_2/1
*CATTA*CCACCACCATCACCATTACCACAGGAAACGGTGCGGGCTGACGCGTACAGGAAACACCGAAAAAA
+
2222222222222222222222222222222222222222222222222222222222222222222222

I would like to replace these barcodes so that every read starts with the same sequence (here, AAAAA):

@gi|110640213|ref|NC_008253.1|_418_952_1:0:0_1:0:0_0/1
*AAAAA*CAGTGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCATCTGGTAGCGATGAT
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gi|110640213|ref|NC_008253.1|_31_476_0:0:0_0:0:0_1/1
*AAAAA*GGTTGGTGATTTTGGCGGGGGCAGAGAGGACGGTGGCCACCTGCCCCTGCCTGGCATTGCTTTCC
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gi|110640213|ref|NC_008253.1|_210_743_2:0:0_1:1:0_2/1
*AAAAA*CCACCACCATCACCATTACCACAGGAAACGGTGCGGGCTGACGCGTACAGGAAACACCGAAAAAA
+
2222222222222222222222222222222222222222222222222222222222222222222222

I want to make sure that only the 5 bases at the beginning of each read are modified, not anything further along the read. The barcode sequence might also occur within a read, and those occurrences must be left untouched.

Do you know of an easy way to do this? Thanks!

fastq barcode • 939 views
modified 4.2 years ago • written 4.3 years ago by AP
4.3 years ago by
Gabriel R.2.7k
Danmarks Tekniske Universitet
Gabriel R.2.7k wrote:

I assume that the * characters are not part of the sequence and are just there to highlight the barcodes :-) In that case, use awk:

zcat [input fastq file] | awk '{if(NR%4==2){print "AAAAA"substr($0,6)}else{print $0}}' | gzip > [output fastq file].gz
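
The same idea can be sketched as a small reusable shell function. This is only an illustration under a few assumptions: the helper name replace_barcode and the file names in the usage comment are hypothetical, every record is assumed to span exactly four lines (no wrapped sequences), and the new barcode is assumed to have the same length as the old one so that the quality line still matches the read. Note that awk's substr is 1-indexed, so substr($0, n + 1) is what drops the first n bases.

```shell
# Hypothetical helper: overwrite the first N bases of every read with a fixed barcode.
# $1 = new barcode, $2 = length of the old barcode (defaults to 5).
# Reads FASTQ on stdin and writes FASTQ to stdout.
replace_barcode() {
    awk -v bc="$1" -v n="${2:-5}" '
        # Line 2 of each 4-line record is the sequence: keep everything after base n.
        NR % 4 == 2 { print bc substr($0, n + 1); next }
        # Header, "+" separator and quality lines pass through unchanged.
        { print }
    '
}

# Example usage (file names are placeholders):
#   zcat reads.fq.gz | replace_barcode AAAAA | gzip > reads.newbarcode.fq.gz
```

Because only the leading n characters of the sequence line are replaced, any copy of the barcode further inside a read is left as it is.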

Works like a charm! Thanks, Gabriel. I had been trying things with awk without success, so this solves my issue. And yes, the * are not part of the sequence :-)

Reply written 4.3 years ago by AP

You are most welcome. Please mark the question as answered if you like :-)

Reply written 4.3 years ago by Gabriel R.
4.2 years ago by
AP100
AP100 wrote:

From Gabriel R:

zcat [input fastq file] | awk '{if(NR%4==2){print "AAAAA"substr($0,6)}else{print $0}}' | gzip > [output fastq file].gz
