I have paired-end FASTQ reads in the usual 4-line form, for example:
@SEQ_ID/read1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCATTAACTCACAGTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65CC
@SEQ_ID/read2
TTCTGAACGTTCTCGTAGCTAGCTTATCGCGATAGCTAGACGATCGCGATCGATGCGGCTAG
+
))**5!%%CC65%)''*((((***+))%%CCCC%++)(%.1***-+*''5CCF>>>>C*''5
The first 8 bases of each read are a random barcode (GATTTGGG in @SEQ_ID/read1), while the next 4 bases are a vector constant (GTTC).
I have been fumbling about trying to use Python to:
- Remove the barcode and vector piece from the sequence line.
- Remove the corresponding first 12 quality characters from line 4.
- Add the 8 bases to the header.
Hence @SEQ_ID/read1 would become something like:
@SEQ_ID/read1_GATTTGGG
AAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCATTAACTCACAGTT
+
))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65CC
Any ideas?
Have a look at "Efficiently Iterating Over Fastq Records From Python".
Yeah, I did look at that, though it only gave me a vague glimpse of where to start.
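As a possible starting point, here is a minimal sketch of the three steps listed in the question, using only the standard library. It assumes strict 4-line records (no wrapped sequence lines, no blank lines between records) and that every read carries the 8-base barcode followed by the 4-base vector piece. The names `read_fastq`, `move_barcode` and `process` are made up for illustration and do not come from any library.

```python
import io

BARCODE_LEN = 8                  # random barcode length, per the question
VECTOR_LEN = 4                   # constant vector bases (GTTC in the example)
TRIM = BARCODE_LEN + VECTOR_LEN  # 12 characters removed from sequence and quality


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a strict 4-line FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            # End of file (assumption: a blank line also ends the input).
            return
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        yield header, seq, plus, qual


def move_barcode(header, seq, plus, qual):
    """Append the barcode to the header and trim barcode + vector from the read."""
    barcode = seq[:BARCODE_LEN]
    return f"{header}_{barcode}", seq[TRIM:], plus, qual[TRIM:]


def process(in_handle, out_handle):
    """Rewrite every record from in_handle into out_handle."""
    for record in read_fastq(in_handle):
        out_handle.write("\n".join(move_barcode(*record)) + "\n")


# Demo on the first record from the question.
example = (
    "@SEQ_ID/read1\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCATTAACTCACAGTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65CC\n"
)
out = io.StringIO()
process(io.StringIO(example), out)
print(out.getvalue(), end="")
```

For real data, `process` can be given ordinary file handles, for example `with open("in.fastq") as fin, open("out.fastq", "w") as fout: process(fin, fout)` (the file names here are placeholders). The sketch treats each file independently and does not check that the 4 bases after the barcode really are GTTC, so a check of that kind would have to be added if some reads lack the vector piece.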