Question

Python: extract barcodes and vector sequence from fastq reads, add barcodes to header

1

Entering edit mode

7.6 years ago

tinker • 0

I have paired 4 line fastq reads of the usual form (example):

@SEQ_ID/read1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCATTAACTCACAGTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65CC


@SEQ_ID/read2
TTCTGAACGTTCTCGTAGCTAGCTTATCGCGATAGCTAGACGATCGCGATCGATGCGGCTAG
+
))**5!%%CC65%)''*((((***+))%%CCCC%++)(%.1***-+*''5CCF>>>>C*''5

The first 8 bases of each read is a random barcode (GATTTGGG in @SEQ_ID/read1) while the next 4 bases are a vector constant (GTTC).

I have been fumbling about trying to use Python to:

Remove the barcode and vector piece from the sequence line.
Remove the corresponding first 12 quality letters in line 4
Add the 8 bases to the header.

Hence @SEQ_ID/read1 would become something like:

@SEQ_ID/read1_GATTTGGG
AAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCATTAACTCACAGTT
+
))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65CC

Any ideas ?

next-gen python unix • 5.9k views

ADD COMMENT • link updated 7.6 years ago by chen ★ 2.5k • written 7.6 years ago by tinker • 0

0

Entering edit mode

have a look at Efficiently Iterating Over Fastq Records From Python

ADD REPLY • link 7.6 years ago by Medhat 9.7k

0

Entering edit mode

yeah I did look in that - though it only gave me a vaque glimpse of where to start,

ADD REPLY • link 7.6 years ago by tinker • 0

score 1 · Answer 1 · 2016-10-04

Using python

with open("your_file.fastq", "r") as data_in, open("your_output.fastq", "w") as data_out:
     for line in data_in;
          if line.startswith("@SEQ_ID/read1"):
                 sequence = data_in.next()[12:] # get sequence from the 12 position till end
                 sign = data_in.next()
                 quality = data_in.next()
          else:
                sequence = data_in.next()
                sign = data_in.next()
                quality = data_in.next()
          data_out.write("{}\n{}\n{}\n{}\n\n".format(line, sequence, sign, quality))

using FastqGeneralIterator

with open("your_file.fastq", "rU") as handle,  open("your_output.fastq", "w") as data_out:
    for (title, sequence, quality) in FastqGeneralIterator(handle):
          if title.startswith("@SEQ_ID/read1"):
                data_out.write("{}\n{}\n+\n{}\n\n".format(title, sequence[12:], quality))
          else:
                data_out.write("{}\n{}\n+\n{}\n\n".format(title, sequence, quality))

score 1 · Answer 2 · 2016-10-04

1

Entering edit mode

7.6 years ago

chen ★ 2.5k

Since the barcode is CAGTA, I guess you are following duplex sequencing protocols.

A tool called AfterQC (https://github.com/OpenGene/AfterQC) can automatically extract the barcode, and shift it to the query name. The only thing you need to do, is add "barcode" in the filenames to indicate they are from barcoded sequencing.

For example:
xxx.barcode.R1.fq
xxx.barcode.R2.fq

AfterQC will locate the pattern CAGTA and do the others automatically.

Note that, since the DNA synthesis can have errors, the position of the pattern CAGTA usually varies (+1 or -1 bp). So all scripts, assuming that the pattern is fixed at some position, will not work perfectly.

Hope this will help.

ADD COMMENT • link 7.6 years ago by chen ★ 2.5k

0

Entering edit mode

I didn't know about AfterQC, will check it out. BTW the seqs were totally made up from a wiki page but funny though they matches duplex seq.