Python: extract barcodes and vector sequence from fastq reads, add barcodes to header
2
1
Entering edit mode
7.6 years ago
tinker • 0

I have paired 4 line fastq reads of the usual form (example):

@SEQ_ID/read1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCATTAACTCACAGTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65CC


@SEQ_ID/read2
TTCTGAACGTTCTCGTAGCTAGCTTATCGCGATAGCTAGACGATCGCGATCGATGCGGCTAG
+
))**5!%%CC65%)''*((((***+))%%CCCC%++)(%.1***-+*''5CCF>>>>C*''5

The first 8 bases of each read is a random barcode (GATTTGGG in @SEQ_ID/read1) while the next 4 bases are a vector constant (GTTC).

I have been fumbling about trying to use Python to:

  • Remove the barcode and vector piece from the sequence line.
  • Remove the corresponding first 12 quality letters in line 4
  • Add the 8 bases to the header.

Hence @SEQ_ID/read1 would become something like:

@SEQ_ID/read1_GATTTGGG
AAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCATTAACTCACAGTT
+
))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65CC

Any ideas ?

next-gen python unix • 5.9k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

yeah I did look in that - though it only gave me a vaque glimpse of where to start,

ADD REPLY
1
Entering edit mode
7.6 years ago
Medhat 9.7k

Using python

with open("your_file.fastq", "r") as data_in, open("your_output.fastq", "w") as data_out:
     for line in data_in;
          if line.startswith("@SEQ_ID/read1"):
                 sequence = data_in.next()[12:] # get sequence from the 12 position till end
                 sign = data_in.next()
                 quality = data_in.next()
          else:
                sequence = data_in.next()
                sign = data_in.next()
                quality = data_in.next()
          data_out.write("{}\n{}\n{}\n{}\n\n".format(line, sequence, sign, quality))

using FastqGeneralIterator

with open("your_file.fastq", "rU") as handle,  open("your_output.fastq", "w") as data_out:
    for (title, sequence, quality) in FastqGeneralIterator(handle):
          if title.startswith("@SEQ_ID/read1"):
                data_out.write("{}\n{}\n+\n{}\n\n".format(title, sequence[12:], quality))
          else:
                data_out.write("{}\n{}\n+\n{}\n\n".format(title, sequence, quality))
ADD COMMENT
1
Entering edit mode
7.6 years ago
chen ★ 2.5k

Since the barcode is CAGTA, I guess you are following duplex sequencing protocols.

A tool called AfterQC (https://github.com/OpenGene/AfterQC) can automatically extract the barcode, and shift it to the query name. The only thing you need to do, is add "barcode" in the filenames to indicate they are from barcoded sequencing.

For example:
xxx.barcode.R1.fq
xxx.barcode.R2.fq

AfterQC will locate the pattern CAGTA and do the others automatically.

Note that, since the DNA synthesis can have errors, the position of the pattern CAGTA usually varies (+1 or -1 bp). So all scripts, assuming that the pattern is fixed at some position, will not work perfectly.

Hope this will help.

ADD COMMENT
0
Entering edit mode

I didn't know about AfterQC, will check it out. BTW the seqs were totally made up from a wiki page but funny though they matches duplex seq.

ADD REPLY
0
Entering edit mode

Ohh, I guess the guy made the wiki knew duplex seq well!

ADD REPLY

Login before adding your answer.

Traffic: 2162 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6