Question

demultiplexing with fastq but without barcode read fastq

7.5 years ago

tonja.r ▴ 600

It seems that I am missing something, so I will just describe my problem. I have paired-end Illumina reads in FASTQ format. In a .txt file I have the forward and reverse primer sequences and the tags for each experiment. I will attach an example file. Each read has the following format: tag-primer-fragment. I need to demultiplex the reads by experiment and remove the adapters, primers, and experiment tag sequences. There are two scripts in QIIME that could do that:

split_libraries_fastq.py - but I do not have the barcode read FASTQ files

demultiplex_fasta.py - it operates only on FASTA format, but I do not want to lose the quality information, because later I might want to filter by quality.

Is there any other way I could demultiplex without losing quality information?

next-gen sequencing • 3.1k views

updated 7.5 years ago by charbo24 ▴ 40 • written 7.5 years ago by tonja.r ▴ 600

If the tag were upstream of the sequencing primer, it would not be captured in the reads (unless I am missing something here). Perhaps the primer in your schema is something other than the sequencing primer? Are you able to see the tags at the beginning of the reads?

If the construct is logically correct (and you do have tags visible in the reads) then this thread may help: Count and location of strings in fastq file reads

7.5 years ago by GenoMax 141k

score 2 · Accepted Answer · 2016-10-13

Are you using barcode and tag interchangeably? So you have reads that are:

unknownbarcodesequence-amplificationprimer-fragment

If so, STACKS has a de-multiplexing script that will do what you want, but it needs a list of barcodes. Whoever did your library preps should have that list and what experiment each one belonged to.
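
To make the idea concrete, here is a minimal Python sketch of what demultiplexing on an in-read barcode amounts to. It is not how STACKS is implemented. It assumes the barcode sits at the very start of the read, is followed immediately by an exact (non-degenerate) primer, and that there are no sequencing errors in either. The names `demultiplex`, `to_fastq`, and the sample labels are made up for illustration. The quality string is trimmed by the same amount as the sequence, so no quality information is lost.

```python
def demultiplex(records, barcodes, primer):
    """Split FASTQ records by barcode and trim barcode + primer.

    records  : iterable of (header, sequence, quality) tuples
    barcodes : dict mapping barcode sequence -> sample name (hypothetical names)
    primer   : amplification primer expected right after the barcode
    """
    bins = {}
    for header, seq, qual in records:
        for barcode, sample in barcodes.items():
            prefix = barcode + primer
            if seq.startswith(prefix):
                cut = len(prefix)
                # Trim sequence and quality by the same amount so they stay aligned.
                bins.setdefault(sample, []).append((header, seq[cut:], qual[cut:]))
                break
        else:
            # No barcode + primer matched at the start of the read.
            bins.setdefault("unassigned", []).append((header, seq, qual))
    return bins


def to_fastq(record):
    """Format one (header, sequence, quality) tuple as a 4-line FASTQ record."""
    header, seq, qual = record
    return f"{header}\n{seq}\n+\n{qual}\n"
```

Each list in the returned dict can be written to its own file with `to_fastq`. For paired-end data the mate would have to be routed to the same sample, which this sketch does not do.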

If that metadata is gone forever, you should still be able to recover the list of barcodes with a BASH script:

Pull out everything from ^ (the start of the read) up to the primer sequence in your reads (awk should work for this)
Sort and deduplicate the list ( sort | uniq ; using uniq -c also gives the number of reads per barcode )

That will give you every unique barcode, which will almost certainly be more barcodes than you used, because some will be sequencing errors. Any barcodes with only 1 or 2 reads associated with them are probably just errors and can be discarded. What's left is your likely barcode list.
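
The same recovery can be sketched in Python instead of awk/sort/uniq. This is only a rough illustration under a few assumptions: the primer is matched exactly (no degenerate bases or mismatches), the FASTQ has exactly four lines per record, and `read_fastq`, `candidate_barcodes`, and the `min_reads` cutoff are names I made up. The cutoff of 3 mirrors the "1 or 2 reads are probably errors" rule of thumb.

```python
from collections import Counter


def read_fastq(lines):
    """Yield (header, sequence, quality) from FASTQ lines, 4 lines per record."""
    it = iter(lines)
    # zip over the same iterator four times consumes one record per step.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def candidate_barcodes(records, primer, min_reads=3):
    """Count whatever precedes the primer in each read; keep frequent prefixes."""
    counts = Counter()
    for _header, seq, _qual in records:
        pos = seq.find(primer)
        if pos > 0:  # primer found, with at least one base in front of it
            counts[seq[:pos]] += 1
    # Prefixes seen fewer than min_reads times are treated as sequencing errors.
    return {bc: n for bc, n in counts.items() if n >= min_reads}
```

With a real file this would be called as `candidate_barcodes(read_fastq(open(path)), primer)`, where `path` and `primer` are your own FASTQ file and primer sequence.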

Of course, that won't help you associate the barcodes with your experimental variable, but if I understand your question correctly, that is probably unrecoverable.