Question: demultiplexing with fastq having barcode and primer sequence
3.4 years ago by tonja.r (470), UK
tonja.r wrote:

I have paired-end Illumina reads that contain a barcode and a PCR primer sequence. The barcode and primer sequences are only available in a .txt file. The experiment was as follows: the primer was used for PCR, and then the experiment tag (barcode) and the adapter were attached. So the reads have the following structure:

barcode_sequence-PCR_primer_sequence-fragment

I want to demultiplex the reads according to barcode_sequence and then trim off the primer sequence. So far I have tried the following:

QIIME: split_libraries_fastq.py

I do not have barcode read FASTQ files; I only have the sequences of the barcodes and primers. I constructed the mapping file:

#SampleID   BarcodeSequence LinkerPrimerSequence    Description
1   TCGCAGG AACCTGGTTGATCCTGCCAGT   C4363F2_18.7.   
2   CTCTGCA AACCTGGTTGATCCTGCCAGT   C4363F2_19.7.

So I need to set --barcode_type not-barcoded. It then gave me an error saying that I need to specify --sample_ids; as I had only one input file, I gave only one sample ID:

split_libraries_fastq.py  -m mapping.txt  -i Pool1_18S.fastq -o demultiplexed_output/ --barcode_type not-barcoded --sample_ids 1

I get a single seqs.fna file in which every read has the following attached:

orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0

Stacks: process_radtags

process_radtags -p /fastq -I -b /mapping_radtags.txt --inline_inline -o /demultiplexed_output

However, it asks me to specify the restriction enzyme used. But I do not have this information.

So, what I need: I have several experiments, each identified by a barcode, and I need to demultiplex them. I cannot simply search for the exact barcode in the sequence and assign the read to that experiment, because there may be a sequencing error in the barcode. I therefore need to set a maximum Hamming (or other) distance between the true barcode sequence and the sequence in the read. Which program can do this?
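To make the requirement concrete, here is a minimal Python sketch of Hamming-distance demultiplexing. It assumes the barcode sits at the very start of the read and uses the two barcodes from the mapping file above; the function names and the `max_mismatches` parameter are hypothetical, chosen only for illustration, and this is not how any particular tool implements it.

```python
from typing import Dict, Optional

# Barcodes from the mapping file above, keyed by SampleID.
BARCODES: Dict[str, str] = {"1": "TCGCAGG", "2": "CTCTGCA"}


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read: str, barcodes: Dict[str, str],
                  max_mismatches: int = 1) -> Optional[str]:
    """Return the sample whose barcode is uniquely closest to the read start.

    Returns None when no barcode is within max_mismatches, or when two
    barcodes are equally close (ambiguous assignment).
    """
    scored = []
    for sample, barcode in barcodes.items():
        prefix = read[:len(barcode)]
        if len(prefix) < len(barcode):
            continue  # read is shorter than the barcode
        scored.append((hamming(prefix, barcode), sample))
    if not scored:
        return None
    scored.sort()
    best_dist, best_sample = scored[0]
    if best_dist > max_mismatches:
        return None
    if len(scored) > 1 and scored[1][0] == best_dist:
        return None  # tie between two barcodes
    return best_sample
```

The two barcodes here differ at all 7 positions, so a tolerance of one or two mismatches cannot cause a read to be assigned to the wrong sample; with a larger barcode set the safe tolerance depends on the minimum pairwise distance between barcodes.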

modified 3.4 years ago • written 3.4 years ago by tonja.r (470)

Not an answer but a small comment.

Ideally, you should not simply use Hamming distance; you should use likelihood. If you have a mismatch with a quality score of 2 and a mismatch with a quality score of 40, the former is more likely to have come from a given barcode than the latter, because a low-quality mismatch is more plausibly a sequencing error. We published a paper about maximum-likelihood demultiplexing: https://grenaud.github.io/deML/

If you want to code a bit, you could modify it to incorporate barcode information into the likelihood computation then set a cutoff on the final likelihood for a sample. Furthermore, the likelihood of sample bleed-ins could be computed effortlessly.
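The idea of weighting mismatches by base quality can be sketched as follows. This is only an illustration of the general principle, not the deML implementation: it assumes Phred+33 quality encoding, treats positions as independent, and spreads the error probability evenly over the three alternative bases. The function names are hypothetical.

```python
import math


def phred_to_error(qual_char: str, offset: int = 33) -> float:
    """Convert one FASTQ quality character to an error probability."""
    q = ord(qual_char) - offset
    # Cap at 0.75: an uninformative base call is wrong 3 times out of 4.
    return min(10 ** (-q / 10), 0.75)


def barcode_log_likelihood(observed: str, quals: str, barcode: str) -> float:
    """Log-likelihood of the observed bases given that `barcode` is the truth."""
    if not (len(observed) == len(quals) == len(barcode)):
        raise ValueError("observed, quals and barcode must have equal length")
    total = 0.0
    for base, qual_char, true_base in zip(observed, quals, barcode):
        e = phred_to_error(qual_char)
        if base == true_base:
            total += math.log(1 - e)   # base call was correct
        else:
            total += math.log(e / 3)   # one of three possible miscalls
    return total
```

With this scoring, a read would be assigned to the barcode with the highest log-likelihood, and a cutoff on the gap between the best and second-best barcode could be used to reject ambiguous reads.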

written 3.4 years ago by Gabriel R. (2.7k)

If your sequencing primer is internal to the barcode, how would you have sequenced the barcode? Which sequencing protocol (on the sequencer) was used?

In addition, don't remove your previous question here, that's not good practice.

modified 3.4 years ago • written 3.4 years ago by WouterDeCoster (43k)

I was told that they used the primer for PCR, then attached the barcodes and adapters and sequenced it. What I find in my reads is that the barcode comes before the primer, so it corresponds exactly to what I was told.

written 3.4 years ago by tonja.r (470)

PCR primer ≠ sequencing primer. The sequencing primer anneals to the adapter (unless a custom primer was used), so the structure of your library is:

adapter_barcode_PCR-primer_sequence-fragment (and data is barcode_PCR-primer_sequence-fragment)

modified 3.4 years ago • written 3.4 years ago by harold.smith.tarheel (4.5k)

Yes, exactly. I corrected it.

written 3.4 years ago by tonja.r (470)

If your barcodes are at the very beginning of the reads, then why are you having an issue demultiplexing? BBMap may be useful; see the answer "Demultiplexing fastq files with dual barcodes". There is also sabre.

modified 3.4 years ago • written 3.4 years ago by genomax (78k)

After demultiplexing with BBMap, you can trim the PCR primer sequence from the reads with the same software:

bbduk.sh in=DATA.FASTQ out=TRIMMED_DATA.FASTQ ftl=LENGTH_OF_PCR_PRIMER
modified 3.4 years ago • written 3.4 years ago by harold.smith.tarheel (4.5k)

I am having an issue because, when the adapter was trimmed away, the first few bases of the barcode may have been removed as well.

written 3.4 years ago by tonja.r (470)

Then you are out of luck. You may have to go back and find the original dataset.

written 3.4 years ago by genomax (78k)

Can I just search for half of the barcode in the reads, for instance? This is already my original dataset.

written 3.4 years ago by tonja.r (470)

If your barcodes were long enough to begin with then you could, otherwise you would not be able to discriminate, even if you found the remaining barcode fragment.

How long were the barcodes and how many were there? You could look to see if you can identify the PCR primer sequence in the data and take the remaining barcode to the left of it. This would likely need some custom code and/or an awk-type solution.
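The primer-anchored approach suggested here could look roughly like the Python sketch below. It assumes the primer occurs as an exact match (a real dataset would probably need approximate matching to tolerate sequencing errors in the primer), uses the barcodes and primer from the mapping file in the question, and the function name and `min_remnant` parameter are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Values taken from the mapping file in the question.
PRIMER = "AACCTGGTTGATCCTGCCAGT"
BARCODES: Dict[str, str] = {"1": "TCGCAGG", "2": "CTCTGCA"}


def assign_by_primer_anchor(read: str, primer: str, barcodes: Dict[str, str],
                            min_remnant: int = 4) -> Optional[Tuple[str, str]]:
    """Locate the primer, then match the bases left of it to barcode suffixes.

    Returns (sample_id, fragment_after_primer), or None if the primer is not
    found, the barcode remnant is too short, or the remnant does not identify
    exactly one barcode.
    """
    pos = read.find(primer)  # exact match only (assumption)
    if pos < 0:
        return None
    remnant = read[:pos]  # whatever survived of the barcode
    if len(remnant) < min_remnant:
        return None
    hits = [s for s, bc in barcodes.items() if bc.endswith(remnant)]
    if len(hits) != 1:
        return None
    return hits[0], read[pos + len(primer):]
```

Whether a short remnant is enough depends on the barcode set: the `min_remnant` threshold should be chosen so that all barcode suffixes of that length are still distinct, ideally with some distance to spare.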

modified 3.4 years ago • written 3.4 years ago by genomax (78k)