Demultiplexing FASTQ reads that contain inline barcode and primer sequences
7.5 years ago
tonja.r ▴ 600

I have paired-end Illumina reads that contain barcode and primer sequences. The barcode and primer sequences themselves are given only in a .txt file. The experiment was as follows: the primer was used for PCR, and then the experiment tag (barcode) and the adapter were attached. So the reads look like this:

barcode_sequence-PCR_primer_sequence-fragment

I want to demultiplex the reads according to the barcode_sequence and then cut off the primer sequence. So far I have tried the following:

QIIME: split_libraries_fastq.py

I do not have the barcode read FASTQ files; I have only the sequences of the barcodes and primers. I constructed the mapping file:

#SampleID   BarcodeSequence LinkerPrimerSequence    Description
1   TCGCAGG AACCTGGTTGATCCTGCCAGT   C4363F2_18.7.   
2   CTCTGCA AACCTGGTTGATCCTGCCAGT   C4363F2_19.7.

Since I have no barcode read file, I need to set --barcode_type not-barcoded. It then showed me an error that I need to specify --sample_ids. As I had only one input file, I gave only one sample ID:

split_libraries_fastq.py  -m mapping.txt  -i Pool1_18S.fastq -o demultiplexed_output/ --barcode_type not-barcoded --sample_ids 1

I get one seqs.fna file in which every read has the following attached to its header:

orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0

Stacks: process_radtags

process_radtags -p /fastq -I -b /mapping_radtags.txt --inline_inline -o /demultiplexed_output

However, it asks me to specify the restriction enzyme used. But I do not have this information.

So, what I need: I have several experiments, each identified by a barcode, and I need to demultiplex them. I cannot just search for the exact barcode in the sequence and say that the read belongs to that experiment. There may be a sequencing error in the barcode, so I need to allow a Hamming (or any other) distance between the real barcode sequence and the sequence in the read. Which program can do this?
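To make the requirement concrete, here is a minimal Python sketch of Hamming-distance barcode assignment. It is not taken from any of the tools mentioned in this thread. The function names, the `max_mismatches` threshold, and the rule of rejecting tied matches are assumptions made for illustration. The barcodes and primer come from the mapping file above.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read_seq, barcodes, max_mismatches=1):
    """Return the sample ID whose barcode best matches the start of the read.

    barcodes: dict mapping sample ID -> barcode sequence.
    Returns None if the best barcode has more than max_mismatches
    mismatches, or if two samples tie for the best match.
    """
    best_id, best_dist, tied = None, None, False
    for sample_id, bc in barcodes.items():
        prefix = read_seq[:len(bc)]
        if len(prefix) < len(bc):
            continue  # read is shorter than the barcode
        d = hamming(prefix, bc)
        if best_dist is None or d < best_dist:
            best_id, best_dist, tied = sample_id, d, False
        elif d == best_dist:
            tied = True
    if best_dist is None or best_dist > max_mismatches or tied:
        return None
    return best_id


# Barcodes and primer from the mapping file in the question.
barcodes = {"1": "TCGCAGG", "2": "CTCTGCA"}
primer = "AACCTGGTTGATCCTGCCAGT"

exact_read = "TCGCAGG" + primer + "ACGT"
one_error_read = "TCGCAGA" + primer + "ACGT"  # last barcode base miscalled
print(assign_barcode(exact_read, barcodes))
print(assign_barcode(one_error_read, barcodes))
```

With short barcodes the safe mismatch threshold depends on how far apart the barcodes are from each other, so it is worth computing the pairwise distances in your own barcode set before choosing `max_mismatches`.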

sequencing next-gen demultiplex • 7.3k views

Not an answer but a small comment.

Ideally, you should not simply use Hamming distance; you should use likelihood. If you have a mismatch at a base with a quality score of 2 and a mismatch at a base with a quality score of 40, the former has a greater likelihood given a certain barcode than the latter. We published a paper about maximum-likelihood demultiplexing: https://grenaud.github.io/deML/

If you want to code a bit, you could modify it to incorporate barcode information into the likelihood computation then set a cutoff on the final likelihood for a sample. Furthermore, the likelihood of sample bleed-ins could be computed effortlessly.
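The idea of weighting mismatches by base quality can be sketched in a few lines of Python. This is only an illustration of the principle, not deML's implementation. It assumes Phred+33 quality encoding, treats a miscalled base as equally likely to be any of the three other bases, and caps the error probability at 0.75. All function names are hypothetical.

```python
import math


def phred_to_error(q):
    """Convert a Phred score to an error probability, capped at 0.75."""
    return min(10 ** (-q / 10), 0.75)


def decode_quals(qual_string, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) to integer scores."""
    return [ord(c) - offset for c in qual_string]


def barcode_log_likelihood(observed, quals, barcode):
    """log10 P(observed bases | true barcode), given per-base Phred scores."""
    ll = 0.0
    for base, q, true_base in zip(observed, quals, barcode):
        e = phred_to_error(q)
        # A match has probability 1 - e; a specific wrong base has e / 3.
        ll += math.log10(1 - e if base == true_base else e / 3)
    return ll


def most_likely_sample(observed, quals, barcodes):
    """Return (sample ID, log-likelihood) for the best-scoring barcode."""
    scored = sorted(
        ((barcode_log_likelihood(observed, quals, bc), sid)
         for sid, bc in barcodes.items()),
        reverse=True,
    )
    return scored[0][1], scored[0][0]


barcodes = {"1": "TCGCAGG", "2": "CTCTGCA"}
observed = "TCGCAGA"  # one mismatch against barcode 1, at the last base

# Same mismatch, once at quality 2 ('#') and once at quality 40 ('I').
ll_low_q = barcode_log_likelihood(observed, decode_quals("IIIIII#"), barcodes["1"])
ll_high_q = barcode_log_likelihood(observed, decode_quals("IIIIIII"), barcodes["1"])
print(ll_low_q, ll_high_q)
```

A cutoff on the best log-likelihood, or on the gap between the best and second-best sample, would then play the role that the mismatch threshold plays in a Hamming-distance approach.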


If your sequencing primer is internal to the barcode, how would you have sequenced the barcode? Which sequencing protocol (on the sequencer) was used?

In addition, don't remove your previous question here, that's not good practice.


I was told that they used the primer for PCR, then attached the barcodes and adapters and sequenced it. What I see in my reads is that the barcode comes before the primer, so it corresponds exactly to what I was told.


PCR primer ≠ sequencing primer. The sequencing primer anneals to the adapter (unless a custom primer was used), so the structure of your library is:

adapter_barcode_PCR-primer_sequence-fragment (and data is barcode_PCR-primer_sequence-fragment)


Yes, exactly. I corrected it.


If your barcodes are at the very beginning of the reads, then why are you having an issue demultiplexing? BBMap may be useful (see the answer "Demultiplexing fastq files with dual barcodes"). There is also sabre.


After demultiplexing with BBMap, you can trim the PCR primer sequence from the reads with the same software:

bbduk.sh in=DATA.FASTQ out=TRIMMED_DATA.FASTQ ftl=LENGTH_OF_PCR_PRIMER
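
For readers who want to see what a fixed-length left trim does to a record, here is a small Python sketch. It is not part of BBMap. The record layout (a 4-tuple of header, sequence, separator, quality) and the function names are assumptions for illustration. The optional check that the read really starts with the primer is an addition that a plain fixed-length trim does not perform.

```python
def trim_left(record, n):
    """Remove the first n bases from a FASTQ record.

    record: (header, sequence, separator, quality) tuple.
    Sequence and quality are trimmed together so they stay the same length.
    """
    header, seq, plus, qual = record
    return (header, seq[n:], plus, qual[n:])


def trim_primer(record, primer, max_mismatches=2):
    """Trim the primer only if the read starts with it (within max_mismatches).

    Returns the trimmed record, or None if the primer is not found at the start.
    """
    seq = record[1]
    if len(seq) < len(primer):
        return None
    mismatches = sum(a != b for a, b in zip(seq, primer))
    if mismatches > max_mismatches:
        return None
    return trim_left(record, len(primer))


primer = "AACCTGGTTGATCCTGCCAGT"  # primer from the mapping file in the question
record = ("@read1", primer + "ACGT", "+", "I" * (len(primer) + 4))
print(trim_primer(record, primer))
```

If the barcode is still present at the start of the demultiplexed reads, the trim length has to cover the barcode as well as the primer.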

I am having an issue because, when the adapter was cut away, the first few bases of the barcode may have been cut off as well.


Then you are out of luck. You may have to go back and find the original dataset.


Can I just search for half of the barcode in the reads, for instance? This is already my original dataset.


If your barcodes were long enough to begin with then you could, otherwise you would not be able to discriminate, even if you found the remaining barcode fragment.

How long were the barcodes and how many were there? You could look to see if you can identify the PCR primer sequence in the data and get the remaining barcode to the left of that. This would likely need either some custom code and/or awk type solution.
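
The primer-anchored approach described above could be sketched in Python roughly as follows. This is only one possible implementation. The function names, the mismatch tolerance, the search window `max_offset`, and the minimum remnant length `min_len` are all assumptions to be tuned to the real data. The barcodes and primer are taken from the mapping file in the question.

```python
def find_primer(read_seq, primer, max_mismatches=2, max_offset=12):
    """Return the leftmost offset where the primer matches, or None.

    Only offsets 0..max_offset are tried, since the primer is expected
    right after a (possibly truncated) barcode.
    """
    for start in range(max_offset + 1):
        window = read_seq[start:start + len(primer)]
        if len(window) < len(primer):
            break
        mismatches = sum(a != b for a, b in zip(window, primer))
        if mismatches <= max_mismatches:
            return start
    return None


def assign_by_remnant(read_seq, barcodes, primer, min_len=4):
    """Assign a read using the barcode bases left of the primer.

    The remnant is compared to the 3' end (suffix) of each barcode.
    Returns a sample ID only if exactly one barcode suffix matches and
    the remnant has at least min_len bases; otherwise returns None.
    """
    start = find_primer(read_seq, primer)
    if start is None or start < min_len:
        return None
    remnant = read_seq[:start]
    hits = []
    for sample_id, bc in barcodes.items():
        k = min(len(remnant), len(bc))
        if remnant[-k:] == bc[-k:]:
            hits.append(sample_id)
    return hits[0] if len(hits) == 1 else None


barcodes = {"1": "TCGCAGG", "2": "CTCTGCA"}
primer = "AACCTGGTTGATCCTGCCAGT"

truncated_read = "CAGG" + primer + "ACGTACGT"  # first 3 barcode bases lost
print(assign_by_remnant(truncated_read, barcodes, primer))
```

This sketch requires an exact match on the remnant. Whether any mismatch can be tolerated depends on how many bases remain and how different the barcode suffixes are from each other.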
