Question: Demultiplex paired-end FASTQ reads with barcode 2 in the identifier line
cb1579 wrote:

I have multiplexed paired-end FASTQ reads with dual barcodes. The issue is that one barcode is in the header and the other is at the beginning of the read. I need a way to demultiplex these data, and both barcodes are required to assign a read to an individual, because each barcode on its own is shared between individuals. There seem to be packages that demultiplex by header barcode or by in-line barcode, but not by both.

example reads:

@700819F:525:HT235BCXX:2:1101:1139:2144 1:N:0:ATCACGAT
CGAATTGCAGATTTTTTCTGAATAAAGCAGTGCAATAAAATTCCCCGCAAAAACACTTNANNNGNNNNNNNNNNNNNNNNNNNNANNNNNNGNTAATAAA
+
GGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII#<###.####################.######<#.<<GGGI
@700819F:525:HT235BCXX:2:1101:1212:2172 1:N:0:ATCACGAT
AAGGATGCAGGGCATCTCCCTCAGGCTGCGCTCTATCGAAGTCATCCCAGAATTAGATTCCGACCACAGACCAGTCTTAGTCAAACTAGGACCCGAGTGT
+
GGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIGIIIIIIGGIIIGIIIIIIIIIIIIIGGIGIGIIIGIGI
@700819F:525:HT235BCXX:2:1101:1110:2173 1:N:0:ATCACGAT
GGTTGTGCAGAAAGAGTTGCTGATAAACTTAGCCATGCAGAACAGAATTATGAGTTAGAAGTATGTATATATATACCAATCACTATATCAACCCATTACC
+
<G.G<GGIIIG.GA.GAGAAGG<AA<A<.<<<GA<.<<.G<.G<<A<GGAA....<G.G..<<.GA..<A.<GG<<<.<..<<.GG.A..G..<<<.<<G

Thanks in advance.

demultiplex sequence fastq • 117 views
modified 12 days ago by Charles Plessy (2.4k) • written 13 days ago by cb1579 (0)

First, use a program that demultiplexes by header. After that, use a program that demultiplexes by inline barcode.

Reply written 13 days ago by h.mon (10k)

I've tried that approach, but was unsuccessful. In order to assign reads to an individual, both barcodes are required. The barcodes alone are not unique to individuals, but in combination, they are and can be used to assign reads to a sample.

Reply written 13 days ago by cb1579 (0)

I fail to see why this approach doesn't work. Suppose you have two barcodes in the header and two barcodes inline, identifying four individuals. First you demultiplex by header; then you demultiplex each of the two resulting FASTQ files separately by inline barcode:

                  [1]                 [2]
original.fastq ___----> header1.fastq ----> header1_inline1.fastq
                  |                   |
                  |                   |_--> header1_inline2.fastq
                  |
                  |_--> header2.fastq ----> header2_inline1.fastq
                                      |
                                      |_--> header2_inline2.fastq

You end up with your four individuals identified.
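
The two steps in the diagram can also be collapsed into a single pass by looking up the pair of barcodes together. The following is a minimal sketch, not a tested tool: the `samples` table, the function names, and the inline barcode length are assumptions for illustration, and it only does exact matching (no mismatches, no trimming of the inline barcode).

```python
from collections import defaultdict
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from 4-line FASTQ records."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def demultiplex(lines, samples, inline_len):
    """Group records by sample using the header barcode and the inline barcode.

    `samples` is a hypothetical lookup table mapping
    (header_barcode, inline_barcode) to a sample name.
    """
    bins = defaultdict(list)
    for header, seq, plus, qual in read_fastq(lines):
        header_bc = header.rsplit(":", 1)[-1]  # step [1]: text after the last ':'
        inline_bc = seq[:inline_len]           # step [2]: first bases of the read
        sample = samples.get((header_bc, inline_bc), "unassigned")
        bins[sample].append((header, seq, plus, qual))
    return dict(bins)
```

To use it, pass an open file handle (or any iterable of lines), a dictionary such as `{("ATCACGAT", "CGAAT"): "ind1"}`, and the inline barcode length. Reads whose barcode pair is not in the table end up under `"unassigned"`.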

Reply written 12 days ago by h.mon (10k)

Looking at the example above, step [1] is already done.

For step [2] you may want to look at fastp (see the post "fastp, the ultra-fast FASTQ preprocessing tool, is now on BioConda").

Reply written 12 days ago by genomax (40k)
Charles Plessy wrote:

How about prepending your first barcode (the one in the header) to the reads, and demultiplexing with virtual barcodes that represent all the combinations of barcodes 1 and 2? Here is one way to paste.

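# Header lines are every 4th line starting at line 1 (grep '^@' could also match quality lines that begin with '@').
paste -d '' <(awk 'NR % 4 == 1' test.fq | sed 's/.*://' | perl -ne 'chomp; print "\n$_\n\n", "I" x length($_), "\n"') test.fq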
@700819F:525:HT235BCXX:2:1101:1139:2144 1:N:0:ATCACGAT
ATCACGATCGAATTGCAGATTTTTTCTGAATAAAGCAGTGCAATAAAATTCCCCGCAAAAACACTTNANNNGNNNNNNNNNNNNNNNNNNNNANNNNNNGNTAATAAA
+
IIIIIIIIGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII#<###.####################.######<#.<<GGGI
etc...

The command is a bit cryptic, but basically it reads as follows: to the original file, paste without delimiter a virtual file made by extracting the barcode sequence from the read names, and for each barcode outputting an empty line, followed by a line containing the barcode, followed by an empty line, followed by a quality line with one "I" per base in the barcode.
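
For readers who find the one-liner hard to follow, the same idea can be sketched in Python. This is only an illustration under a few assumptions: the input is plain 4-line FASTQ, the header barcode is whatever follows the last ':' in the read name, and the function names are made up for this example. The second function builds the "virtual" barcodes by concatenating every header barcode with every inline barcode.

```python
from itertools import islice, product


def prepend_header_barcode(lines, fill_quality="I"):
    """Yield FASTQ lines with the header barcode prepended to each read.

    The prepended bases get a placeholder quality character, one per base,
    so that sequence and quality lines keep the same length.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        header, seq, plus, qual = record
        barcode = header.rsplit(":", 1)[-1]  # text after the last ':'
        yield header
        yield barcode + seq
        yield plus
        yield fill_quality * len(barcode) + qual


def combined_barcodes(header_barcodes, inline_barcodes):
    """Return every header + inline concatenation (the virtual barcodes)."""
    return [h + i for h, i in product(header_barcodes, inline_barcodes)]
```

The output of `prepend_header_barcode` can be written back out line by line, and the list from `combined_barcodes` can then be paired with sample names in whatever barcode file format the downstream demultiplexer expects.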

written 12 days ago by Charles Plessy (2.4k)

Thanks, this is what I was looking for. I am feeding multiplexed RADseq reads into ipyrad to look for SNPs, so this is ideal because it works nicely with the program.

Reply written 12 days ago by cb1579 (0)

Go ahead and accept this answer (green check mark) to provide closure to this thread.

Reply written 12 days ago by genomax (40k)