Question: How to demultiplex a fastq.gz file?
eli_bayat wrote:

I am a new postdoc and was given a folder of fastq.gz files. I was told they are not demultiplexed, so I need to extract the reads for each sample from each of these FASTQ files (each file contains reads from multiple subjects), save them as separate FASTQ files, and run the DADA2 pipeline on them to get ASVs. My apologies if I am not using some terms correctly; I am very new to this. I have worked with ASV tables before, but I have never done demultiplexing. If you can tell me how to do it, or which software or platform I can use to separate these samples, I would appreciate your help.

modified 29 days ago • written 29 days ago by eli_bayat

Are the sample barcodes in the indices, or are they internal to the read? Have they been pulled out of the read and moved to the read name? If the usual Illumina indices were used to multiplex, it is far easier to demultiplex as the FASTQs are being generated than to do it after the fact.

written 29 days ago by swbarnes2

This is what the data looks like when I open a FASTQ file in the terminal. There is also a barcode text file with columns for sample ID and barcode pair name.

[image]

MWI006 is a sample ID, and one FASTQ file contains many such IDs with different numbers, which means I need to demultiplex the samples.

modified 29 days ago by ATpoint • written 29 days ago by eli_bayat

That pic doesn't work for me, just copy and paste the text.

written 29 days ago by swbarnes2

Sorry about that, I am pretty new to this forum.

@M01380:62:000000000-B547W:1:1102:20819:1013 1:N:0:MWI006 NGCCTCTT|1|NCTGCATA|1
NGTAGAGTTTGATTCTGGCTCAGGATGAACGCTGACAGAATGCTTAACACATGCAAGTCTACTTGATCCTTCGGGTGATGGTGGCGGACGGGTGAGTAACGCGTAAAGAACTTGCCCTGCAGTCTGGGACAACATTTGGAAACGAATGCTAATACCGGATATTATGCGAACTTCGCATGTAGCTCGTATGAAAGCTATATGCGCTGCAGGATAGCTTTGCGTCCTATTAGCTAGTTGGTGAGGTAACGGATCACCAAGGCCATGATCGGTAGCCGGGCTGAGTGTGTGAACGGCCGCAAGG
+
#8BCCGGGGGGGGGGGGGFGGDFGFFGGFCFGGGDGFF8CEAFGFGGGGEDFGGGGGGFFGGGGGGGGGCFAF7C<+DDFGGGD8@EFFFGGGGFGGGGCCFGGDCGDD?,B?ECG?A<FGDFGGGGGGGFF8FGGFGGG9EFF7BFFFFFFDGCFG7CEFAF@FG,3FGGGG,+FCECGG=CC9:CCFFGGF9>CFFCGGFGGGC*6<@@,9?FC@FG@EC88E?9F?F6>76+>AFC5C5EFAC6C**//02A=EGFEE437>:+1***122)/)/7*)9*:**)01*)87)4),)-1:

@M01380:62:000000000-B547W:1:1102:16288:1015 1:N:0:MWI006 NGCCTCTT|1|NCTGCATA|1
NTACGTAGGGTTCGATCCTGGCTCAGGATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATCATCAAAGATTGCTTTGATGGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCGACAACACTGGGATAGCCTTTCGAAAGAAAGATTAATACCGGATGGCATAATTATTACGCATGGGATAATTATTAAAGAATTTCGGTGGCCGATGGGGGTGCGTTACATTAGGCAGATGGCGGGGGAAAGGCCTACCAAAACAACGACGGATAGGGTGTGTGG
+
#8@ACGG@BEFF87EFFFFF88CFGGFG,EECCF,CF:,,F<FECCFFDFGFGGFDCCEFFFGEGGG:@FCCDF8FFFGFGG8,9@,,?<C<CFGGEFF8FCCEEC7=7FFCG+8+AE<CBEGFEFF:BFFGFC8,,BF7@7CE8B=FAB8,5,,7@FAE**><@,FCCFA@FFCC;,>11*5*>FGFG9,@C9,6=CEGG88+29+3?C+23+49<=9+?BFD8***3==/:=*;**/*1:C**+2+0:+3<C**+76==7*))*2979C**2)2)9)*)*.1>)87:.,9*.,*4).4(

@M01380:62:000000000-B547W:1:1102:15376:1016 1:N:0:MWI005 NGCCTCTT|1|NTAAGGAG|1
NTACGTAGGGTTCGATTCTGGCTCAGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGAAGCGGTTTGTCGGAAGTTTTCGGATGGAAGATAAACTGACTGAGTGGCGGACGGGTGAGTAACGCGTGGGTAAACTGCCTCATACAGGGGGGTAAAAGTTAGAACTTACTGATAATACAGCATAAGACAACAGCACCGAATGGTGCAGGGGTAAAAACACCGGGGGTATGAGATGGAGTCGAGAATGATAAGCAAGTTGGAGGGGTGAGTGCATACCAAAACGACGCTCAGCA
modified 29 days ago by genomax • written 29 days ago by eli_bayat

I looked up what each line means, and I understand them. The only part I don't get is NGCCTCTT|1|NCTGCATA|1 at the end of the first line. Can you help me with what it means?

written 29 days ago by eli_bayat

Those are probably the sequences of the two indices, but why didn't the people who made the FASTQs demultiplex for you? Anyway, you can write a little script in whatever language you like to split out the reads by sample name, since for some reason that's in the read name. If you have a modest number of samples, you can grep for the desired sample names one at a time.
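A minimal sketch of the kind of script suggested above, in Python with only the standard library. It assumes the sample ID is the fourth colon-separated item of the second whitespace-separated header field (as in `1:N:0:MWI006`), that every record is exactly four lines, and that the file has no blank lines between records. The function names `sample_id_from_header` and `demultiplex` are made up for illustration, and the trailing `NGCCTCTT|1|NCTGCATA|1` field is simply ignored.

```python
import gzip
from pathlib import Path


def sample_id_from_header(header: str) -> str:
    """Return the sample ID from a header line such as
    '@M01380:...:1013 1:N:0:MWI006 NGCCTCTT|1|NCTGCATA|1'."""
    # Whitespace-separated fields: read name, 'read:filtered:control:sample', extra.
    fields = header.split()
    if len(fields) < 2:
        raise ValueError(f"unexpected header: {header!r}")
    parts = fields[1].split(":")
    if len(parts) < 4:
        raise ValueError(f"no sample field in header: {header!r}")
    # Assumption: the fourth item holds the sample ID, as in the posted data.
    return parts[3]


def demultiplex(fastq_gz: str, out_dir: str) -> dict:
    """Split one gzipped FASTQ into one gzipped FASTQ per sample ID.

    Returns a dict mapping sample ID to the number of reads written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    handles = {}  # sample ID -> open output file
    counts = {}   # sample ID -> reads written so far
    try:
        with gzip.open(fastq_gz, "rt") as fin:
            while True:
                # Read one 4-line record: header, sequence, '+', qualities.
                record = [fin.readline() for _ in range(4)]
                if not record[0]:
                    break  # end of file
                if not record[3]:
                    raise ValueError("truncated FASTQ record at end of file")
                sample = sample_id_from_header(record[0])
                if sample not in handles:
                    handles[sample] = gzip.open(out / f"{sample}.fastq.gz", "wt")
                    counts[sample] = 0
                handles[sample].writelines(record)
                counts[sample] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

For paired-end data, one option is to run it once on the R1 file and once on the R2 file with separate output directories. Assuming both files carry the same sample ID for each read pair, the per-sample files keep the original read order and so stay paired.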

written 29 days ago by swbarnes2

You typically demultiplex Illumina sequencing data with the program bcl2fastq. As the name implies, it converts the original basecall files (.bcl) from the sequencer directly into demultiplexed .fastq.gz output, guided by a sample sheet in .csv format. Your best bet is to figure out who did the sequencing and get them to demultiplex it; the sequencing facility typically does this automatically. Trying to demultiplex after the fact is kind of a waste of time because it will be much harder and slower.
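For reference, a sketch of what a minimal sample sheet might look like, written from Python. The `[Data]` section with `Sample_ID`, `Sample_Name`, `index`, and `index2` columns follows the layout bcl2fastq v2 reads, as far as I know. The function name `write_sample_sheet` is hypothetical, and the facility's own sheet should take precedence over anything generated this way.

```python
import csv


def write_sample_sheet(samples: dict, path: str) -> None:
    """Write a minimal [Data] section with one row per sample.

    `samples` maps a sample ID to a pair (index1, index2) of index sequences.
    """
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, lineterminator="\n")
        writer.writerow(["[Data]"])
        writer.writerow(["Sample_ID", "Sample_Name", "index", "index2"])
        for sample_id, (index1, index2) in samples.items():
            # Sample_Name is set equal to Sample_ID for simplicity (an assumption).
            writer.writerow([sample_id, sample_id, index1, index2])
```

A call might look like `write_sample_sheet({"MWI005": ("ACGTACGT", "TGCATGCA")}, "SampleSheet.csv")`, where both index sequences are placeholders and not the real barcodes for this sample.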

modified 29 days ago • written 29 days ago by steve

If you want to try doing this manually yourself, you might look at the posts here: "How to subset fastq data based on leading nt of sequences?"

written 29 days ago by steve

That's not what the OP needs. Their indices are not embedded in the read.

written 29 days ago by swbarnes2


Hi eli_bayat,

welcome to Biostars. No need to apologize for being new to the community; we all were at some point. As a piece of advice, please add data and code examples as plain text and highlight them with the code button (10101). That makes it easy for others to copy and paste, e.g. to test code they might suggest to you.

For embedding images, please use the image button (the one to the right of the 10101 button). You have to paste in the full link to the image from the image host, e.g. https://i.ibb.co/HF8PH8T/(...).png, to make sure it is properly embedded. I made the changes in this thread this time. Cheers!

modified 29 days ago • written 29 days ago by ATpoint

Thanks! I appreciate it :)

written 28 days ago by eli_bayat