Question: Demultiplexing Illumina data
dbro970 (30) wrote 11 months ago:

Hi, I'm fairly new to bioinformatics in general but have some Illumina sequence data in the form of two sequence files and a separate barcode file. I’m trying to demultiplex this data, but can't seem to find a way to do it. Can anyone give me some advice?

Read File 1:

@HWI-M01156:58:000000000-A14TV:1:1101:15941:1367 1:N:0:
TACGTAGGGTGCGGGCGTTAATCGGAATAACTGGGCGTAAAGGGCACGCAGGCGGTTATTTAAGTGAGGTGTGAAATCCCCGGGCTTAACCTGGGAATTGCATTTCTGACTGGGTAACTTGAGTACTTTTGGGGGGGGTAGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGGGGAATACCGAAGGCGAAGGCAGCCCCTTGGGATTGTACTGACGCCCTTGTGGGAAAGGGGGGGGGGCAAACG

Read File 2:

@HWI-M01156:58:000000000-A14TV:1:1101:15941:1367 3:N:0:
CCTTTTTTCTCCCCACTCTTTCGCTCTTTTTCTTCTTTTCTTTCCCTTTTTTTTTCCTTCGCCTTCTTTTTTCCTCCTCATCTCTTCGCTTTTCACCGCTNNNCNNNTNNTTCTTCCCCTCTCTTACTTACTCTCTTTTCCCATTCTCCATTTCATTTCCTTGTTTTTCCCCGTTCTTTTCCCTCTTTCCTTTATTTCCCTCCTCCTTCCCCTTTCCCCCCTTTTTTTCCTTTTTTCCTCTCCCCCTTCCT

Barcodes:

@HWI-M01156:58:000000000-A14TV:1:1101:15941:1367 2:N:0:
TTCCGTAGGGTT

Thanks heaps in advance!

demultiplex next-gen • 2.1k views
modified 11 months ago by Gabriel R. (2.6k) • written 11 months ago by dbro970 (30)

See this thread: demultiplex a dataset when you have barcodes as a separate fastq

If it is feasible, ask the sequence provider to demultiplex the data rather than putting the index read in a separate file (this is the standard way of demultiplexing data).

written 11 months ago by genomax (65k)
Gabriel R. (2.6k), Center for Geogenetik, Københavns Universitet, wrote 11 months ago:

There are different programs to do this, but you can try our own deML, which does maximum-likelihood demultiplexing:

  deML -i index.txt -f todemultiplex.fq1.gz -r todemultiplex.fq2.gz -if1 todemultiplex.i1.gz -o demultiplexed_

written 11 months ago by Gabriel R. (2.6k)

Thank you for that. Good solution to keep on file.

What different programs are you referring to for this specific case (where the index sequences are in a separate file)?

written 11 months ago by genomax (65k)

I think Bayexer can do this as well; they compared it to deML in their paper. I am not sure about the input format, but it is another third-party demultiplexer.

written 11 months ago by Gabriel R. (2.6k)

I noticed that your solution requires a file with index sequences. It may be cumbersome if one is dealing with molecular tags etc. Can deML be made to work without the index file?

written 11 months ago by genomax (65k)

I assume you are referring to the files containing the per-cluster index. At present, you would need to dump those into a file or file descriptor, but I could easily modify it by adding an option if someone requests it.

written 11 months ago by Gabriel R. (2.6k)

For example, in this case the OP may or may not have a list of all index sequences. As you said, the index list could be extracted from the I* file, but making the program independent of that requirement may make it friendlier to use.

modified 11 months ago • written 11 months ago by genomax (65k)

I am slightly confused: are you referring to the per-cluster index sequence or the list of all the potential index sequences?

written 11 months ago by Gabriel R. (2.6k)

Potential.

I assume -i index.txt refers to a list of known indexes?

written 11 months ago by genomax (65k)

Correct. What deML does is compute the probability that each sequence/cluster pertains to a given read group, but it needs a list of potential samples with their associated indices.

written 11 months ago by Gabriel R. (2.6k)

Hi, thanks everyone for your answers. Unfortunately these are old data that we got from a collaborator, so I don't think the demultiplexed data are available, and I don't have an index file. I'd be interested in hearing more about how you could extract an index file from the files I've already got, though. Thanks everyone!

written 11 months ago by dbro970 (30)

I mean you could get a frequency of the different sequences in the index file but how will you know what sample they pertain to?

written 11 months ago by Gabriel R. (2.6k)

You can extract unique indexes from the I* file by doing this:

 zcat File_I1_001.fastq.gz | grep -B 1 "+$" --no-group-separator | grep -v "+$" | sort | uniq > index.txt
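
A variant that also reports how often each index occurs can make it easier to tell real barcodes from noise, since genuine barcodes are usually far more frequent than error-containing ones. A minimal sketch, assuming strict 4-line FASTQ records (no wrapped sequence lines); index_counts is a made-up helper name, not part of any tool mentioned in this thread:

```shell
# Hypothetical helper: tally the index sequences in a FASTQ file of index reads.
# Assumes every record is exactly 4 lines (header, sequence, "+", qualities).
index_counts() {
    # Decompress on the fly if the name ends in .gz, otherwise read as-is.
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac |
        awk 'NR % 4 == 2' |
        sort | uniq -c |
        sort -k1,1nr |
        awk '{ print $1 "\t" $2 }'
}
```

Something like `index_counts File_I1_001.fastq.gz > index_counts.txt` would then give one line per distinct index (count, tab, sequence), most frequent first.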

As @Gabriel said, you would need to know the mapping information to make use of the demultiplexed data.

Sample 1 <--> Index 1
Sample 2 <--> Index 2

etc.

written 11 months ago by genomax (65k)

Hi, so I have mapping information; unfortunately it only connects the sample identifier to the relevant metadata and doesn't contain any information such as the barcode/linker sequences. I also only have some of the identifiers in this file: while I have the raw read data from several sequencing runs, only some of the samples on these runs are relevant for my analysis (the remaining identifiers are withheld for data confidentiality reasons).

@genomax, thanks a bunch for the script, I've been running the code like this:

zcat s_1_1_sequence.fastq.tar.gz | grep -B 1 "+$" --no-group-seperator | grep -v "+$" | sort | uniq > index.txt

Unfortunately I keep getting this error:

grep: unrecognized option '--no-group-seperator'
Usage: grep [OPTION]... PATTERN [FILE]...
Try: 'grep--help' for more information

Thanks everyone for all of your help!

modified 11 months ago by genomax (65k) • written 11 months ago by dbro970 (30)

Try this command instead, then. Make sure you are running it on the file containing the index sequences (short reads). That appears to be file 2. Your file also appears to be tarred and gzipped, so you may be best off uncompressing it first. You could pipe these things, but let us do it the dumb way.

tar -zxvf s_1_1_sequence.fastq.tar.gz

cat s_1_1_sequence.fastq | grep -B 1 "+$" | grep -v "+$" | grep -v "\-\-"| sort | uniq > index.txt
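
For the piped route, a minimal sketch is below. It assumes the archive contains only uncompressed FASTQ file(s) with strict 4-line records, and it picks the sequence line by position rather than with grep. The function name unique_indexes_from_tar is made up for illustration:

```shell
# Hypothetical helper: list the unique sequences in a .tar.gz of FASTQ data
# without unpacking anything to disk.
unique_indexes_from_tar() {
    # -x extract, -z gunzip, -O write members to stdout, -f archive name
    tar -xzOf "$1" |
        awk 'NR % 4 == 2' |
        sort -u
}
```

Usage would look like `unique_indexes_from_tar s_1_1_sequence.fastq.tar.gz > index.txt`.
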
modified 11 months ago • written 11 months ago by genomax (65k)

Thanks for all your help genomax, I've tried running that code. I'm curious how you could determine that file 2 is the file which contains the index sequences; I'm not sure myself, so I ran it on all of the files just to check the output. It seems to be isolating the sequences but not then associating them with the relevant sample ID (see below for the first line from each file). Can you think of a reason this might be occurring? Thanks for all of your help with this so far, you're a legend!

Code run on first file (read file 1)

AAAAAAAAAAAACAGGAGAAGGAAAGCGAGGGTATCCTACAAAGTCCAGCGTACCATAAACGCAAGCCTCAACGAAGCGACGAGCAGGAGAGCGGTCAGGAGAAATGCAAACGAGGTGACCCGGCAGAAAATCGAAATAACCGTCGGTTAAATCCAAAACGTCAGAAGCCGTAAAGAGCATAAAAGAGCCGAAAGCGGGCGGGAAACGAACGGGAGGTGCAGAAACTTGTCCCATGTGACCTAACCGACAA

Code run on the second file (read file 2)

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Code run on the barcodes file

AAAAAAAAAAAA
AAAAAAAAAAAC
modified 11 months ago by genomax (65k) • written 11 months ago by dbro970 (30)

In Illumina sequencing the order of reads is as follows (index reads are present only when the sample is single- or dual-indexed):

Read 1 --> Index 1 --> Index 2 --> Read 2

Corresponding fastq files get named as (unless changed):

R1 --> I1 --> I2 --> R2  (or they could also be named R1, R2, R3, R4)
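
One way to check which file holds the index reads is to look at the read-number field in the header and at the length of the first sequence. A minimal sketch, assuming Casava 1.8+ style headers like the ones posted in the question and an uncompressed file; first_record_info is a made-up helper name:

```shell
# Hypothetical helper: report the read number and sequence length of the
# first record in an uncompressed FASTQ file.
# Assumes the header has a second, space-separated field of the form
# read:filter:control:index (e.g. "2:N:0:").
first_record_info() {
    awk 'NR == 1 { split($2, f, ":"); readno = f[1] }
         NR == 2 { print "read=" readno, "length=" length($0); exit }' "$1"
}
```

On the excerpts posted in the question, the barcodes file (header ending in 2:N:0:, 12-base sequence) would give `read=2 length=12`, whereas the file whose header carries 3:N:0: is the second biological read.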

You are going to find many nonsense index sequences (there are always errors and other things one can't explain) that you would need to weed through.

You would then take the indexes you are interested in, put them in the index.txt file (e.g. as below), and then use @Gabriel's tool.

AAAAAAAAAAAA
AAAAAAAAAAAC

As we have already said, you need to have the metadata for the sample-index association, since there is no way to recover that from this analysis.
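
If exact matching is good enough, the split itself can be sketched in awk. This is not what deML does (deML assigns reads by likelihood rather than by exact match); it is a plain exact-match sketch, assuming uncompressed 4-line FASTQ files with reads and barcodes in the same order, and an index.txt with one barcode per line. The function name demux_exact and the output file naming are made up for illustration:

```shell
# Hypothetical helper: split a read file by the barcode found at the same
# position in a separate barcode FASTQ file (exact matches only).
# Arguments: reads.fastq barcodes.fastq index.txt output_prefix
demux_exact() {
    awk -v bcfile="$2" -v prefix="$4" '
        # First input file: the list of wanted barcodes, one per line.
        NR == FNR { if ($1 != "") wanted[$1] = 1; next }

        # Second input file: the reads. At the start of each 4-line record,
        # pull the matching record from the barcode file and pick an output.
        FNR % 4 == 1 {
            if ((getline bh < bcfile) <= 0) {
                print "barcode file is shorter than read file" > "/dev/stderr"
                exit 1
            }
            getline bc < bcfile      # barcode sequence
            getline bplus < bcfile   # "+" separator, unused
            getline bqual < bcfile   # barcode qualities, unused
            split($0, a, " "); split(bh, b, " ")
            if (a[1] != b[1]) {
                print "read names out of sync at " a[1] > "/dev/stderr"
                exit 1
            }
            out = prefix ((bc in wanted) ? bc : "unassigned") ".fastq"
        }

        # Every line of the current record goes to the chosen file.
        { print > out }
    ' "$3" "$1"
}
```

For example, `demux_exact R1.fastq I1.fastq index.txt demultiplexed_` would write one file per listed barcode plus demultiplexed_unassigned.fastq; running it again with the second read file and a different prefix splits the mates the same way. Each barcode keeps an output file open, so this is only practical for a modest number of barcodes.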

modified 11 months ago • written 11 months ago by genomax (65k)