Hi
Thanks for reading my message.
First of all, I am very new to this wonderful area, and I would like someone to help me with a question.
I have two raw data files (R1 and R2) from an Illumina MiSeq run. They contain three samples, each with two different pairs of index sequences.
How can I extract each of my samples separately, and what software could I use to do it?
P.S.: I only have the list of sequences per sample.
I am assuming you received a pair of read files that contain the non-demultiplexed data. You will need to know which index pairs go together, and you will need to be comfortable with the Unix command line in order to use the following instructions.
Download BBTools (https://sourceforge.net/projects/bbmap/) and uncompress the archive. You will want to use "demuxbyname" included in the software you downloaded. Usage:
"Names" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.
In the output filename, the "%" symbol gets replaced by the index sequence; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. Adjust input file name as necessary.
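To make the placeholder rule concrete, here is a minimal Python sketch of the substitution described above. It is not BBTools code, only an illustration of how "%" and "#" expand. The function name expand_name, the file name patterns, and the barcodes are all hypothetical.

```python
from typing import Optional


def expand_name(pattern: str, barcode: Optional[str] = None,
                read: Optional[int] = None) -> str:
    """Replace '%' with the barcode and '#' with the read number (1 or 2)."""
    name = pattern
    if barcode is not None:
        name = name.replace("%", barcode)
    if read is not None:
        if read not in (1, 2):
            raise ValueError("read must be 1 or 2")
        name = name.replace("#", str(read))
    return name


# Hypothetical barcodes and file name patterns, for illustration only.
barcodes = ["GGACTCCT+GCGATCTA", "TAAGGCGA+TCTACTCT"]

# Input pattern: only '#' is used, giving one file per read direction.
inputs = [expand_name("reads_R#.fastq.gz", read=r) for r in (1, 2)]

# Output pattern: one file per barcode and per read direction.
outputs = [expand_name("out_%_#.fastq.gz", bc, r)
           for bc in barcodes for r in (1, 2)]

for name in inputs + outputs:
    print(name)
```

With two barcodes and paired reads this gives four output names, so three samples would be expected to produce six output files plus whatever is used for unmatched reads.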
Hi zion22,
This reply is better suited as a comment on genomax's answer. Could you make the appropriate change please? That would involve the following steps:
Copy the contents of your reply from this answer (you can edit this answer (Ctrl/Cmd + click the link to open it in a new tab) and do a Select All -> Copy there).
Click Add Comment on genomax's post here: A: help to extract data
Paste the contents and click the Add Comment button.
Click moderate back in your answer here: A: help to extract data
Choose Delete Post and click the Submit button.
Thank you!
P.S.: Please do not add answers unless you're answering the top level question. Use Add Comment or Add Reply as appropriate.
That tells me that you are likely not providing the correct index sequence combinations. Can you save the following code in a file bc.awk and then run it like this
zcat /info/Samples/cdv/R1_001.fastq.gz | awk -f bc.awk
and show us the result.
Sorry, where can you find this file?
You need to copy and paste the code in the formatted window above into a new file and save it as text on your own server/computer. Name the file containing the code bc.awk. Then run
zcat /info/Samples/cdv/R1_001.fastq.gz | awk -f bc.awk
to get a result. This should list the indexes present in your data file along with the number of reads for each.
That was the result:
0: 5821378
That is odd. Can you show us the result of:
zcat /info/Samples/cdv/R1_001.fastq.gz | head -8
and
zcat /info/Samples/cdv/R2_001.fastq.gz | head -8
?
Sorry to bother you so much. This was the result. R1_001.fastq.gz:
R2_001.fastq.gz:
In the case of Illumina reads, even non-demultiplexed data should have the index sequences in the read headers, if standard Illumina multiplexing was used. An example header (for a 2-D index).
As you can see, your reads are missing this critical bit of information.
That said, the FASTQ Wikipedia entry says that:
"Note that more recent versions of Illumina software output a sample number (as taken from the sample sheet) in place of an index sequence. For example, the following header might appear in the first sample of a batch:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1"
I have personally not seen this format in read headers. Your read headers also appear to have a 0 in that location.
Are you sure your data was post-processed correctly? You should double-check with your sequence provider.
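For readers who want to run the same check, the per-index read tally discussed in this thread can be sketched in Python. This is not the bc.awk script (its code is not reproduced here), only a rough equivalent. It assumes 4-line FASTQ records and headers in the Illumina style, where the last colon-separated field after the space holds the index. The function names and the GGACTCCT+GCGATCTA index pair are hypothetical.

```python
import gzip
from collections import Counter
from typing import Iterable


def header_index(header: str) -> str:
    """Return the last colon-separated field after the first whitespace.

    For '@... 1:N:0:GGACTCCT+GCGATCTA' this is 'GGACTCCT+GCGATCTA'.
    Returns '' when the header has no second, space-separated part.
    """
    fields = header.split()
    if len(fields) < 2:
        return ""
    return fields[1].rsplit(":", 1)[-1]


def count_indexes(lines: Iterable[str]) -> Counter:
    """Tally the index field of every header line (assumes 4-line records)."""
    counts: Counter = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 0:  # lines 0, 4, 8, ... are the '@' header lines
            counts[header_index(line)] += 1
    return counts


def count_indexes_in_file(path: str) -> Counter:
    """Same tally, reading a gzip-compressed FASTQ file as text."""
    with gzip.open(path, "rt") as handle:
        return count_indexes(handle)


# Two made-up records: one with a (hypothetical) index pair in the header,
# one with a 0 in the index position, as reported in this thread.
example = [
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:0:GGACTCCT+GCGATCTA",
    "ACGT", "+", "IIII",
    "@EAS139:136:FC706VJ:2:2104:15343:197394 1:N:0:0",
    "ACGT", "+", "IIII",
]
tally = count_indexes(example)
for index, n in tally.most_common():
    print(f"{index}: {n}")
```

Calling count_indexes_in_file on an R1 file whose headers all end in 0 would return a single key, "0", with the total read count. That matches the single-line result reported above and would again suggest that the index sequences never made it into the headers.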