Question

help to extract data

5.3 years ago

zion22 ▴ 70

Hi, thanks for reading my message. First of all, I am very new to this wonderful area, and I would like some help with a question. I have two raw data files (R1 and R2) from an Illumina MiSeq run, which contain three samples, each with two pairs of different index sequences.

How can I extract each of my samples separately? What software could I use? P.S.: I only have the list of sequences per sample.

Again, thank you very much for your help.

next-gen • 1.3k views

ADD COMMENT • link 5.3 years ago by zion22 ▴ 70

score 0 · Answer 1 · 2019-01-05

Welcome to Biostars!

I am assuming you received a pair of read files containing non-demultiplexed data. You will need to know which index pairs go together, and you will need to be comfortable with the Unix command line to follow these instructions.

Download BBTools (https://sourceforge.net/projects/bbmap/) and uncompress the archive. You will want to use the "demuxbyname.sh" script included in the download. Usage:

demuxbyname.sh in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,...

"Names" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.

In the output filename, the "%" symbol gets replaced by the index sequence; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. Adjust input file name as necessary.
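As a rough illustration of the naming rules described above, here is a small Python sketch. It is not part of BBTools and does not call it; the helper names (barcode_names, expand_output_name, write_names_file) are hypothetical, and the two index pairs are simply the ones from the example command. It builds the barcode strings, shows how to write them one per line for a names file, and previews the output filenames you would expect to see.

```python
# Index pairs taken from the example command above; replace with your own.
index_pairs = [("GGACTCCT", "GCGATCTA"), ("TAAGGCGA", "TCTACTCT")]


def barcode_names(pairs):
    """Join each index pair with '+', the form expected in the read header."""
    return [f"{first}+{second}" for first, second in pairs]


def expand_output_name(pattern, barcode, read_number):
    """Mimic the described substitution: '%' -> barcode, '#' -> 1 or 2."""
    return pattern.replace("%", barcode).replace("#", str(read_number))


def write_names_file(path, names):
    """Write one barcode per line, suitable for passing as names=<path>."""
    with open(path, "w") as handle:
        for name in names:
            handle.write(name + "\n")


if __name__ == "__main__":
    # Preview the files a pattern like out_%_#.fq should produce.
    for name in barcode_names(index_pairs):
        for read_number in (1, 2):
            print(expand_output_name("out_%_#.fq", name, read_number))
```

Previewing the names this way makes it easier to spot a barcode that was typed in the wrong orientation or order before running the real command.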

You can also name the input and output files explicitly:

 demuxbyname.sh in1=file_R1.fq.gz in2=file_R2.fq.gz out1=out_%_R1.fq.gz out2=out_%_R2.fq.gz prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,...

GenoMax · Answer 2 · 2019-01-08

Hi. Thank you so much for helping me. I ran the command as you suggested:

demuxbyname.sh in1=/info/Samples/cdv/R1_001.fastq.gz in2=/info/Samples/cdv/R2_001.fastq.gz out1=out_%_R1.fq.gz out2=out_%_R2.fq.gz prefixmode=f names=CTCTCTAT+TAAGGCGA,CTCTCTAT+CGTACTAG,CTCTCTAT+AGGCAGAA,TAGATCGC+CGTACTAG,TAGATCGC+AGGCAGAA,CTCTCTAT+CTCTCTAC

and the console's response was:

Set INTERLEAVED to false

Input is being processed as paired

Time: 19.786 seconds.

Reads Processed: 11642756 588.44k reads/sec

Processed Bases: 3033491707 153.32m bases/sec

Reads Out: 0

Bases Out: 0

but the resulting files do not have any data

ADD COMMENT • link 5.3 years ago by zion22 ▴ 70

Hi zion22,

This reply is better suited as a comment on genomax's answer. Could you make the appropriate change please? That would involve the following steps:

Copy the contents of your reply from this answer (you can edit this answer (Ctrl/Cmd + click the link to open it in a new tab) and do a Select All -> Copy there).
Click on Add Comment on genomax's post here: A: help to extract data
Paste the copied text
Click on the green Add Comment button
Click on moderate back in your answer here: A: help to extract data
Choose Delete Post
Click on the blue Submit button.

Thank you!

P.S: Please do not add answers unless you're answering the top level question. Use Add Comment or Add Reply as appropriate.

ADD REPLY • link 5.3 years ago by Ram 43k

but the resulting files do not have any data

That tells me that you are likely not providing the correct index sequence combinations. Can you save the following code in a file named bc.awk:

BEGIN { FS = ":" }

# Header lines are lines 1, 5, 9, ...; colon-separated field 10 holds the index.
(NR % 4) == 1 { barcodes[$10]++ }

END {
    for (bc in barcodes) {
        print bc ": " barcodes[bc]
    }
}

Then run it as follows and show us the result: zcat /info/Samples/cdv/R1_001.fastq.gz | awk -f bc.awk

ADD REPLY • link 5.3 years ago by GenoMax 141k

Sorry, where can I find this file?

ADD REPLY • link 5.3 years ago by zion22 ▴ 70

You need to copy the code from the formatted block above, paste it into a new file on your own server/computer, and save it as plain text under the name bc.awk. Then run zcat /info/Samples/cdv/R1_001.fastq.gz | awk -f bc.awk to get a result. This should list the indexes present in your data file along with the number of reads for each.
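If awk is not convenient, the same tally can be sketched in Python. This is only an approximation of the awk script, under one stated assumption: it takes the text after the last colon of each header line as the index field, which matches field 10 for standard Illumina headers but may differ for other header layouts. The function names are hypothetical.

```python
import gzip
import sys
from collections import Counter


def count_barcodes(lines):
    """Tally the last colon-separated field of every FASTQ header line."""
    counts = Counter()
    for line_number, line in enumerate(lines):
        # Each FASTQ record spans 4 lines; the header is the first of them.
        if line_number % 4 == 0:
            index_field = line.strip().rsplit(":", 1)[-1]
            counts[index_field] += 1
    return counts


def count_barcodes_in_file(path):
    """Open a gzip-compressed FASTQ file as text and tally its headers."""
    with gzip.open(path, "rt") as handle:
        return count_barcodes(handle)


if __name__ == "__main__":
    # Usage (script name is arbitrary): python count_barcodes.py R1_001.fastq.gz
    for barcode, n in count_barcodes_in_file(sys.argv[1]).most_common():
        print(f"{barcode}: {n}")
```

The output is sorted by read count, so the expected index combinations should appear at the top if they are present in the headers at all.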

ADD REPLY • link 5.3 years ago by GenoMax 141k

This was the result: 0: 5821378

ADD REPLY • link 5.3 years ago by zion22 ▴ 70

That is odd. Can you show us the result of: zcat /info/Samples/cdv/R1_001.fastq.gz | head -8 and zcat /info/Samples/cdv/R2_001.fastq.gz | head -8?

ADD REPLY • link 5.3 years ago by GenoMax 141k

Sorry to bother you so much. This was the result. R1_001.fastq.gz:

@M03377:2:000000000-AJN1Y:1:1101:20555:1382 1:N:0:0
TTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTCTTTTTCTCTTTTTTTTTTCTCTTTTCTTCTTTCTTTTCTCTTTCTTCTTCTTTTCCTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTCTTTTTTCTTTCTTTCTTTTTTTTCTTTTTTTTCTTTTTCTTTTTTTTTTTCTTTTTTTTCTTTTCTTTCTTTTTTTCTTTTTTCTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
-@@@@+@@@--;,,,++67+7+8++++77=4,9?,9,,,,,,,,94++++,8,,,,5,5?,,,,,,,,8,,,85,,88,,<,8??,,,3,,,33=++5******111<,,,,,,1**//****/*1/888++3999:++39+2::+<:+++*/*++++0077*+39:+++29:>*)177)1****2*1)0)*****1*0*.6467>1(0*-))((*.)))().-5)(-(((-(((-12,(,(,311(()).),(((-33((--(-((--,(((((-23(,((-41(-(-,((-(((((((
@M03377:2:000000000-AJN1Y:1:1101:8486:1600 1:N:0:0
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTT
+
-6,,-++++++++6+++++++++++++++++++++++++++++**********************/******************/**************************************/***/******************************/****/**/*****)))/))/))/)))/)//)).))))))))()(),)(,).((,((.(,((,,(,(-(,(,(((,(((((-(((,(((-(-((,,((,,((,(,(,(,,((((,((-(((,1,,(,((-(,(-((-((-))

R2_001.fastq.gz:

@M03377:2:000000000-AJN1Y:1:1101:20555:1382 2:N:0:0
CTTTTTTTTTTTCTTTCTTTTTCTTTTTCTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTCTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTCTTTATCTTTTCTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
--8,,,767+++,,66,,,,66,,,,6,,,,,6,+++++7+6,,,,666+74=+74+++4+3+3+333:3*1**,,,,,,****,,,22,**/*2*/11**/***1/*****+2+++++++++0//18*+3::++/*******/)*)))2))))))1)*200**05)..555)))))0.)-0-*-*))).))).-)))(((-,((,(((-,,(-,,,(,--((-((.4.).,(,,((-(--,(-2(,((-())).,,((((-((,()-)),-(((((())(((,((((,,((((((((,(
@M03377:2:000000000-AJN1Y:1:1101:8486:1600 2:N:0:0
CTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
--,,,,++++++++++++++++++++++++++++++++++++++***************************************************************************************************/)*)))*))))))/)))))/))))).))))))))).)()(.,)((((((,(((((((((((()))),()-)))((((,((((((((,((((((,((-((,((((,((((((((((-))))(((((,((((((-(())((((,((((((,((((((((

ADD REPLY • link updated 5.3 years ago by GenoMax 141k • written 5.3 years ago by zion22 ▴ 70

For Illumina reads, even non-demultiplexed data should carry the index sequences in the read headers if standard Illumina multiplexing was used. Here is an example header (for a dual index):

@EAS139:136:FC706VJ:7:1101:4604:1209 1:N:0:TTGCTT+ACTGAC

As you can see your reads are missing this critical bit of information.

That said, the FASTQ entry on Wikipedia says:

Note that more recent versions of Illumina software output a sample number (as taken from the sample sheet) in place of an index sequence. For example, the following header might appear in the first sample of a batch:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1

I have personally not seen this format in read headers. Your read headers also appear to have a 0 in that location.
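To check quickly which of these cases a given header falls into, a small Python sketch like the following could be used. It is a heuristic built on an assumption, not an Illumina specification: the text after the last colon is treated as the index field, a run of A/C/G/T/N (optionally two runs joined by "+") counts as an index sequence, and digits alone count as a number. The function name and labels are hypothetical.

```python
import re

# One index (e.g. TTGCTT) or two joined with '+' (e.g. TTGCTT+ACTGAC).
INDEX_PATTERN = re.compile(r"^[ACGTN]+(\+[ACGTN]+)?$")


def classify_index_field(header):
    """Label the last colon-separated field of a FASTQ header line."""
    field = header.strip().rsplit(":", 1)[-1]
    if INDEX_PATTERN.match(field):
        return "index sequence"
    if field.isdigit():
        return "number"
    return "unrecognized"


if __name__ == "__main__":
    examples = [
        "@M03377:2:000000000-AJN1Y:1:1101:20555:1382 1:N:0:0",
        "@EAS139:136:FC706VJ:7:1101:4604:1209 1:N:0:TTGCTT+ACTGAC",
    ]
    for header in examples:
        print(classify_index_field(header), "<-", header)
```

A header labelled "number" cannot be demultiplexed by index sequence from the header alone, which is consistent with demuxbyname.sh writing no reads for this data.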

Are you sure your data was post-processed correctly? You should double-check with your sequence provider.

ADD REPLY • link 5.3 years ago by GenoMax 141k