Question

Finding the Illumina index read from a raw FASTQ file

0

8.6 years ago

EVR ▴ 610

Hi,

I would like to know which index was used for a sample. Specifically, I would like to find the Illumina index read from the raw FASTQ file. Is there any way to get that information from the raw FASTQ file?

Kindly guide me. Thanks in advance.

RNA-Seq illumina index-read • 9.7k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.6 years ago by EVR ▴ 610

Ram · Answer 1 · 2015-10-07

4

8.6 years ago

Danielk ▴ 640

It depends on how the raw FASTQ file was generated from the underlying BCL files. The index is sometimes available in the name of the read, and sometimes it's supplied as a separate FASTQ file containing the index read.
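To check which of the two cases applies, something like the sketch below might help. The function names are hypothetical, and it assumes standard 4-line FASTQ records and an index written as plain bases (optionally two indices joined by `+`) after the last colon of the header.

```shell
# Succeeds if the first header line read from stdin ends in ":<bases>"
# (or ":<bases>+<bases>" for dual indices). Assumed header layout only.
header_has_index() {
    head -n 1 | grep -Eq ':[ACGTN]+(\+[ACGTN]+)?$'
}

# For a separate index-read FASTQ on stdin, tally the sequence lines
# (line 2 of every 4-line record), most frequent first.
tally_index_reads() {
    awk 'NR % 4 == 2' | sort | uniq -c | sort -k1,1nr
}
```

For example, `zcat Sample_raw.fastq.gz | header_has_index && echo "index is in the read name"` (the file name here is only a placeholder).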

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.6 years ago by Danielk ▴ 640

0

Hi Daniel,

For example, if I know the index read of a sample, say "XXXPSDE", can I use `grep -c "XXXPSDE" Sample_raw.fastq` to obtain the total number of reads containing this index?

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.6 years ago by EVR ▴ 610

0

Sorry, I don't follow. What is `XXXPSDE`?

ADD REPLY • link 8.6 years ago by Danielk ▴ 640

0

Sorry for the confusion. As an example, say I have Illumina TruSeq adapter index 7, CAGATC. To find how many raw reads have this index, can I use `grep -c "CAGATC" Sample_raw.fastq` to estimate the counts?

ADD REPLY • link 8.6 years ago by EVR ▴ 610

2

Yes, the last field of the identifier is the index.

If you don't know the index sequence and/or the FASTQ contains multiple indices, you can use the following to get counts:

zcat NAME_OF_FASTQ | grep '^@H534' | cut -d : -f 10 | sort | uniq -c | sort -nr > indices.txt
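A variation on the same idea, as a rough sketch: instead of matching the instrument prefix with `grep`, it picks the header by position (line 1 of every 4-line record), so it does not depend on the prefix and is not confused by quality lines that happen to start with `@`. It assumes unwrapped 4-line records and that the index is the text after the last colon; `tally_indices` is a made-up name.

```shell
# Read FASTQ from stdin and count how often each index appears
# in the header lines, most frequent first.
tally_indices() {
    awk 'NR % 4 == 1 { n = split($0, f, ":"); print f[n] }' |
        sort | uniq -c | sort -k1,1nr
}
```

Usage would look like `zcat NAME_OF_FASTQ | tally_indices > indices.txt`, or `tally_indices < Sample_raw.fastq` for an uncompressed file.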

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.6 years ago by harold.smith.tarheel ★ 4.9k

1

If my fastq file is already decompressed then can I use

grep '^@HWI' Sample_raw.fastq | cut -d : -f 10 | sort | uniq -c | sort -nr > indices.txt

to obtain the indices?

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.6 years ago by EVR ▴ 610

2

The example that you posted begins with @H534, not @HWI. Otherwise, the command should work.

ADD REPLY • link 8.6 years ago by harold.smith.tarheel ★ 4.9k

1

Your grep will also match every occurrence of CAGATC within the read sequences themselves (a given 6-mer turns up by chance about once per 4^6 = 4096 nucleotides of sequence). You want to parse only the read identifiers, which typically begin with '@HWI'.
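One way to restrict the count to the identifiers, sketched under the assumption of 4-line FASTQ records with the index after the last colon of the header (`count_index` is a hypothetical helper name):

```shell
# Count reads whose header index equals $1; FASTQ is read from stdin.
# Only header lines (line 1 of each 4-line record) are examined, so
# chance matches inside the read sequences are not counted.
count_index() {
    awk -v idx="$1" '
        NR % 4 == 1 { n = split($0, f, ":"); if (f[n] == idx) c++ }
        END         { print c + 0 }'
}
```

For example, `count_index CAGATC < Sample_raw.fastq` should print a single number.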

ADD REPLY • link 8.6 years ago by harold.smith.tarheel ★ 4.9k

0

Hi Harold,

Thanks for your tip. In that case, if a read header looks like "@H534:291:C6YYCACXX:7:1101:1748:1945 1:N:0:CGATGT", then I have to extract CGATGT, which represents the adapter index used for this sample. Am I right?
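For a single header like the one above, the last colon-separated field can be pulled out with plain shell parameter expansion. A small sketch (`index_of_header` is just an illustrative name):

```shell
# Print everything after the last ':' of the header passed as $1.
index_of_header() {
    printf '%s\n' "${1##*:}"
}

index_of_header '@H534:291:C6YYCACXX:7:1101:1748:1945 1:N:0:CGATGT'
# prints CGATGT
```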

ADD REPLY • link 8.6 years ago by EVR ▴ 610

score 3 · Answer 2 · 2015-10-07

3

8.6 years ago

cpad0112 21k

Did you try the ExtractIlluminaBarcodes tool in Picard?

ADD COMMENT • link 8.6 years ago by cpad0112 21k