Question: Finding the Illumina index read from a raw fastq file
0
5.2 years ago by
EVR570
Earth
EVR570 wrote:

Hi,

I would like to know which index was used for a sample. Specifically, I would like to find the Illumina index read from the raw fastq file. Is there any way to get that information from the raw fastq file alone?

Kindly guide me. Thanks in advance.

index-read rna-seq illumina • 5.0k views
modified 5.2 years ago by cpad011214k • written 5.2 years ago by EVR570
4
5.2 years ago by
Danielk610
Karolinska Institutet, Stockholm, Sweden
Danielk610 wrote:

It depends on how the fastq file was generated from the underlying BCL files. The index is sometimes available in the name of the read, and sometimes it's supplied as a separate fastq file containing the index reads.
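A quick way to see which case applies is to look at the first read identifier. The sketch below is only an illustration: it builds a tiny made-up FASTQ (demo.fastq is a hypothetical name, and the sequence and quality strings are invented) and prints the last colon-separated field of its first header. It assumes the Casava 1.8+ style of identifier, where the index, when recorded, follows the final colon.

```shell
# Build a tiny, made-up FASTQ so the example runs on its own.
tmpdir=$(mktemp -d)
fq="$tmpdir/demo.fastq"
printf '%s\n' \
  '@H534:291:C6YYCACXX:7:1101:1748:1945 1:N:0:CGATGT' \
  'ACGTACGTAC' '+' 'IIIIIIIIII' > "$fq"

# The first line of a FASTQ file is the first read identifier.
first_header=$(head -n 1 "$fq")

# Assumed layout: the index (if recorded) is the text after the last colon.
first_index=$(printf '%s\n' "$first_header" | awk -F: '{ print $NF }')

echo "header: $first_header"
echo "index:  $first_index"
```

If that last field is not a string of A/C/G/T/N (for instance, if it is a number), the index is probably not stored in the read name, and you would need the separate index-read file instead.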

written 5.2 years ago by Danielk610

Hi Daniel,

For example, if I know the index read of a sample, say "XXXPSDE", then can I use

grep -c "XXXPSDE" Sample_raw.fastq to obtain the total number of reads containing this index?

written 5.2 years ago by EVR570

Sorry, I don't follow. What is `XXXPSDE`?

written 5.2 years ago by Danielk610

Sorry for the confusion. For example, say I have Illumina TruSeq adapter index 7, CAGATC. To find how many raw reads have this index, can I use grep -c "CAGATC" Sample_raw.fastq to estimate the count?

written 5.2 years ago by EVR570
1

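Your grep will also count every instance of CAGATC that occurs within the read sequences themselves (a given 6-mer is expected by chance about once per 4^6 = 4096 nucleotides, so roughly 1 per 4000 nucleotides of sequence). You want to search only the read identifiers, which typically begin with '@HWI'.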
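To make the difference concrete, here is a small sketch comparing the two counts on a made-up FASTQ (the file name, read names and sequences are all invented for illustration). Instead of matching on the '@HWI' prefix, it selects identifiers by position, assuming strict 4-line records (no wrapped sequence lines), and requires the index to sit at the end of the identifier.

```shell
# Three invented reads: two carry index CAGATC, one carries ACTTGA.
# Two of the read sequences happen to contain CAGATC as well.
tmpdir=$(mktemp -d)
fq="$tmpdir/demo.fastq"
printf '%s\n' \
  '@HWI-ST1:1:FC1:1:1101:1000:2000 1:N:0:CAGATC' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@HWI-ST1:1:FC1:1:1101:1001:2001 1:N:0:CAGATC' 'TTCAGATCTT' '+' 'IIIIIIIIII' \
  '@HWI-ST1:1:FC1:1:1101:1002:2002 1:N:0:ACTTGA' 'GGCAGATCGG' '+' 'IIIIIIIIII' \
  > "$fq"

# Naive count: every line containing CAGATC, identifiers and sequences alike.
naive=$(grep -c 'CAGATC' "$fq")

# Identifier-only count: line 1 of each 4-line record, index at end of line.
header_only=$(awk 'NR % 4 == 1 && /:CAGATC$/ { n++ } END { print n + 0 }' "$fq")

echo "naive grep count:      $naive"
echo "identifier-only count: $header_only"
```

On this toy file the naive count is inflated by the two sequence lines that contain CAGATC, while the identifier-only count reports just the two reads that actually carry the index.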

modified 5.2 years ago • written 5.2 years ago by harold.smith.tarheel4.6k

Hi Harold,

Thanks for your tip. In that case, if a read identifier looks like "@H534:291:C6YYCACXX:7:1101:1748:1945 1:N:0:CGATGT", then I have to extract CGATGT, which represents the adapter index used for this sample. Am I right?

written 5.2 years ago by EVR570
1

Yes, the last field of the identifier is the index.

If you don't know the index sequence and/or the FASTQ contains multiple indices, you can use the following to get counts:

zcat NAME_OF_FASTQ | grep '^@H534' | cut -d : -f 10 | sort | uniq -c | sort -nr > indices.txt
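
A possible variant, shown here only as a sketch, avoids hard-coding the instrument prefix by taking the first line of every 4-line record and printing its last colon-separated field. It assumes unwrapped 4-line FASTQ records and an index at the end of the identifier; the demo file and its reads are made up. One of the invented quality strings deliberately starts with '@H534', since quality lines may legitimately begin with '@' and a prefix-based grep would pick such a line up.

```shell
# Build a tiny, made-up gzipped FASTQ so the example runs on its own.
tmpdir=$(mktemp -d)
fq="$tmpdir/demo.fastq"
printf '%s\n' \
  '@H534:291:C6YYCACXX:7:1101:1748:1945 1:N:0:CGATGT' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@H534:291:C6YYCACXX:7:1101:1750:1950 1:N:0:CGATGT' 'TTGCATTGCA' '+' '@H534IIIII' \
  '@H534:291:C6YYCACXX:7:1101:1752:1955 1:N:0:TGACCA' 'GGCCTTAAGG' '+' 'IIIIIIIIII' \
  > "$fq"
gzip "$fq"   # replaces demo.fastq with demo.fastq.gz

# Keep line 1 of each record, print the field after the last colon,
# then tally the indices from most to least frequent.
gzip -dc "$fq.gz" \
  | awk -F: 'NR % 4 == 1 { print $NF }' \
  | sort | uniq -c | sort -nr > "$tmpdir/indices.txt"

cat "$tmpdir/indices.txt"
```

Each line of indices.txt holds a count followed by an index sequence, with the most frequent index first.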
modified 12 months ago by _r_am31k • written 5.2 years ago by harold.smith.tarheel4.6k
1

If my fastq file is already decompressed then can I use

grep '^@HWI' Sample_raw.fastq | cut -d : -f 10 | sort | uniq -c | sort -nr > indices.txt

to obtain the indices?

modified 12 months ago by _r_am31k • written 5.2 years ago by EVR570
2

The example that you posted begins with @H534, not @HWI. Otherwise, the command should work.
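
One way to avoid a prefix mismatch like this is to read the prefix from the file itself. The following is a rough sketch on a made-up Sample_raw.fastq created in a temporary directory (the reads are invented); it assumes every identifier in the file shares the text before the first colon, and that the index is the tenth colon-separated field, as in the example identifier above.

```shell
# Build a tiny, made-up uncompressed FASTQ so the example runs on its own.
tmpdir=$(mktemp -d)
fq="$tmpdir/Sample_raw.fastq"
printf '%s\n' \
  '@H534:291:C6YYCACXX:7:1101:1748:1945 1:N:0:CGATGT' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@H534:291:C6YYCACXX:7:1101:1750:1950 1:N:0:CGATGT' 'TTGCATTGCA' '+' 'IIIIIIIIII' \
  '@H534:291:C6YYCACXX:7:1101:1752:1955 1:N:0:TGACCA' 'GGCCTTAAGG' '+' 'IIIIIIIIII' \
  > "$fq"

# Everything before the first colon of the first line, e.g. "@H534".
prefix=$(head -n 1 "$fq" | cut -d : -f 1)
echo "identifier prefix: $prefix"

# Match the prefix at the start of a line, followed by a colon,
# then tally field 10 (the index) as in the command above.
grep "^${prefix}:" "$fq" | cut -d : -f 10 | sort | uniq -c | sort -nr \
  > "$tmpdir/indices.txt"

cat "$tmpdir/indices.txt"
```

Note that the prefix is used as a regular expression here, so a '.' in an instrument name would match any character; for typical identifiers that should be harmless.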

written 5.2 years ago by harold.smith.tarheel4.6k
3
5.2 years ago by
cpad011214k
Hyderabad India
cpad011214k wrote:

Did you try the ExtractIlluminaBarcodes tool in Picard?

written 5.2 years ago by cpad011214k