Question

Illumina sequence identifiers conversion

9.3 years ago

sumona88 • 0

We have downloaded the genomic sequences of two accessions of the tomato plant from an online database. They are Illumina paired-end .fastq files.

They have the standard Illumina sequence identifiers in the following format:

@FCD12W3ACXX:4:1101:1485:2138#GCGTGGAG/1

How do we go about converting these to the standard gene nomenclature format, such as the following?

Solyc02g081130.1.1

genome • 4.6k views

updated 2.1 years ago by Ram • written 9.3 years ago by sumona88

Ram · Answer 1 · 2014-12-27

Those identifiers do not represent gene names, so there is no conversion to be done. The information in those identifiers describes the sequencing run: the flowcell, lane, tile, cluster coordinates, index sequence and read number.

To work out which reads map to which genes, you will need to align them to the reference genome and then use the gene annotations for that organism (Solanum lycopersicum) to identify which reads overlap which genes. The annotations and the reference genome would be available from here.
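The last step of that workflow, matching aligned reads to annotated genes, comes down to an interval overlap check. Below is a minimal sketch of the idea only. It assumes the alignment has already been done by a separate aligner, and the function name, gene IDs, chromosome names and coordinates are all made-up placeholders rather than real tomato annotations. For real data you would normally use an established read-counting tool instead of hand-written code.

```python
from collections import defaultdict

def assign_reads_to_genes(alignments, genes):
    """Return {read_name: [gene_id, ...]} for every aligned read.

    alignments: iterable of (read_name, chrom, start, end)
    genes:      iterable of (gene_id, chrom, start, end)
    Coordinates are treated as 1-based, inclusive intervals.
    """
    # Group genes by chromosome so each read is only compared
    # against genes on the chromosome it aligned to.
    genes_by_chrom = defaultdict(list)
    for gene_id, chrom, start, end in genes:
        genes_by_chrom[chrom].append((start, end, gene_id))

    hits = {}
    for read_name, chrom, start, end in alignments:
        # Two inclusive intervals overlap when each one starts
        # before the other one ends.
        hits[read_name] = [
            gene_id
            for g_start, g_end, gene_id in genes_by_chrom[chrom]
            if start <= g_end and g_start <= end
        ]
    return hits

# Toy data: placeholder names and coordinates, not real annotations.
toy_genes = [
    ("gene_A", "chr02", 1000, 5000),
    ("gene_B", "chr02", 8000, 9500),
]
toy_alignments = [
    ("read_1", "chr02", 4950, 5050),  # overlaps the end of gene_A
    ("read_2", "chr02", 6000, 6100),  # falls between the two genes
    ("read_3", "chr03", 1200, 1300),  # chromosome with no genes listed
]

for read_name, gene_ids in assign_reads_to_genes(toy_alignments, toy_genes).items():
    print(read_name, gene_ids)
```

Note that the read name plays no part in the assignment. It is only a label carried along with the alignment position, which is why the identifier itself cannot be converted into a gene name.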

The fastq files may well represent whole-genome sequencing of a specific plant or tissue, but the reads are (most likely) just randomly fragmented pieces of that genome. They are not the canonical assembly, which is what you must use as your reference point.

There's an NGS primer slideshow here that will cover the kind of workflows you may need.

Ram · Answer 2 · 2014-12-28

Specifically, that read name contains the following information:

@FCD12W3ACXX:4:1101:1485:2138#GCGTGGAG/1

FCD12W3ACXX - Flowcell ID that was used for the Illumina run
4 - The lane on that flowcell this read came from
1101 - The tile within that lane this read came from
1485 - The x coordinate of the cluster (within the tile) that this read came from
2138 - The y coordinate of the cluster (within the tile) that this read came from
GCGTGGAG - The index sequence for this read
1 - Indicates that this is read one of a read pair
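The field breakdown above can be checked programmatically. Here is a small sketch using a regular expression. The pattern and the helper name `parse_old_style` are my own illustration built from the fields listed above, not an official Illumina parser, and it assumes every read name in the file follows exactly this layout.

```python
import re

# Layout of the read name shown above:
# @<flowcell>:<lane>:<tile>:<x>:<y>#<index>/<read>
OLD_STYLE = re.compile(
    r"^@(?P<flowcell>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<read>[12])$"
)

def parse_old_style(header):
    """Split a read name of the above layout into a dict of fields."""
    match = OLD_STYLE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised read name: {header!r}")
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("lane", "tile", "x", "y", "read"):
        fields[key] = int(fields[key])
    return fields

fields = parse_old_style("@FCD12W3ACXX:4:1101:1485:2138#GCGTGGAG/1")
for name, value in fields.items():
    print(f"{name}: {value}")
```

None of the fields carries any information about the gene or genomic position the read came from.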

A more complete read name specification from Illumina follows: http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm

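Each sequence identifier (the line that precedes the sequence and describes it) needs to be in the following format. Note that this is the newer layout introduced with CASAVA 1.8, so the fields are arranged differently from the older-style read name in the question:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>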
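For completeness, here is a sketch of how an identifier in the format above could be split into its fields. The helper name `parse_casava18` and the example header are illustrative assumptions (the header is a made-up example in that layout, not taken from the files in the question), and the code assumes exactly seven colon-separated fields before the space and four after it.

```python
def parse_casava18(header):
    """Split a '@<instrument>:...:<y-pos> <read>:...:<index sequence>' line."""
    header = header.strip()
    if not header.startswith("@"):
        raise ValueError(f"identifier must start with '@': {header!r}")

    # The space separates the cluster location from the read description.
    location, _, description = header[1:].partition(" ")
    loc_fields = location.split(":")
    desc_fields = description.split(":")
    if len(loc_fields) != 7 or len(desc_fields) != 4:
        raise ValueError(f"unexpected number of fields: {header!r}")

    instrument, run, flowcell, lane, tile, x, y = loc_fields
    read, is_filtered, control, index = desc_fields
    return {
        "instrument": instrument,
        "run_number": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(read),
        # 'Y' means the read was filtered out (did not pass), 'N' otherwise.
        "is_filtered": is_filtered == "Y",
        "control_number": int(control),
        "index": index,
    }

# Illustrative header only.
example = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:ATCACG"
for name, value in parse_casava18(example).items():
    print(f"{name}: {value}")
```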