Illumina sequence identifiers conversion
9.3 years ago
sumona88 • 0

We have downloaded the genomic sequences of two tomato accessions from an online database. They are Illumina paired-end .fastq files.

They have the standard Illumina sequence identifiers in the following format:

@FCD12W3ACXX:4:1101:1485:2138#GCGTGGAG/1

How do we go about converting this to the standard gene nomenclature format, for example:

Solyc02g081130.1.1
genome • 4.6k views
9.3 years ago
User 59 13k

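Those identifiers do not represent gene names, so there is no conversion to be done. The information in those identifiers describes the sequencing run (the flowcell, the lane, the position of the cluster on the flowcell and so on), not the biological origin of the read.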

To work out which reads map to which genes, you will need to align them to the reference genome and then use the gene annotations for that organism (Solanum lycopersicum) to identify which reads overlap which genes. Such annotations and the reference genome would be available from here.
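To make the "use the gene annotations" step concrete, here is a toy sketch in Python of what happens after alignment: each read now has a position on the reference, and a read is assigned to a gene when its interval overlaps the gene's annotated interval. The function name, the tuple layout and all coordinates and names in the usage note are made up for illustration; a real analysis would use a proper aligner and the published annotation file rather than hand-written tuples.

```python
from collections import defaultdict


def assign_reads_to_genes(alignments, genes):
    """Group read names by the annotated gene they overlap.

    alignments: iterable of (read_name, chrom, start, end)
    genes:      iterable of (gene_id, chrom, start, end)
    Coordinates are assumed to be 1-based and inclusive.
    """
    # Index the genes by chromosome so each read is only compared
    # against genes on the chromosome it aligned to.
    genes_by_chrom = defaultdict(list)
    for gene_id, chrom, g_start, g_end in genes:
        genes_by_chrom[chrom].append((g_start, g_end, gene_id))

    hits = defaultdict(list)
    for read_name, chrom, r_start, r_end in alignments:
        for g_start, g_end, gene_id in genes_by_chrom.get(chrom, ()):
            # Two closed intervals overlap when each one starts
            # no later than the other one ends.
            if r_start <= g_end and g_start <= r_end:
                hits[gene_id].append(read_name)
    return dict(hits)
```

With two hypothetical genes, geneA on chr1 at 100-500 and geneB on chr1 at 800-1200, a read aligned to chr1 at 450-550 is assigned to geneA, while a read at chr1 600-700 falls between the genes and is assigned to nothing. Note that the gene name only ever comes from the annotation, never from the read identifier.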

The fastq files may well represent whole genome sequencing of a specific plant or tissue, but they are (most likely) just randomly generated fragments of that genome. They do not represent the canonical assembly, which is what you must use as your reference point.

There's an NGS primer slideshow here that will cover the kind of workflows you may need.

9.3 years ago

Specifically, that read name encodes the following information:

@FCD12W3ACXX:4:1101:1485:2138#GCGTGGAG/1
  • FCD12W3ACXX - Flowcell ID that was used for the Illumina run
  • 4 - The lane on that flowcell this read came from
  • 1101 - The tile within that lane this read came from
  • 1485 - The x coordinate of the cluster (within the tile) that this read came from
  • 2138 - The y coordinate of the cluster (within the tile) that this read came from
  • GCGTGGAG - The index sequence for this read
  • 1 - Indicates that this is read one of a read pair
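If it helps to see the breakdown above as code, this is a minimal Python sketch that splits an identifier of this older style into the fields listed. The function name and dictionary keys are my own choices, not anything defined by Illumina, and the pattern only covers the exact layout shown in the question.

```python
import re

# Older-style Illumina read name, as in the question:
#   @FCD12W3ACXX:4:1101:1485:2138#GCGTGGAG/1
OLD_STYLE_ID = re.compile(
    r"^@(?P<flowcell>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<read>[12])$"
)


def parse_old_style_id(line):
    """Return the fields of an older-style identifier line as a dict."""
    match = OLD_STYLE_ID.match(line.strip())
    if match is None:
        raise ValueError(f"not an old-style Illumina identifier: {line!r}")
    fields = match.groupdict()
    # Everything except the flowcell ID and the index sequence is numeric.
    for key in ("lane", "tile", "x", "y", "read"):
        fields[key] = int(fields[key])
    return fields
```

Calling `parse_old_style_id("@FCD12W3ACXX:4:1101:1485:2138#GCGTGGAG/1")` gives flowcell `FCD12W3ACXX`, lane 4, tile 1101, x 1485, y 2138, index `GCGTGGAG` and read 1. None of these fields says anything about which gene the read came from.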

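Illumina's documentation gives a more complete read name specification. Note that it describes the newer (CASAVA 1.8 and later) layout, which orders the fields differently from the older style above: http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm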

Each sequence identifier, the line that precedes the sequence and describes it, needs to be in the following format:

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>
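For the format just quoted, a rough Python sketch of a parser could look like the following. The `ReadId` class and `parse_read_id` function are hypothetical names of mine, and the header used in the usage note is invented (the instrument name and run number in particular are placeholders), so treat this as an illustration of the field layout rather than a validated parser.

```python
from typing import NamedTuple


class ReadId(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool
    control_number: int
    index_sequence: str


def parse_read_id(line):
    """Split a header of the quoted layout into its eleven fields."""
    line = line.strip()
    if not line.startswith("@"):
        raise ValueError(f"identifier line must start with '@': {line!r}")
    # The two halves are separated by a single space.
    left, _, right = line[1:].partition(" ")
    left_fields = left.split(":")
    right_fields = right.split(":")
    if len(left_fields) != 7 or len(right_fields) != 4:
        raise ValueError(f"unexpected number of fields: {line!r}")
    instrument, run, flowcell, lane, tile, x_pos, y_pos = left_fields
    read, filtered, control, index = right_fields
    return ReadId(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x_pos), int(y_pos), int(read),
        filtered == "Y",  # the header carries this flag as Y or N
        int(control), index,
    )
```

For example, `parse_read_id("@INSTR1:12:FCD12W3ACXX:4:1101:1485:2138 1:N:0:GCGTGGAG")` returns a `ReadId` whose `flowcell_id` is `FCD12W3ACXX`, `lane` is 4 and `index_sequence` is `GCGTGGAG`.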