Converting Gene Expression Omnibus Format To Fasta
11.5 years ago

I have found a number of reads that I want to align against a genome using Bowtie. They are located here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM113418

The problem is that the data is in the format shown below:

> ID_REF = SEQUENCE
> VALUE = NUMBER OF READS 
> ID_REF    VALUE
> AGGCAGTGTAGTTAGCTGATTGC    197 
> TCCCTGGTCTAGTGGTTAGGATTCGGC    177
> TCACAACAACTGTGTGGAGGTATAGGTGT    149 
> TATTTATTGAGGGCCTACTATGTGCCGGG    125

While Bowtie wants reads in this format:

> @r0/2
> GAATACTGGCGGATTACCGGGGAAGCTGGAGC
> +
> EDCCCBAAAA@@@@?>===<;;9:99987776
> @r1/2
> AATGTGAAAACGCCATCGATGGAACAGGCAAT
> +
> EDCCCBAAAA@@@@?>===<;;9:99987776
> @r2/2
> AACGCGCGTTATCGTGCCGGTCCATTACGCGG
> +
> EDCCCBAAAA@@@@?>===<;;9:99987776

Is there a standard way to convert the first format into the second? Or are these reads supposed to be processed in some other way? Thanks.

bowtie fasta • 1.9k views

Edited for readability. Note that your data format was not displayed correctly in the original post; the lines needed to be indented with 4 spaces.

11.5 years ago

It looks like that cDNA library was sequenced with a 454 platform back in 2006. The raw files are located at the bottom of the page: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5026

They don't contain read quality information, which you would need in order to convert them to the FASTQ format.

I suggest you just take the raw sequences and convert them to a simple FASTA file. Bowtie can read a FASTA file with no quality scores if you pass the -f option. Under this option Bowtie will just assume all bases have a quality of 40.
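As a rough illustration of that conversion, here is a minimal Python sketch. It assumes the input rows look like the `ID_REF`/`VALUE` table in the question (a sequence, whitespace, then a read count) and that any other line is a header to skip. The function name `geo_table_to_fasta`, the `readN_xCOUNT` header naming, and the `expand` switch are made up for this example; whether you want one record per unique sequence or one record per read depends on your analysis.

```python
import re

# A data row is a nucleotide sequence, whitespace, then an integer read count.
ROW = re.compile(r"^([ACGTN]+)\s+(\d+)$", re.IGNORECASE)


def geo_table_to_fasta(lines, expand=False):
    """Yield FASTA lines for each 'SEQUENCE  COUNT' row found in `lines`.

    With expand=False, each unique sequence is written once and the count
    is kept in the header. With expand=True, the sequence is repeated
    COUNT times so the FASTA file has one record per original read.
    """
    n = 0
    for line in lines:
        match = ROW.match(line.strip())
        if match is None:
            continue  # header, comment, or blank line
        n += 1
        seq = match.group(1).upper()
        count = int(match.group(2))
        if expand:
            for copy in range(1, count + 1):
                yield f">read{n}_x{count}_{copy}"
                yield seq
        else:
            yield f">read{n}_x{count}"
            yield seq


# Rows copied from the question, including the header line to be skipped.
sample = [
    "ID_REF\tVALUE",
    "AGGCAGTGTAGTTAGCTGATTGC\t197",
    "TCCCTGGTCTAGTGGTTAGGATTCGGC\t177",
]
print("\n".join(geo_table_to_fasta(sample)))
```

Writing the result to a file such as `reads.fa` (a placeholder name) gives you something you can pass to Bowtie together with the -f option and the basename of your own index.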
