Question: Guessing The File Format And Information
0
8.4 years ago by
Sequer150
Malaysia
Sequer150 wrote:

Hi everyone,

I have no information about this large sequence file (1.2 GB) or how it was sequenced; I received it as a .txt file. Is it advisable to proceed with the analysis? How should I go from here?

For a start, can anyone identify the format and tell me a little about the background of the following few lines? It would also be great to know how you worked it out, or how I could.

Thanks.

>S111:32:A03SG:1:1:12484:2206
NTCATTTAGATATCTGGCTTACACGCATAGCATTCTCAAGACATGGTACGCTTAATAAGTGATATNAATNTTTCAATTAAANCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>S111:32:A03SG:1:1:13579:2206
NTGTCGAGTCGATGTCTGATGGACGAGCATGAGTGACGCTCTGTTGTTTGTGCAGATTTGGCTGCNTTTNCGTTTTGNGTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>S111:32:A03SG:1:1:12338:2206
NGGTCCAGAAATGTACTTTTGAGGGTTGTTTCAAGACCAACATGACTTTCAAACATATTCTGGAANATTNCTGGGTCATTCNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNNNNNNNNNN
next-gen sequencing • 1.4k views
modified 5.1 years ago by Biostar ♦♦ 20 • written 8.4 years ago by Sequer150
5
8.4 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

Just a guess, but this looks like an Illumina GAIIx single-end run converted from FASTQ to FASTA, based on its 150 bp read length and its identifier string of seven colon-separated fields: http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers
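To illustrate how the identifier can be read, here is a minimal Python sketch that splits a header into the seven fields described on the linked page (instrument, run, flowcell, lane, tile, x, y). The names `IlluminaReadId` and `parse_illumina_id` are made up for this example, and mapping the fields this way assumes the guess about the file's origin is right.

```python
from typing import NamedTuple


class IlluminaReadId(NamedTuple):
    # Field names follow the Illumina identifier layout; the class itself is hypothetical.
    instrument: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int


def parse_illumina_id(header: str) -> IlluminaReadId:
    """Split a seven-field, colon-separated read identifier into named fields."""
    # Drop the leading FASTA/FASTQ marker and anything after the first whitespace.
    name = header.lstrip(">@").split()[0]
    fields = name.split(":")
    if len(fields) != 7:
        raise ValueError(f"expected 7 colon-separated fields, got {len(fields)}")
    instrument, run, flowcell, lane, tile, x, y = fields
    return IlluminaReadId(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y)
    )


# First header from the question.
read_id = parse_illumina_id(">S111:32:A03SG:1:1:12484:2206")
print(read_id)
```

Under that assumption, "A03SG" would be a flowcell ID and the last two numbers would be the x and y coordinates of the cluster on the tile, not a position in a genome.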

modified 8 months ago by RamRS27k • written 8.4 years ago by Casey Bergman18k
1
8.4 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

The format is FASTA (although, technically, the sequence lines should be wrapped; they typically contain 60 characters per line and no more than 80).
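A quick check along these lines can be scripted. The following is a rough Python sketch, not a validator: it guesses the format from the first non-blank line (">" for FASTA, "@" for FASTQ) and reports the longest non-header line. The function name `sniff_sequence_file` and the sample record are invented for illustration.

```python
import io
from typing import TextIO


def sniff_sequence_file(handle: TextIO, max_lines: int = 1000) -> dict:
    """Guess the format from the first non-blank line and measure line lengths."""
    first = ""
    headers = 0
    longest = 0
    for i, raw in enumerate(handle):
        if i >= max_lines:
            break
        line = raw.strip()
        if not line:
            continue
        if not first:
            first = line
        # Header count and line length are only meaningful for FASTA input,
        # because FASTQ quality lines may also start with ">".
        if line.startswith(">"):
            headers += 1
        else:
            longest = max(longest, len(line))
    if first.startswith(">"):
        guess = "fasta"
    elif first.startswith("@"):
        guess = "fastq"
    else:
        guess = "unknown"
    return {"format": guess, "headers_seen": headers, "longest_line": longest}


# Made-up record: a real header from the question with a synthetic 150-base sequence.
sample = io.StringIO(">S111:32:A03SG:1:1:12484:2206\n" + "NTCATTTAGA" * 15 + "\n")
result = sniff_sequence_file(sample)
print(result)
```

For a real file, pass an open handle, for example `with open("reads.txt") as fh: sniff_sequence_file(fh)`, where `reads.txt` is a placeholder name. A longest line above 80 characters suggests the sequences are not wrapped.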

As raygozak said, the header line seems to contain some information in the ID. It looks like each sequence might be part of a longer sequence (since they all begin S111). However, since the content of header lines is completely arbitrary and determined by whoever created the file, it's impossible to say anything definitive.

I suggest, if possible, that you contact whoever generated the file for clarification.

written 8.4 years ago by Neilfws48k
2

Yes, the trailing Ns mean those bases are useless data. You should ask the originator of these files for the FASTQ files from which this was generated. The FASTQ files will have a quality score associated with each base, and will be much more amenable to analysis with NGS tools.
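To gauge how much of each read is lost to trailing N calls (N marks a base that could not be called), a small Python sketch like the one below can help. The helper names and the example read are hypothetical; the read is a shortened, made-up sequence, not one from the file.

```python
def trailing_n_length(seq: str) -> int:
    """Length of the run of N characters at the 3' end of a read."""
    return len(seq) - len(seq.rstrip("Nn"))


def trim_trailing_n(seq: str) -> str:
    """Return the read with its trailing run of N characters removed."""
    return seq.rstrip("Nn")


# Synthetic read: 14 leading characters followed by 10 uncalled bases.
read = "NTCATTTAGATATC" + "N" * 10
n_tail = trailing_n_length(read)
trimmed = trim_trailing_n(read)
print(f"{n_tail} trailing Ns, {len(trimmed)} of {len(read)} characters kept")
```

Trimming only removes the tail; Ns inside the read, and the single N at the start, are left in place and would still need to be handled by downstream tools.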

written 8.4 years ago by Daniel Swan13k

Would the trailing consecutive Ns mean anything useful?

written 8.4 years ago by Sequer150
0
8.4 years ago by
Raygozak1.3k
State College, PA, Penn State
Raygozak1.3k wrote:

From what I can see, this is a raw sequence file, with the identifier on one line and the sequence on the next. You can convert it to a FASTA file by adding the > symbol to the start of each identifier line. The ID line seems to contain annotation information about the sequence, separated by colons.

I'd recommend looking up the strings in the position of "A03SG". It looks like some sort of feature (gene, CDS, mRNA, etc.) ID or name. Then compare the feature information with the last two numbers, since they look like location information.

modified 8.4 years ago • written 8.4 years ago by Raygozak1.3k

The > was clipped by the Biostar editor; I'm not sure why.

written 8.4 years ago by Sequer150

Because ">" at the start of a line is marked up in BioStar as a block quote. I've edited your sequence so that the lines are indented by 4 spaces.

written 8.4 years ago by Neilfws48k

Now that I look at it, it looks like the result of whole-genome shotgun sequencing.

written 8.4 years ago by Raygozak1.3k