Question: interpreting fasta header
Asked 11 months ago by genya35:

Hello, I have a text file with thousands of unique sequences in fasta format. Each read has a header in the following format:

122391_Tcount2352_Acount2352_Bcount0_length293

It's obvious that 'length' represents the length of the read, but the other numbers are not clear. I do not know which tool was used to generate the file, but blastn was used at some point in the pipeline. I'm curious whether anyone here has encountered this header format before and can tell me which part of the sequence header represents the count of reads.

Thanks for your help in advance,

Lena

alignment • 425 views

Hi Lena,

Can you tell us the tool that provided those fasta headers for you? That might help us know what "Tcount", "Acount" and "Bcount" mean.

Thanks!

written 11 months ago by Josh Herr (5.7k)

Identifying possible tools from the header style/format is the whole question...

written 11 months ago by Joe (15k)

Lena,

Take a few of the sequences and run them through blastn or blastx. That may make it clearer which organism you are dealing with. Then check NCBI to see who sequenced it. You may even find articles describing it. Good luck!
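As a rough illustration of pulling out "a few sequences" to paste into BLAST, here is a minimal sketch in plain Python. The function names `read_fasta` and `first_records` are hypothetical, and the demo records are placeholder data (the sequences are made up and do not match the length in the header); only the first header string comes from the question.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def first_records(lines, n=5):
    """Return the first n records as a list (hypothetical helper)."""
    return list(islice(read_fasta(lines), n))


# Placeholder data: the sequences below are invented for the demo.
demo = """>122391_Tcount2352_Acount2352_Bcount0_length293
ACGTACGT
ACGT
>hypothetical_second_record
TTTT
""".splitlines()

for header, seq in first_records(demo, n=2):
    print(f">{header}\n{seq}")
```

For a real file, passing an open file handle instead of `demo` should work the same way, since the reader only iterates over lines.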

modified 11 months ago by RamRS (25k) • written 11 months ago by natasha.sernova (3.7k)

How does this help with the question about the information in the header?

written 11 months ago by ATpoint (26k)

Lena said she had thousands of unique sequences.

If the data is published and the source is known, one option is simply to ask the authors.

It may or may not help, but any additional information is valuable.

written 11 months ago by natasha.sernova (3.7k)

Can you provide a little more background? Where did you get the file? Did a co-worker or collaborator pass it to you? If so, ask them. Did you download it from some site, database, or paper? Then please tell us where from.

My guess is that this is some unpublished internal or personal pipeline, and your only hope of getting a conclusive answer is asking the person who created it.

Just guessing wildly - because guessing is free - I think the first number is the transcript identifier, Tcount (number) is the count of reads for sample T, Acount (number) is the count of reads for sample A, Bcount (number) is the count of reads for sample B, length (number) is the length of the transcript.
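To make that guess easier to check against the whole file, here is a small Python sketch that splits a header into its fields. The field names (`seq_id`, `tcount`, and so on) and the function `parse_header` are my own labels, not anything from a known tool, and the layout is assumed from the single example in the question. In that example Tcount (2352) equals Acount (2352) plus Bcount (0), which would also fit Tcount being a total across two samples; checking whether that holds for every header is one way to test the idea, though it proves nothing about the tool's intent.

```python
import re

# Assumed layout, inferred from one example header only:
#   <id>_Tcount<n>_Acount<n>_Bcount<n>_length<n>
HEADER_RE = re.compile(
    r"^>?(?P<seq_id>[^_\s]+)"
    r"_Tcount(?P<tcount>\d+)"
    r"_Acount(?P<acount>\d+)"
    r"_Bcount(?P<bcount>\d+)"
    r"_length(?P<length>\d+)$"
)


def parse_header(header):
    """Split a header into fields; raise ValueError if it does not match."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unexpected header format: {header!r}")
    fields = match.groupdict()
    return {
        "seq_id": fields["seq_id"],
        "tcount": int(fields["tcount"]),
        "acount": int(fields["acount"]),
        "bcount": int(fields["bcount"]),
        "length": int(fields["length"]),
    }


example = parse_header("122391_Tcount2352_Acount2352_Bcount0_length293")
print(example)

# Does Tcount equal Acount + Bcount for this header?
print(example["tcount"] == example["acount"] + example["bcount"])
```

If the equality fails for some headers, the "total" reading is wrong and the "three separate samples" reading becomes more plausible.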

written 10 months ago by h.mon (28k)