Question: How to resolve this 1-based coordinates confusion
6 months ago by
adeena_hassan (40) wrote:

Assalam o alaikum everyone,

I have fetched a CDS sequence from the whole genome sequence of dog, which I downloaded from NCBI. The CDS sequence comes in parts, as shown below, e.g.

$ cat FABP5_CDS


After further processing and concatenating these parts, I found that the CDS does not start with the start codon (ATG). This is due to the 0-based vs. 1-based coordinate systems (BED and my BAM file is 1-based).

I have to add one base at the start of my CDS part, e.g.

Before adding one base:


After adding one base (A), it now starts with ATG:


My question is: should I add one base at the start of each CDS part, or only at the start of the first CDS part? I'm very confused. Any idea how to fix it?

modified 5 months ago by Alex Reynolds (23k) • written 6 months ago by adeena_hassan (40)

What genome build is this from?

According to Ensembl, FABP5 is a pseudogene in dog (CanFam v.3.1) with 3 exons.

modified 5 months ago • written 5 months ago by genomax (44k)

According to the information given in NCBI for the dog genome, FABP5 is not listed as a pseudogene.

written 5 months ago by adeena_hassan (40)

Are you certain those are CDS features? They don't start with canonical start codons, nor do they look like they all have stop codons.

written 6 months ago by jrj.healey (3.7k)

Yes, I'm certain about it. In the above example, all the sequences are parts of a single CDS, and there is a stop codon (TAA) at the end of the last part.

written 6 months ago by adeena_hassan (40)

Oh, it's a single CDS? That's an odd way to depict the sequence. Then, to answer your question, you should only add an A to the first part of the sequence, where the ATG would be.

written 6 months ago by jrj.healey (3.7k)

Yes, it's a single CDS. I fetched these sequences from the whole genome. Actually, I'm confused because of these parts. I have a coordinates file for extracting a whole CDS, like the one below, and this file format is 1-based.

chr29   28363101    28363137    .   .   .
chr29   28491275    28491447    .   .   .
chr29   28491806    28491907    .   .   .
chr29   28492441    28492494    .   .   .

My point is: why add a base only for the first coordinate, and not for all parts?

written 6 months ago by adeena_hassan (40)

Because if, as you say, each sequence is PART of the CDS and not the CDS itself, only the first part should begin with the start codon, since genes start with an ATG. If the 0-based numbering affects the sequences afterwards too, you don't know what base needs adding so you can't just put an A in there. You have an additional problem: if they have all had their first base deleted, you won't know what to replace it with, and if each of your sequences isn't a multiple of 3 in length, there will be frameshifts in it too.
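One rough way to test whether every part is affected is a length check. For a spliced CDS it is the total length across all parts, rather than each part on its own, that is expected to be a multiple of 3. The sketch below sums the lengths of the four intervals posted above under both readings of the file; the function names are made up for illustration.

```python
# The four CDS parts from the coordinate file posted above (start, end).
parts = [
    (28363101, 28363137),
    (28491275, 28491447),
    (28491806, 28491907),
    (28492441, 28492494),
]

def total_length_one_based(intervals):
    """Total length if start/end are 1-based and both inclusive (GFF style)."""
    return sum(end - start + 1 for start, end in intervals)

def total_length_bed(intervals):
    """Total length if the same numbers are read as BED (0-based, half-open)."""
    return sum(end - start for start, end in intervals)

one_based = total_length_one_based(parts)
as_bed = total_length_bed(parts)

print(one_based, one_based % 3)  # 366 0
print(as_bed, as_bed % 3)        # 362 2
```

The two totals differ by 4 bases, one per part, which is consistent with every part losing its first base rather than only the first one. This is only a consistency check, not proof; comparing against the reference sequence is still needed.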

modified 6 months ago • written 6 months ago by jrj.healey (3.7k)

Thank you for the reply.

you don't know what base needs adding so you can't just put an A in there

Actually, knowing which base to add is not the problem, because we can find the correct base by changing the first coordinate, e.g.

The first coordinate is 28491275 -> 28491274, so by reducing it by one we can find the correct base. I have tested it for ATG and it's always A, so I put A there.

But I'm not clear on whether I should add one base to the other parts or not. Do you have any idea how I can test it?

written 6 months ago by adeena_hassan (40)

I'm still not really seeing the problem - my apologies. Maybe I'm being really stupid.

I'm not sure I can really help you, unless you know a priori whether or not the off-by-one error affects them all. It might be easy to fix depending on your dataset.

The last sequence in your example is around 100 kilobases away from the first sequence in the sample, so it seems pretty unlikely to me that they're part of the same CDS. I'm no eukaryote expert, but that seems like a lot even taking into account introns.

Do you have a FASTA file of the whole, uninterrupted sequence we can see, so that we can understand what these sequences represent?

written 6 months ago by jrj.healey (3.7k)

I have generated a consensus FASTA from the BAM file. My BAM file is aligned against CanFam3.1, so I downloaded the CanFam3.1 annotation file from NCBI and used its coordinates for CDS extraction.

written 5 months ago by adeena_hassan (40)

If this is a published genome, why not just download the GFF or GenBank file and extract the CDSs from that as one continuous sequence?

written 5 months ago by jrj.healey (3.7k)

No, this is not a published genome.

written 5 months ago by adeena_hassan (40)

But you said you downloaded it from NCBI?

written 5 months ago by jrj.healey (3.7k)

Oh, sorry for that. The above example is not from a published genome.

But I have also tried it on the dog genome that is available from NCBI, and the published genome shows the same problem.

written 5 months ago by adeena_hassan (40)
5 months ago by
United Kingdom
jrj.healey (3.7k) wrote:

If I understand the problem then I'm not sure you can easily resolve this without more information.

Your off-by-one issue presumably affects all the sequences, but we can't know that for sure unless you can find out unequivocally from the software authors, or you have the whole reference sequence to compare to.

If we assume it does, the first base of the middle sequences can be restored by taking the last base from the sequence preceding it. For the first sequence, however, you can't simply assume it needs an A added. It's likely, but not all genes start with an ATG; there are other possible start codons. It looks like the last sequence still has its stop codon, so the last sequence is probably fine too.

This really doesn't seem like the best way to do this to me. You should really check it against the full sequence though.
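As a rough illustration of the kind of sanity check that could be run on the concatenated sequence before trusting it, here is a minimal sketch. The function name `check_cds` and the default of accepting only ATG as a start codon are assumptions for the example; pass other start codons if your organism or gene uses them.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_cds(seq, start_codons=("ATG",)):
    """Return a list of problems found in a concatenated CDS string."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not seq.startswith(tuple(start_codons)):
        problems.append("does not begin with an expected start codon")
    if seq[-3:] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    # Internal stops, read in frame from the first base, hint at a frameshift.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 3, 3)]
    if any(codon in STOP_CODONS for codon in codons):
        problems.append("in-frame stop codon before the end")
    return problems

print(check_cds("ATGGCTTAA"))  # []
print(check_cds("TGGCTTAA"))   # toy CDS with its first base missing
```

An empty list does not prove the extraction is right; it only means none of these simple checks failed.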

written 5 months ago by jrj.healey (3.7k)

Thank you so much, your answer is helpful.

written 5 months ago by adeena_hassan (40)
5 months ago by
Seattle, WA USA
Alex Reynolds (23k) wrote:

BED and my BAM file is 1-based

BED and BAM are usually 0-based, half-open [start-1, end). I'd start there, as errors there can cause grief downstream.
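To make the difference concrete, here is a small sketch on a made-up toy sequence (not real dog data). The helper names are hypothetical; they only mimic how a 1-based inclusive interval and a BED-style interval map onto string positions.

```python
# Toy "chromosome"; position 1 in 1-based terms is the first character.
genome = "CCCATGGCTTAACCC"

def fetch_one_based(seq, start, end):
    """Fetch a 1-based, inclusive interval (GFF/GenBank style)."""
    return seq[start - 1:end]

def fetch_bed(seq, start, end):
    """Fetch a 0-based, half-open interval (BED style)."""
    return seq[start:end]

# The ORF ATGGCTTAA occupies 1-based positions 4..12.
print(fetch_one_based(genome, 4, 12))  # ATGGCTTAA
print(fetch_bed(genome, 4, 12))        # TGGCTTAA (first base lost)
print(fetch_bed(genome, 3, 12))        # ATGGCTTAA (start shifted by -1)
```

Feeding 1-based coordinates to a tool that expects BED drops the first base of every interval, not just the first one, while the end coordinate stays the same.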

written 5 months ago by Alex Reynolds (23k)

Yes, you are right. I guess the problem is in my file format.

Actually, I took columns 4 and 5 from the GFF3 (annotation) file and made a BED6 file from them, then used bedtools getfasta to get the FASTA sequence.

This is the wrong approach. I should have converted the GFF to BED properly and then used that for sequence fetching. After testing it on multiple genes, I will paste the results here.
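A minimal sketch of what that conversion could look like, assuming tab-separated GFF3 lines with the standard column order. The function name and the example line (including its source, strand, and attribute values) are made up for illustration; only the two coordinates come from the file shown earlier.

```python
def gff3_line_to_bed6(line):
    """Convert one GFF3 feature line to a BED6 line.

    GFF3 columns 4 and 5 are the 1-based, inclusive start and end.
    BED start is 0-based and the end is exclusive, so only the start shifts.
    """
    fields = line.rstrip("\n").split("\t")
    seqid, _source, ftype, start, end, score, strand = fields[:7]
    bed_start = int(start) - 1
    return "\t".join([seqid, str(bed_start), end, ftype, score, strand])

# Hypothetical GFF3 line built around the first interval posted above;
# the "+" strand and the attribute column are assumptions.
example = "chr29\tsource\tCDS\t28363101\t28363137\t.\t+\t0\tID=cds0"
print(gff3_line_to_bed6(example))
```

Only the start is decremented; the end is left alone, which is why copying columns 4 and 5 unchanged into a BED file loses the first base of each feature.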

modified 5 months ago • written 5 months ago by adeena_hassan (40)