Question: How to resolve this 1-based coordinates confusion
26 days ago by
adeena_hassan40 wrote:

Assalam o alaikum everyone,

I have fetched a CDS sequence from the whole-genome sequence of dog, which was downloaded from NCBI. The CDS sequence comes in parts, as shown below, e.g.:

$ cat FABP5_CDS

>chr29:28363101-28363137
TGACTGTGTCAGTCCAGGTTCTCTGGGGGACTGAGG
>chr29:28491275-28491447
AGTGGGAATGGCTCTGCGAAAGGTGGGTGCAATGGCCAAACCAGATTGTATCATCTCTTCTGACGGCAAAAACCTCACCATAAAA
>chr29:28491806-28491907
CTGTCTGCAACTTCACAGACGGCGCATTGGTTCAACATCAGGAATGGGATGGGAAGGAAAGCACAATAACAAGAAAGTTGGAAGATGGGAAATTGGTGGTG
>chr29:28492441-28492494
AATGCGTCATGAACAATGTCACCTGTACGCGGATCTATGAAAAAGTAGAGTAA

I then processed and concatenated these parts and found that the CDS does not start with a start codon (ATG). This is due to the 0-based versus 1-based coordinate systems (BED and my BAM file is 1-based).

I have to add 1 base at the start of my CDS part, e.g.:

Before adding one base:

>chr29:28363101-28363137
TGACTGTGTCAGTCCAGGTTCTCTGGGGGACTGAGG
>chr29:28491275-28491447
AGTGGGAATGGCTCTGCGAAAGGTGGGTGCAATGGCCAAACCAGATTGTATCATCTCTTCTGACGGCAAAAACCTCACCATAAAA
>chr29:28491806-28491907
CTGTCTGCAACTTCACAGACGGCGCATTGGTTCAACATCAGGAATGGGATGGGAAGGAAAGCACAATAACAAGAAAGTTGGAAGATGGGAAATTGGTGGTG
>chr29:28492441-28492494
AATGCGTCATGAACAATGTCACCTGTACGCGGATCTATGAAAAAGTAGAGTAA

After adding one base (A), it now starts with ATG:

>chr29:28363101-28363137
ATGACTGTGTCAGTCCAGGTTCTCTGGGGGACTGAGG
>chr29:28491275-28491447
AGTGGGAATGGCTCTGCGAAAGGTGGGTGCAATGGCCAAACCAGATTGTATCATCTCTTCTGACGGCAAAAACCTCACCATAAAA
>chr29:28491806-28491907
CTGTCTGCAACTTCACAGACGGCGCATTGGTTCAACATCAGGAATGGGATGGGAAGGAAAGCACAATAACAAGAAAGTTGGAAGATGGGAAATTGGTGGTG
>chr29:28492441-28492494
AATGCGTCATGAACAATGTCACCTGTACGCGGATCTATGAAAAAGTAGAGTAA

My question is: should I add one base at the start of each CDS part, or only at the start of the first CDS part? I'm very confused. Any idea how to fix it?

ADD COMMENTlink modified 24 days ago by Alex Reynolds21k • written 26 days ago by adeena_hassan40

What genome build is this from?

According to Ensembl, FABP5 is a pseudogene in dog (CanFam v.3.1) with 3 exons:

..........ctgggcttgctacagcgctgatcatagaatcctcttcaattccagctgga

ATGCTGTGTCAGGCACTTCACAGATTTGGTCAAAAGCTGGTACGCAGACGTACATTGAAG
CAAGATGTGACCCAGATCAGATATTTGAACACACTGGATTTAGTGACCCTGGGTGTGGAC
CACACAGTGGGTATAGGTGTATATGTCCTGGCTGGGGAGGTGATCAGTAATCAAGCAGGA
CCTTCGATTGTGATCTGCTTTTTGGTGGCTGGCCTAGCCTCGTTGTTGGCTGGGCTGTGC
TATGCAGAGTTGAGTATCCGGATTCCTCATGCTGACTCTGCATATGTCTACACCTATGTC
ACTGTAGGTGAACTTGGTGCTTTTGTCACTGGCTGGAACCTCCTCCTCTTCCTTGTTGCT
GATGGAGTTGTGTTGGGTTGGGTGTGGATGTTAATTTTTGACAACCTGCGTGGGGACCAG
ATATCTGAGACCCTGACTGAGAACATTTCATCATATGTTTCCCGTGTCTTTGAAAAATAT
CTAGGCTTCTTTGTTACGTGTTTTGTATTCTTCCTCACTGATTTCTGGTATCTGTGGGTT
TTTGAGTGTTCCCAGATTTCCAAATGGTTCACATTGGTTAAAGTTTTCTTTCTCAGTTTT
GTCATCATCTCTGGCATCATTAAGGGATCTGCGCAACTGGAAGCTCACAGAAGAGGACTA
CGTGAAGGCTGGACTCAATGACACCTCTAGTTGAGCCCTCTGGGCTCTGGAGGATTCATG
CCTTTTGGCTTCCAGGGGATTTTCCGTGGTGCAGCTACCTGCTTCTATGCTTTTGTTGGT
TTTGACAACATCGTGACCAGAGGTAAAGTAACCCAGAATCCCCAGCATTCTATCCCTATG
GGCATTGTGATTTCACTGTTCATCAGCTCTTTGTTGTATTTTGGTATCTCTGCAGCACTT
ACACTTATGGTGCCTTACTACCAGCTTCGACCTGGTAGCCCCTTGCCTGACGTATTTCTC
CATATTGGCTGGGCTCCTGCCTTCTATGTT                              

gtaacttttggatttttctgttttc..........aaatatgtgcatatgtgctttacag

AATAAAACCTCTGAATTAAAAAAAAAAATGGCCAAACCAGATTGTATCATCTCTTCTGAC
GGCAAAAACCTCACCATAAAAACTGAGAGCACTTTGAAAACAACACAGTTTTCGTGTAAT
CTGGGAGAGAAGTTTGAAGAAACTACAGCTGATGGCAGAAAAACTCAGACTGTCTGCAAC
TTCACAGACGGCGCATGGGTTCAACATCAGGAATGGGATGGGAAGGAAAGCACAATAACA
AGAAAGTTGGAAGATGGGAAATTGGTGGTGGAATGCGTCATGAACAATGTCACCTGTACG
CGGATCTATGAAAAA                                             

gtagagtaaaaattccatcatcatt..........gacctattttggcacattcacccag

GTTGTGGTCATCGTGATCATTTGTGTTATTGCAGCAGTCATGACATTCTTCTTTGGACTC
ACTTATCTTGTGGACCTCAGTGCAATTGGGTCCCTGACACCTCACTCTCTTGATGCTATT
TGTGTACTCATCCTCAAGTATCAGCCTGAGAAGAAGAATGAGTGA               

aatgaagcacaggtactggaggagaatgggcctatggcagagaagctgac..........
ADD REPLYlink modified 24 days ago • written 24 days ago by genomax34k

According to the information given in NCBI for the dog genome, FABP5 is not annotated as a pseudogene.

ADD REPLYlink written 24 days ago by adeena_hassan40

Are you certain those are CDS features? They don't start with canonical start codons, nor do they all appear to have stop codons.

ADD REPLYlink written 26 days ago by jrj.healey2.6k

Yes, I'm certain about it. In the above example, all the sequences are parts of a single CDS, and there is a stop codon (TAA) at the end of the last part.

ADD REPLYlink written 26 days ago by adeena_hassan40

Oh, it's a single CDS? That's an odd way to depict the sequence. Then, to answer your question, you should only add an A to the first part of the sequence, where the ATG would be.

ADD REPLYlink written 26 days ago by jrj.healey2.6k

Yes, it's a single CDS. I fetched these sequences from the whole genome. I'm actually confused because of these parts. I have a coordinates file for extracting a whole CDS, like the one below, and this file format is 1-based.

chr29   28363101    28363137    .   .   .
chr29   28491275    28491447    .   .   .
chr29   28491806    28491907    .   .   .
chr29   28492441    28492494    .   .   .

My point is: why add a base only for the first coordinate, and not for all parts?

ADD REPLYlink written 26 days ago by adeena_hassan40

Because if, as you say, each sequence is PART of the CDS, and not the CDS itself, then only the first part carries the start codon, and genes start with an ATG. If the 0-based numbering affects the sequences afterwards too, you don't know what base needs adding so you can't just put an A in there. You have an additional problem: if they have all had the first base deleted, you won't know what to replace it with, and if the lengths don't come out as a multiple of 3, there will be frameshifts too.
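
One way to check for the frameshift problem described here is to join the parts in order and test the result. The following is only a minimal sketch: `cds_report` is a hypothetical helper name, the stop codons are assumed to be the standard TAA, TAG and TGA, and the sequences are toy data rather than the real FABP5 parts.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def cds_report(parts):
    """Join CDS parts in order and report basic sanity checks."""
    cds = "".join(parts).upper()
    return {
        "length": len(cds),
        "multiple_of_3": len(cds) % 3 == 0,
        "starts_with_ATG": cds.startswith("ATG"),
        "ends_with_stop": cds[-3:] in STOP_CODONS,
    }


# Toy example (not real data): the first part has lost its leading "A".
print(cds_report(["TGGCC", "AAATAA"]))
# Same toy parts with the leading "A" restored.
print(cds_report(["ATGGCC", "AAATAA"]))
```

If the joined sequence fails the multiple-of-3 check after only the first part has been repaired, that hints that other parts are missing bases as well.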

ADD REPLYlink modified 26 days ago • written 26 days ago by jrj.healey2.6k

Thank you for the reply.

you don't know what base needs adding so you can't just put an A in there

Actually, which base to add is not the problem, because we can find the correct base by changing the first coordinate, e.g.:

the first coordinate 28491275 becomes 28491274, so by reducing it by one we can find the correct base. I have tested this for ATG and it is always A, so I put an A there.

But I'm not clear on whether I should add one base to the other parts or not. Do you have any idea how I can test it?
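
One possible test, sketched here under the assumption that the coordinates really are 1-based and inclusive, is to compare the length of each extracted part with the length its interval should have. `missing_bases` is a hypothetical helper name, not part of any tool.

```python
def missing_bases(start, end, seq):
    """Number of bases missing from seq if start..end is 1-based, inclusive."""
    expected = end - start + 1  # both endpoints are counted
    return expected - len(seq)


# First part from the question, as pasted (before the A was added).
part1 = "TGACTGTGTCAGTCCAGGTTCTCTGGGGGACTGAGG"
print(missing_bases(28363101, 28363137, part1))
```

A result of 1 for every part would mean that every part is one base short, not just the first one; a result of 0 would mean that part is complete.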

ADD REPLYlink written 26 days ago by adeena_hassan40

I'm still not really seeing the problem - my apologies. Maybe I'm being really stupid.

I'm not sure I can really help you unless you know a priori whether or not the off-by-one error affects them all. It might be easy to fix, depending on your dataset.

The last sequence in your example is roughly 130 kilobases away from the first sequence in the sample (28363101 to 28492494), so it seems pretty unlikely to me that they're part of the same CDS. I'm no eukaryote expert, but that seems like a lot, even taking introns into account.

Do you have a fasta sequence of the whole, uninterrupted sequence we can see so that we can understand what these sequences represent?

ADD REPLYlink written 26 days ago by jrj.healey2.6k

I generated a consensus FASTA from a BAM file. My BAM file is aligned against CanFam3.1, so I downloaded the CanFam3.1 annotation file from NCBI and used its coordinates for CDS extraction.

ADD REPLYlink written 25 days ago by adeena_hassan40

If this is a published genome, why not just download the GFF or GenBank file and extract the CDSs from that as one continuous sequence?

ADD REPLYlink written 24 days ago by jrj.healey2.6k

No, this is not a published genome.

ADD REPLYlink written 24 days ago by adeena_hassan40

But you said you downloaded it from NCBI?

ADD REPLYlink written 24 days ago by jrj.healey2.6k

Oh, sorry for that. The above example is not from a published genome.

But I have also tried it on the dog genome that is available from NCBI, and the published genome shows the same problem.

ADD REPLYlink written 24 days ago by adeena_hassan40
24 days ago by
jrj.healey2.6k
United Kingdom
jrj.healey2.6k wrote:

If I understand the problem then I'm not sure you can easily resolve this without more information.

Your off-by-one issue presumably affects all the sequences, but we can't know that for sure unless you can find out unequivocally from the software authors, or you have the whole reference sequence to compare against.

If we assume it does, the first base of the middle sequences can be restored by taking the last base from the sequence preceding it. For the first sequence, however, you can't simply assume it needs an A added. It's likely, but not all genes start with an ATG; there are other possible start codons. It looks like the last sequence still has its stop codon, so the last sequence is probably fine too.

This really doesn't seem like the best way to do this to me. You should really check it against the full sequence though.

ADD COMMENTlink written 24 days ago by jrj.healey2.6k

Thank you so much, your answer is helpful.

ADD REPLYlink written 24 days ago by adeena_hassan40
24 days ago by
Alex Reynolds21k
Seattle, WA USA
Alex Reynolds21k wrote:

BED and my BAM file is 1-based

BED and BAM are usually 0-based, half-open [start-1, end). I'd start there, as errors there can cause grief downstream.
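
As an illustration of the [start-1, end) convention, here is a minimal sketch of the conversion in both directions. The function names are hypothetical and the sequence is a toy string; Python slicing is used only because it happens to follow the same 0-based, half-open convention as BED.

```python
def one_based_to_bed(start, end):
    """1-based inclusive (GFF-style) -> 0-based half-open (BED-style)."""
    return start - 1, end


def bed_to_one_based(start, end):
    """0-based half-open (BED-style) -> 1-based inclusive (GFF-style)."""
    return start + 1, end


seq = "ATGACT"  # toy sequence, positions 1..6 in 1-based terms

# Bases 1..3 in 1-based inclusive terms are BED interval [0, 3).
s, e = one_based_to_bed(1, 3)
print(seq[s:e])   # ATG

# Using the 1-based numbers directly as a BED interval drops the first base.
print(seq[1:3])   # TG
```

The second print shows the symptom from the question: 1-based coordinates read as BED give a sequence that is one base short at its start.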

ADD COMMENTlink written 24 days ago by Alex Reynolds21k

Yes, you are right. I guess the problem is in my file format.

Actually, I fetched columns 4 and 5 from the GFF3 (annotation) file and made a BED6 file, then used bedtools getfasta to get the FASTA sequence.

This is the wrong approach. I should convert the GFF to BED first and then use that for sequence fetching. After testing it on multiple genes I will paste the results here.
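
A minimal sketch of such a conversion, assuming standard nine-column, tab-separated GFF3 input: only the start is shifted by one, because GFF3 is 1-based inclusive and BED is 0-based half-open. `gff3_cds_to_bed6` is a hypothetical helper, the name and score columns are filled with placeholders, and the strand and ID in the example line are made up for illustration.

```python
def gff3_cds_to_bed6(line):
    """Convert one GFF3 CDS line to a BED6 line; return None for other lines."""
    if line.startswith("#") or not line.strip():
        return None  # skip comments, directives and blank lines
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9 or cols[2] != "CDS":
        return None  # not a well-formed CDS feature
    chrom, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
    # Shift the start only: 1-based inclusive -> 0-based half-open.
    return "\t".join([chrom, str(start - 1), str(end), ".", "0", strand])


# Hypothetical GFF3 line built from the first interval in this thread;
# the "+" strand and the ID attribute are assumptions.
example = "chr29\t.\tCDS\t28363101\t28363137\t.\t+\t0\tID=cds-example"
print(gff3_cds_to_bed6(example))
```

For the example line this yields a BED start of 28363100 and an unchanged end of 28363137, so the extracted interval covers all 37 bases.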

ADD REPLYlink modified 20 days ago • written 24 days ago by adeena_hassan40