Question: Exon coordinates and sequence
0
gravatar for krushnach80
2.2 years ago by
krushnach80630
krushnach80630 wrote:

I have this problem so i want to take out all the exon coordinates and their respective sequences .One way is I take out the coordinates of exon from a gtf file then give that as input to ucsc genome browser , i get the DNA sequences ,but it looks like i get intronic sequences too.

So any suggestion or help how to do it better way .I m not how to proceed .As always sugesiton or help would highly appreciated

rna-seq • 1.3k views
ADD COMMENTlink modified 2.0 years ago by sacha1.8k • written 2.2 years ago by krushnach80630

Hi where you gtf file come from ? There is few possibility for this error like not the same NM gene/transcript version so the coordinates from gtf files and ucsc are not the same.

ADD REPLYlink written 2.2 years ago by Titus910

i have the gtf file from gencode ...so what do you suggest?

ADD REPLYlink written 2.2 years ago by krushnach80630
1

You can extract coordinates of exon from ucsc (https://genome.ucsc.edu/cgi-bin/hgTables ) to be sure you compare the same ENST

ADD REPLYlink written 2.2 years ago by Titus910

Have you looked at bedtools getfasta? This does what you are looking for if I understood the question correctly.

ADD REPLYlink written 2.2 years ago by WouterDeCoster42k

no i haven;t looked into it so i have this problem which is as such , I want to get the exon coordinates and the respective sequence ,then i want to use that exon sequence for primer designing.So is it possible for the bedtools getFasta to help in that?

ADD REPLYlink written 2.2 years ago by krushnach80630
1

You need a gtf, gff or bed of the exons of interest. That shouldn't be too hard to find. Then bedtools getfasta can be used to slice out a fasta record per exon (or any other interval) which you can then use for primer design.

ADD REPLYlink written 2.2 years ago by WouterDeCoster42k

okay so you saying i need the exon coordinates and using those i can slice fasta sequence from a fasta file ?

ADD REPLYlink written 2.2 years ago by krushnach80630
1

Yes, you use the reference genome fasta and slice out the fragments of interest from the bed/gtf.

ADD REPLYlink written 2.2 years ago by WouterDeCoster42k

thank you will try your suggestion

ADD REPLYlink written 2.2 years ago by krushnach80630

I have extracted the field containing chr, start, end , +/-, and the column containing exon/transcript/cds . I have a made field , now i want to extract only those rows which contain exon only , I did it R its straight forward but i'm not sure if I can retain the bed format in R or not so how do i parse only the exon only ?from the bed file...

ADD REPLYlink written 2.2 years ago by krushnach80630

its simple awk so i did it anyways thank you for your input im using that only

ADD REPLYlink written 2.2 years ago by krushnach80630
3
gravatar for sacha
2.0 years ago by
sacha1.8k
France
sacha1.8k wrote:

I did it like that:

1- Download refGene.txt.gz and hg19.fasta from the UCSC goldenpath. ( note: convert hg19.2bit to hg19.fa using twoBitToFa )
2- Create a bed file with exon coordiniate using my awk script

 // to_transcript.awk 

BEGIN 
{
 OFS ="\t"
}
{
    name=$2
    name2=$13
    sens = $4 =="+" ? "forward" : "reverse"
    chrom=$3

    split($10,exon_starts,",")
    split($11,exon_ends,",")


    for (i=1; i<length(exon_starts); i++)
    {
        start = exon_starts[i];
        end   = exon_ends[i];
        print(chrom, start, end, name2,sens, name)
    }

}

And run it :

  zcat refGene.txt.gz|awk -f to_transcript.awk > exon.bed

3 - Extract sequence using bedtools

  bedtools getfasta -fi hg19.fa -bed exon.bed -fo exon.fa -name

4 - You may want to reverse sequence for gene placed on the reverse strand. Split exon.bed into and exon_minus.bed and exon_plus.bed using grep on column 5th. And do same as I described above to have exon_minus.fa and exon_plus.fa You can then revert your sequence using :

   seqtk seq -r exon_minus.fasta> exon_minus.rev.fasta

and concat both file :

   cat exon_plus.fasta exon_minus.rev.fasta > all_exons.fa
ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by sacha1.8k

will try it thank you for making the script and let you know

ADD REPLYlink written 2.0 years ago by krushnach80630

@sacha

name=$2
    name2=$13
    sens = $4 =="+" ? "forward" : "reverse"
    chrom=$3
    exon_count=$9
    tx_start=$5
    tx_end=$6

    split($10,exon_starts,",")
    split($11,exon_ends,",")

as i use shell scripting for very basic purspose like write one liner but never use loop as such mostly i use R , so what i understand you are defining each field such as name=$2 is that what you are doing in the script , let me know if I'm understanding your script ...I would be glad if you can explain me..the awk part rest I used like bedtools but not seqtk...

ADD REPLYlink written 2.0 years ago by krushnach80630
1

I read the column description from ucsc table : https://genome.ucsc.edu/cgi-bin/hgTables. ( look for NCBI refseq and click on "describe table shema" )
As you see, each column correspond to :
name : transcript name
name2 : gene name
sens : gene direction
chrom: the chromosome where the gene is
exon_count : how many exon they are
tx_start : position of transcript's start
tx_end: position of transcript's end
exon_starts and exon_ends contains start and end positions of exon separated by comma.
So to get all exons coordinates, you should loop over those last fields. "split" function cut the field by comma and give you an array where you can loop over.

PS : the following statement are unused. You can remove it :
exon_count=$9
tx_start=$5
tx_end=$6

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by sacha1.8k

looks like the bed file exon coordinates i got from your script it doesn;t have tab space if i understand looking at it...

ADD REPLYlink written 2.0 years ago by krushnach80630

after running the awk script when i'm running bedtools to get fasta i get error like it looks like you have less than 3 columns at line :1 .Are you sure your files are tab-delimited?

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by krushnach80630

yes, sorry... I just fixed that above by adding BEGIN { OFS = "\t" }

ADD REPLYlink written 2.0 years ago by sacha1.8k

yes i added in the print command of the last line now it runs perfectly fine and Im able to get the sequence as well using the getfasta command as well so all these sequences are mature exon sequences for each gene isn;t it i don't see any interval when earlier I parsed out all the exon sequences like for 1 gene there might be 3 exon so i get those interval now it's gone ,am i correct?

ADD REPLYlink written 2.0 years ago by krushnach80630
1

One sequence = exon1+exon2+...+exonN

ADD REPLYlink written 2.0 years ago by sacha1.8k

okay so it's the total mature sequence , but as i see the output like this .

>DDX11L1
CTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCA
>DDX11L1
GTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAG

one of the output using your script ..so here i see two DDX11L1 exon i suppose is this the way am I suppose to get output ?or am I doing something wrong

ADD REPLYlink written 2.0 years ago by krushnach80630

There are many transcripts for one gene. You can replace the gene name by the transcript name to have a unique identifier :

    print(chrom, start, end, name2,sens, name)

by

    print(chrom, start, end, name,sens, name)
ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by sacha1.8k

so i wouldn't go for name as coordinates is better for me since my idea of getting the coordinates of all the exon sequence i would go for start and end chromosome only .and then the sequence and join them as one mature sequence i have made a separate post for this like new question here it is have a look

ADD REPLYlink written 2.0 years ago by krushnach80630
2
gravatar for kristoffer.vittingseerup
2.0 years ago by
European Union
kristoffer.vittingseerup3.0k wrote:

If your data is form GENCODE the easiest is simply to download the corresponding FASTA file from gencodeGenes - here is the link to the most recent human release but it is important you select the right release.

Alternatively I've used Cufflinks gffRead to extract the sequences of a GTF file with sucess multiple times.

ADD COMMENTlink written 2.0 years ago by kristoffer.vittingseerup3.0k

i read about the gffRead but i haven't tried it yet

ADD REPLYlink written 2.0 years ago by krushnach80630
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1715 users visited in the last hour