I am trying to parse out exons, or rather mature exons. What I have done so far: I took the exon coordinates from the GTF file, converted them into a BED file, and then used those BED coordinates to extract the exon sequences from the FASTA file with the bedtools getfasta command.
The command I used is:
bedtools getfasta -fi hg19.fa -bed exon.bed -fo -exon_Seq -split
I would be glad to know whether this command is correct or not.
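For reference, the GTF-to-BED step described above can be sketched in Python as follows. This is only a minimal sketch, not the exact conversion I ran: the function name gtf_line_to_bed and the example record are hypothetical, and it assumes the standard nine tab-separated GTF columns. The point it illustrates is the coordinate shift: GTF is 1-based and inclusive, BED is 0-based and half-open, so only the start is decremented.

```python
def gtf_line_to_bed(line):
    """Convert one tab-separated GTF record into a BED6-style tuple.

    Hypothetical helper; assumes the standard GTF column order:
    chrom, source, feature, start, end, score, strand, frame, attributes.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, _source, feature, start, end, score, strand = fields[:7]
    # GTF start is 1-based inclusive; BED start is 0-based, so subtract 1.
    # The end coordinate stays the same because BED ends are exclusive.
    return (chrom, int(start) - 1, int(end), feature, score, strand)


if __name__ == "__main__":
    # Made-up record for illustration only.
    record = 'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "example";'
    print("\t".join(str(x) for x in gtf_line_to_bed(record)))
```

If the BED file was produced this way, the start values in it should be one less than the starts in the GTF, while the ends should match.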
So now I have a file with the exon sequences for the whole genome. How do I parse out the mature exon sequences from this file? The file looks like this:
>chr1:11871-12227
AACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCA
>chr1:11873-12227
CTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCA
This is just a small subset of my data. How do I identify the mature sequences for each gene? After that, say I have a gene with 5 exons: I want to join the first exon and the last exon and use the result for downstream analysis.
So far I have done all of the above with shell scripts. I guess R wouldn't be much use here and Perl would be needed for the parsing, so how do I go about it?
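To make the joining step concrete, here is a minimal Python sketch of what I mean (Python instead of Perl, purely for illustration). The names read_fasta and join_first_last are hypothetical, and the sketch assumes the exon sequences of one gene are already collected in transcript order, which is the part I do not yet know how to do.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def join_first_last(exon_seqs):
    """Concatenate the first and last exon sequence of one gene.

    Assumes exon_seqs is ordered along the transcript; a single-exon
    gene is returned unchanged rather than duplicated.
    """
    if not exon_seqs:
        return ""
    if len(exon_seqs) == 1:
        return exon_seqs[0]
    return exon_seqs[0] + exon_seqs[-1]


if __name__ == "__main__":
    # Toy input for illustration; real input would be the getfasta output file.
    toy = [">chr1:0-4", "ACGT", ">chr1:10-14", "TTGG", ">chr1:20-24", "CCAA"]
    seqs = [seq for _, seq in read_fasta(toy)]
    print(join_first_last(seqs))
```

The open question is how to decide which records in the getfasta output belong to the same gene, since the headers only carry coordinates.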
UPDATE
Below is output from my coordinates BED file. I can see the line chr1 11868 12227 + exon
and the line chr1 11871 12227 + exon,
and the corresponding information from the GTF file is 11868 12227 exon:ENST00000456328.2:1.
From this I understand that the gene has two exons. I used these exon coordinates with bedtools getfasta to extract the respective exon sequences, and now I want to join the coordinates 11868 12227 and 11871 12227,
which will give me the mature sequence. This is just an example: the full file has hundreds of exon coordinates, with one or more exons per gene. How do I parse out the mature coordinates, join the first and last exon of each gene, and get the sequence?
cat coordinates.bed | head -5
chr1 11868 12227 + exon
chr1 11868 14409 + transcript
chr1 11868 14412 + gene
chr1 11871 12227 + exon
chr1 11871 14412 + transcript
cat annotation.bed | head -1
chr1 11868 12227 exon:ENST00000456328.2:1 . + HAVANA exon . ID=exon:ENST00000456328.2:1;Parent=ENST00000456328.2;gene_id=ENSG00000223972.4;transcript_id=ENST00000456328.2;gene_type=pseudogene;gene_status=KNOWN;gene_name=DDX11L1;transcript_type=processed_transcript;transcript_status=KNOWN;transcript_name=DDX11L1-002;exon_number=1;exon_id=ENSE00002234944.1;level=2;havana_gene=OTTHUMG00000000961.2;havana_transcript=OTTHUMT00000362751.1;tag=basic
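Based on the annotation line above, one way I imagine the grouping could work is sketched below in Python. It is an untested idea rather than a known solution: the function names are hypothetical, and it assumes every line has the ten-column layout shown above (feature type in column 8, key=value attributes in column 10) and that exon_number follows transcript order. It groups exons by transcript_id, not by gene.

```python
from collections import defaultdict


def parse_attributes(text):
    """Parse 'key=value;key=value' attributes into a dict."""
    attrs = {}
    for item in text.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def first_last_exons(lines):
    """Map transcript_id -> (first_exon, last_exon).

    Each exon is (chrom, start, end, strand). Assumes the column layout
    of the annotation.bed line shown above.
    """
    exons = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split(None, 9)
        if len(cols) < 10 or cols[7] != "exon":
            continue  # skip gene/transcript rows and malformed lines
        attrs = parse_attributes(cols[9])
        if "transcript_id" not in attrs or "exon_number" not in attrs:
            continue
        exons[attrs["transcript_id"]].append(
            (int(attrs["exon_number"]), cols[0], int(cols[1]), int(cols[2]), cols[5])
        )
    result = {}
    for transcript_id, records in exons.items():
        records.sort()  # order by exon_number
        result[transcript_id] = (records[0][1:], records[-1][1:])
    return result


if __name__ == "__main__":
    with open("annotation.bed") as handle:
        for tid, (first, last) in first_last_exons(handle).items():
            print(tid, first, last)
```

The resulting first and last coordinates could then be written to a new BED file and passed to bedtools getfasta, with the two sequences concatenated afterwards.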
Any help or suggestion would be highly appreciated.