Is it possible to extract the 'intronic' sequences from a transcriptome obtained from BioMart using the Biostrings package (R)? I also have the exon and UTR sequences available.
5.0 years ago
vladimiralan ▴ 20

Using the biomaRt R package I was able to obtain the transcriptome of an organism as well as all its exon and UTR sequences. Using the Biostrings package I managed to process these objects so that I can write them as FASTA files; however, I also need the intronic sequences, and there is no way to get those directly from biomaRt.

Assuming that whatever is present in the transcriptome FASTA file but not in the exon and UTR FASTA files is an intron, is there a way I can extract that from the transcriptome FASTA file using the Biostrings library? I can work with either the FASTA files I managed to write or the object as it comes out of the getBM() function, so I'll use whichever is easier to work with.

If there is no way I can do that... what do you guys think I should do?

*Should I learn and use bedtools instead? Can I use the ranges obtained from Biomart to extract them that way?

*Should I learn and use TxDb instead? I wonder if I can use the data as I have it right now.

*Should I be using base R instead? Something like gsub to mask or delete the exons and UTRs?

Thanks in advance! Any input is totally appreciated.

R genome sequencing sequence RNA-Seq • 2.2k views

Please consider shortening the title so it is more concise.


Are you able to get annotation or at least make a basic file from what you have?
Get intronic and intergenic sequences based on gff file
Get Introns from a gft annotation file


Hey! Thanks for replying! Sorry about the title... What do you mean by a 'basic file'? I wrote my script so that I can download all the transcript, exon and UTR sequences of an organism as multi-FASTA files (three files, one for each biotype). I obtain the data from Ensembl using the biomaRt package in R, so yes, I think I could get the annotation (I suppose I only need to learn how to write a GTF/GFF file from biomaRt).

So you'd say I should take any of the approaches above instead of trying to do something with Biostrings?


I am not familiar with Biostrings, but the answers above will work if you are able to get a GTF/GFF file easily. I am not sure what organism you are working with, but the files may already be available via the Ensembl Genomes pages.
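
For what it's worth, the coordinate logic behind the GTF/GFF route is small enough to sketch. The snippet below is only an illustration in Python, not what the linked answers actually run: it assumes exon records have already been parsed into tuples with 1-based inclusive coordinates (as in GTF), and the function name `introns_from_exons` and the example records are made up for the sketch. Since UTRs lie inside exons, exon coordinates alone should be enough to find the gaps.

```python
from collections import defaultdict


def introns_from_exons(exons):
    """Return introns as the gaps between consecutive exons of each transcript.

    exons: iterable of (transcript_id, chrom, start, end, strand) tuples,
    using 1-based inclusive coordinates (the GTF convention).
    """
    # Group exon spans by transcript (hypothetical in-memory layout).
    by_transcript = defaultdict(list)
    for tx_id, chrom, start, end, strand in exons:
        by_transcript[(tx_id, chrom, strand)].append((start, end))

    introns = []
    for (tx_id, chrom, strand), spans in by_transcript.items():
        spans.sort()  # order exons by genomic start, regardless of strand
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            # Only report a gap if the exons are not directly adjacent.
            if next_start > prev_end + 1:
                introns.append((tx_id, chrom, prev_end + 1, next_start - 1, strand))
    return introns


# Made-up example: tx2 skips the middle exon of tx1.
example_exons = [
    ("tx1", "chr1", 100, 200, "+"),
    ("tx1", "chr1", 301, 400, "+"),
    ("tx1", "chr1", 501, 650, "+"),
    ("tx2", "chr1", 100, 200, "+"),
    ("tx2", "chr1", 501, 650, "+"),
]
example_introns = introns_from_exons(example_exons)
```

The resulting coordinates could then be used to slice the genomic (unspliced) sequence; a spliced transcript sequence would not contain those positions.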


Thank you so much for your response! And yeah, it seems like those strategies you linked are the way to go.

You see, I am just an undergraduate student starting to learn R and the Bioconductor library, and I only recently started using the biomaRt package. When I found that you cannot directly get the intronic sequences of an organism with this package, I decided to try to figure out a way to do it with another library I had just discovered (Biostrings), but I have had no success so far. So yeah, I don't really 'have' a particular organism; I am just trying to figure out the capabilities and limitations of these wonderful tools. This forum seems to be more focused on solving specific, real-world bioinformatics problems, so I apologize if this was not the place to ask my question.

Once again, I truly appreciate the time and consideration you took in replying to this question.

4.8 years ago
ATpoint 82k

Not R, but all you need is the intergenic regions. Introns will then be the complement of the merge of exons and intergenic regions: A: how to get intronic and intergenic sequences based on gff file?
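
A minimal sketch of that "complement of the merge" idea, in plain Python rather than with the tool used in the linked answer. It assumes a single chromosome, 0-based half-open intervals (the BED convention), and made-up coordinates; the helpers `merge` and `complement` are written here for illustration and are not taken from any library.

```python
def merge(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def complement(intervals, chrom_length):
    """Return the parts of [0, chrom_length) not covered by the intervals."""
    gaps = []
    cursor = 0
    for start, end in merge(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        gaps.append((cursor, chrom_length))
    return gaps


# Made-up annotation for one chromosome of length 1000.
chrom_length = 1000
genes = [(100, 400), (600, 900)]
exons = [(100, 150), (250, 300), (350, 400), (600, 700), (800, 900)]

# Intergenic = everything not covered by a gene.
intergenic = complement(genes, chrom_length)
# Introns = everything that is neither exonic nor intergenic.
introns = complement(exons + intergenic, chrom_length)
```

Note that this yields regions that are intronic with respect to the union of all exons, so a position that is exonic in any transcript is never reported as intronic.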
