Closed:InsideDNA: Evaluating which species and genes are most sequenced within a taxon range in NCBI database with geneCoverage
7.8 years ago

One of the common burdens for evolutionary biologists working on phylogeny reconstruction is supplementing newly sequenced data with sequences already available in GenBank. This exercise is particularly common when one would like to build a large(r) phylogenetic tree. Here we present a small pipeline of two tools, geneCoverage and geneCoverage2fasta, which automates two critical steps of such tasks: evaluation of gene coverage for a given taxon (i.e. how many unique species were sequenced for different genes/gene products) and automatic retrieval of the most represented sequences via BLAST from the GenBank database in FASTA format.

For simplicity, let’s consider the following case: you have sequenced several species from the Balanophoraceae plant family for the 5.8S ribosomal RNA and 18S ribosomal RNA products and would now like to:

  • evaluate which other Balanophoraceae species were sequenced and for which genes
  • download these sequences from GenBank in FASTA format and combine the downloaded sequences with the de novo sequences
  • align the sequences and prepare a gene matrix for all obtained genes (combined)
  • reconstruct a phylogeny for the entire Balanophoraceae family

To keep this tutorial short, we cover only step 1 here; steps 2, 3 and 4 will be discussed in the next tutorials (so, subscribe to our newsletter).

Evaluate which Balanophoraceae species were sequenced and for which genes

1. Registration.

Create an account on the InsideDNA website. Once you get a confirmation email, you will need to fill in a short survey (at most 3 minutes), and then you are ready to go.

2. General overview of InsideDNA

In the platform you have three main bookmarks in the top navigation menu: Tools, Tasks and Files. Their purpose is quite self-explanatory, but in principle:

  • In Tools you can search for bioinformatics tools, organize them into projects and run them;
  • In Tasks you can monitor the progress of your submitted tasks and perform some operations on them;
  • In Files you can manage all your data much as you would in any normal desktop file manager.

3. Create a new Project.

First, let’s create a new empty project by clicking on +Add new project.

Name it Balanophoraceae.

Second, we are going to add two tools to the project. Type geneCoverage in the search box, then click the add button and choose the Balanophoraceae project in the dropdown list.

Do the same for the geneCoverage2fasta tool.

4. Initialize a task

Next, we simply select the Balanophoraceae project and click Run Tool on the geneCoverage tool. Here you can also obtain more information about the tool by pressing the Read More button.

5. Specify Tool settings

Once you have clicked Run Tool, the Tool Settings menu opens. Here you need to specify the Task name, the tool parameters and the so-called queue (read more about Queues). Then you will need to preview the task and submit it.

Specify a task name that is easy for you to recognize later on, for instance Balan_geneCov.

geneCoverage is a very simple tool to launch. You only need to specify a taxon range for which you’d like to evaluate gene coverage and an output folder to store the resulting files. Here we will specify Balanophoraceae as the taxon range.

By clicking the Browse button, we open a mini File Manager (miniFM) that facilitates the choice of input/output files and folders.

As we have not yet created any folder for the Balanophoraceae project in our account, we will do it right in the miniFM. Click the plus button.

Then make a new folder called Balanophoraceae_project. This will be our working directory.

Select this folder. The output files will be automatically placed in this directory.

It is a good idea to always preview the task – this way you can have a quick check of the specified parameters and also familiarize yourself with a “command line” way of doing bioinformatics.

If you are satisfied with the task settings, you will need to select a queue for the task. Different Queues provide different computing capacities, and it is a good idea to start with a smaller queue. In our case it is sufficient to launch the task in either the Micro or the Normal queue (you can read more about Queues here).

Once you have selected a Queue, simply click the Submit task button. Once the task is submitted, you can either go straight to Task monitoring or stay on the current page. Staying on the current Tool Settings page is useful when you need to submit multiple similar tasks where you only modify the input/output data and some parameters (aka “manual batch processing”). But for now click on View submitted task.

6. Monitoring task progress.

After task submission you will see the task in the Task manager.

In the Tasks bookmark you have several options:

1) Tasks are grouped by their status. Typically, an error-free task first appears in the Running group and then automatically moves to the Completed group. At present you can run at most 5 tasks simultaneously, so if you submit more than 5 tasks at once, the extra tasks are put on hold (Suspended group). A Suspended task is launched as soon as one of the Running tasks completes. If something goes wrong, the task ends up in the Failed group. There are always a couple of ways to check what went wrong.

2) You can always interact with the submitted tasks by clicking the buttons in the top menu. The same menu appears as a dropdown if you click the arrow-down button.

3) You can always preview a task's error log, Tool settings and general parameters, such as the type of queue at submission time.
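
The slot rule from point 1 can be pictured with a small simulation. This is only an illustrative sketch, not InsideDNA's actual scheduler: the function `simulate_queue`, its `max_running` parameter, and the assumption that tasks finish in the order they started are all hypothetical.

```python
from collections import deque

def simulate_queue(task_names, max_running=5):
    """Return a list of (status, task) events for a batch of submitted tasks.

    Assumptions (not taken from the platform): tasks finish in the order
    they started, and the oldest suspended task is resumed first.
    """
    running = deque()
    suspended = deque()
    events = []

    # Submission: fill the free slots, put the rest on hold.
    for name in task_names:
        if len(running) < max_running:
            running.append(name)
            events.append(("Running", name))
        else:
            suspended.append(name)
            events.append(("Suspended", name))

    # Each completed task frees one slot for the oldest suspended task.
    while running:
        finished = running.popleft()
        events.append(("Completed", finished))
        if suspended:
            resumed = suspended.popleft()
            running.append(resumed)
            events.append(("Running", resumed))
    return events

events = simulate_queue([f"task{i}" for i in range(1, 8)])
for status, name in events:
    print(status, name)
```

With seven submitted tasks, the first five start immediately and the remaining two wait in the Suspended group until a slot frees up.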

Our task first appears in the Running group and stays there for a couple of minutes. Once done, it is moved to the Completed group, and we can verify that nothing went wrong by looking at the error log in the right panel.

7. Obtaining the files

Now, let’s move to the File Manager (FM). Click on Files and navigate into the Balanophoraceae_project directory. Here you will see all files associated with the task. There are two files: the original GenBank file and a CSV table. We are interested in the CSV table. Let’s first preview what is inside the file by clicking the preview button on the right.

It looks like we got an interesting summary of gene coverage.

Now, let’s download the file and play with it in R. Click the Download button.

8. Understanding the output and summarizing the gene coverage with R

Now we know which genes/gene products were sequenced for Balanophoraceae and would like to count how many unique species were sequenced for each unique gene/gene product. It is important to remember that we are building a species-level phylogeny; therefore, we will only count unique species. Then we can select so-called reference sequences for the BLAST-to-FASTA stage. To do that we need a small helper function in R. Copy the code below and paste it into R:

Do not forget to set the working directory to the folder where you downloaded the Balanophoraceae.csv file (setwd command in R).
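
For readers who want to check the table outside R, the counting step (unique species per gene product) can also be sketched in Python. This is a minimal sketch under stated assumptions: the column names `species` and `product` are guesses at the CSV header, so check your own file and adjust them. The function name `count_unique_species` is hypothetical, and the demo rows are made up purely to show the logic.

```python
import csv
import io
from collections import defaultdict

def count_unique_species(rows, species_col="species", product_col="product"):
    """Return (product, number_of_unique_species) pairs, most covered first.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    The default column names are assumptions about the CSV header.
    """
    species_by_product = defaultdict(set)
    for row in rows:
        product = (row.get(product_col) or "").strip()
        species = (row.get(species_col) or "").strip()
        if product and species:
            # A set keeps each species once, however many records it has.
            species_by_product[product].add(species)
    counts = [(p, len(s)) for p, s in species_by_product.items()]
    # Sort by descending count, then alphabetically for stable output.
    return sorted(counts, key=lambda item: (-item[1], item[0]))

# Made-up demo table (not real GenBank records).
demo_csv = """species,product,accession
Species A,18S ribosomal RNA,ACC0001
Species A,18S ribosomal RNA,ACC0002
Species B,18S ribosomal RNA,ACC0003
Species B,5.8S ribosomal RNA,ACC0004
"""
demo_counts = count_unique_species(csv.DictReader(io.StringIO(demo_csv)))
for product, n in demo_counts:
    print(f"{product}\t{n}")
```

To run it on the downloaded table, open Balanophoraceae.csv with `open("Balanophoraceae.csv", newline="")` and pass a `csv.DictReader` over that handle to the function, giving the real column names.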

From the summary, we can conclude that:

1) the top 3 sequenced gene products are:

  • 18S ribosomal RNA (15 unique species)
  • 28S ribosomal RNA (7 unique species)
  • 5.8S ribosomal RNA (7 unique species)

2) the reference sequences for these 3 gene products are:

  • L24044.1 (18S rRNA)
  • JN392881.1 (28S and 5.8S rRNA share the same reference)

3) we can prepare an input file for the next BLAST stage, which retrieves the FASTA sequences in a ready-to-build-phylogeny format. To do this, just make a small tab-separated text file as follows:

4) Upload this tab-separated file (name it Balan_id.txt) back into the InsideDNA application by clicking the Upload button in the File Manager.
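
The tab-separated file from step 3 can also be written programmatically instead of by hand. The sketch below assumes one row per gene product, with the product label in the first column and the reference accession in the second. That layout is an assumption, not the documented geneCoverage2fasta input format, so verify the expected columns before using it. The function name `write_reference_table` is hypothetical; the accessions are the ones listed in point 2.

```python
import csv
import io

# Reference accessions taken from the summary in point 2.
references = [
    ("18S ribosomal RNA", "L24044.1"),
    ("28S ribosomal RNA", "JN392881.1"),
    ("5.8S ribosomal RNA", "JN392881.1"),
]

def write_reference_table(rows, handle):
    """Write (label, accession) pairs as tab-separated lines.

    The two-column layout is an assumption about the expected input.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)

# Write to an in-memory buffer so the result can be inspected first.
buffer = io.StringIO()
write_reference_table(references, buffer)
table_text = buffer.getvalue()
print(table_text, end="")
```

To produce the actual file, call the same function with a handle from `open("Balan_id.txt", "w", newline="")`.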

insideDNA genomics • 2.0k views