Require Help on Next Gen Sequencing of Spiders SRAs to find and indentify parasites.
1
0
Entering edit mode
14 months ago
Georgina • 0

Hi,

I am currently doing a journal article of the identification of parasites in various spider species using histopathology and next gen sequencing.

I have hit a brick wall with the data collecting side of it as I do not hold the experience or knowledge to do this. The method is to use NCBI to gather spider SRAs, and using serratus.io rdrp run, then enter the SRA to find any viruses. From there, the rdrp file is dowloaded and I put each sequence (which is put into nodes, i dont know what that means) into ncbi nucleotide blast. This is the method I was advised to do but I need to collect that data and put into excel for data collecting to analayse but I have no idea what any of it means or what parts of the information is important. I need a good chunk of data and majority of them come up with zero results too

I would really appreciate it if someone could help me gather spider data sets and analyse them for viruses and identify the viruses... or if theres a better way to do so. Thank You.

ncbi parasitology Next-gen-sequencing genetics • 707 views
ADD COMMENT
0
Entering edit mode

Since you are asking about finding data I will point you to https://sra-explorer.info/. If you type in "spider" as a search term you can get hundreds of hits. These are all datasets (it looks like there is a 1000 spider transcriptome project which may or may not be of interest to you). Way you do this search is detailed in this thread : sra-explorer : find SRA and FastQ download URLs in a couple of clicks

If you have no prior experience with next-gen sequencing data then this is a complex project. Are you working on this as a thesis project? Using public sequence data comes with several caveats since the aim of the public experiment may not exactly align with your interest. You would benefit from discussing with whoever assigned this project to you about strategy and execution.

ADD REPLY
0
Entering edit mode
9 months ago
ababaian ▴ 50

Hey Georgina. I'm a dev on Serratus.io and by chance have been working on a Serratus Virome Tutorial which is related to the type of search you want to do. I'm still a few days away from finishing the analysis in R, but how I would approach here. If you open an issue on our github I can update you when it's done.

E. Virome Analaysis - _Toxoplasma gondii Case Study_

[~1 day] A Virome is a set of viruses which share a common host, environmental, or ecological niche. For example one can describe the "Human Virome", a "Antarctic Soil Virome" or the "Rainforest Virome". As such it is useful to query for all Serratus database viruses which associate with common factor. This tutorial explains how to generate and execute such a search query efficiently.

As an illustrative example, this tutorial will aim to describe the Serratus-associated virome of the unicellular parasite Toxoplasma gondii(NCBI taxid: 5811).

Prerequisites

  • Internet web browser
  • Unix-based command-line
  • pgAdmin (postgreSQL client)
  • Google Cloud Account [Optional]

X Query RNA-dependent RNA Polymerase (RdRP) sequences X Serratus microassemblies: rdrp1.mu.fa (3.6 GB) X DIAMOND (v2.0.9+) X muscle: multiple sequence alignment

XXXXDownload Tutorial E Files

1. Defining "Virome Search Criteria"

There is not a perfect means to define a Virome Inclusion and Exclusion criteria. What is important is to define the criteria in a systematic, unbiased and reproducible manner.

A) SRA Taxonomy Search

Every SRA sequencing run is associated with a limited set of "run-specific meta-data" called the SRARunInfo. This contains a column called scientific_name which is a human-written label for the sequencing run which captures what taxonomic "organism" the dataset is from. To search for Toxoplasma gondii (txid: 5811) we will select its taxonomic parent, the genus Toxoplasma (txid: 5810) to include Toxoplasma sp. YC-2010a, a Toxo which is not categorized under T. gondii.

To search the SRA for such organisms, we can search the SRA:

[`txid5810[Organism:exp]`](https://www.ncbi.nlm.nih.gov/sra/?term=txid5810[Organism:exp])

which on 23/06/28 returned 2,813 datasets. Download this SraRunInfo file (txid5810_SraRunInfo.csv in tutorial package). Downloaded file has 1,856 matching 'runs'.

Figure 1. Taxonomy Search

B) SRA Meta-data Search

Sometimes datasets are not labelled based on the parasite T. gondii, instead the host organism is choosen (Homo sapiens, Mus musculus, ...). To search if T. gondii appears in any meta-data we can search for different spellings/synonyms. Search is exact here.

[`"Toxoplasma gondii" OR "Toxoplasma" OR "T. gondii" OR "T gondii"`](https://www.ncbi.nlm.nih.gov/sra/?term=%22Toxoplasma+gondii%22+OR+%22Toxoplasma%22+OR+%22T.+gondii%22)

which on 23/06/28 returned 6,394 datasets. Download this SraRunInfo file (toxo_SraRunInfo.csv). Downloaded file has 23,531 matching 'runs'.

NOTE: This type of search is more likely to yield False Positive matches for T. gondii, or places where your search returns synonym results. For example this Neospora caninum sample mentions "T. gondii".

Figure 2. Meta-data Search

C) Sequence Content Search (optional)

What if the data-generators were unaware that their sample contained T. gondii? It's possible to search for known-organism genomes using a precomputed hashed-kmer database called NCBI STAT. To use STAT, you must register for a GCP account and follow these instructions.

STAT BigQuery for libraries containing great than 50 Toxoplasma-mapping reads:

    SELECT
      acc,
      tax_id,
      name,
      total_count,
      self_count
    FROM
      nih-sra-datastore.sra_tax_analysis_tool.tax_analysis
    WHERE
      tax_id in (5810)
      AND total_count>50

which on 23/06/28 returned 20,198 datasets. Download this statbigquery file (txid5810_statbigquery.csv).

T gondii STAT The number of T.gondii+ reads detected by STAT in each of the 20,198 returned runs. There are 12,863 runs which contain >1000 reads.

Set Comparison

Set Overlaps. Made with https://bioinformatics.psb.ugent.be/cgi-bin/liste/Venn/calculate_venn.htpl

To err on the side of inclusion for now, we will take the Union of these sets, so 32,601 runs as representing the total T. gondii sequencing set.

D) Set Union and Virome-adjacent controls (optional)

We, unfortunately, now have two different csv tables with slightly varying meta-data. For simplicity we will take the Union of these sets and generate a uniform collection of meta-data so that the data can be processed uniformly.

Depending on your experimental design it may be beneficial to include a Virome adjacent or Virome control datasets in downstream analysis. Briefly, every SRA 'run' belongs to a greater bioproject. One bioproject can group together say 20 runs, and based on meta-data our queries could return only a sub-set of a given bioproject. For example by searching for T. gondii with STAT, in a study containing 20 samples of cat muscle biopsies, only 2/20 may be positive for T gondii, and therefore we would like to include the remaining 18/20 cat biopsies of that study. This will prove useful as a negative control to deplete for non-specific viruses with respect to our query of T. gondii.

#!/bin/bash
# Generate list of all UNIQUE SRA run id from all query sets above
cut -f1 -d',' toxo_SraRunInfo.csv        > tg_run.list.tmp
cut -f1 -d',' txid5810_SraRunInfo.csv   >> tg_run.list.tmp
cut -f1 -d',' txid5810_statbigquery.csv >> tg_run.list.tmp

# Remove header lines
grep -v "^Run" tg_run.list.tmp \
  | grep -v "^acc" - \
  | sort -du - > tg.runlist 

wc -l tg.runlist
# 32601 tg.runlist

This list can now be used as an input to query for all SRA meta-data in big query

#GCP Big Query
    SELECT
      acc, organism, biosample, bioproject, assay_type, librarysource, libraryselection, mbases, mbytes, avgspotlen, releasedate, consent 
    FROM
      `nih-sra-datastore.sra.metadata` 
    WHERE
      acc in ('DRR001705', 'DRR001706', 'DRR002461', ..., SRR9997107', 'SRR9997108')

Saved as tg.SraMetadata.csv returned 32,474 "tg-positive" SRA runs (99.6%) of the search query. From this we will create a list of all 2,586 unique bioproject identifiers (column 4) and re-query for all related runs.

#!/bin/bash
cut -f4 -d', tg_virome.SraMetadata.csv | sort -u - > tg_bioproject.list
#GCP Big Query
    SELECT
      acc, organism, biosample, bioproject, assay_type, librarysource, libraryselection, mbases, mbytes, avgspotlen, releasedate, consent 
    FROM
      `nih-sra-datastore.sra.metadata` 
    WHERE
      bioproject in ('PRJDA33425', 'PRJDA33429', ..., 'PRJNA985929', 'PRJNA988114')

Saved as tg_adjacent.SraMetadata.csv Which yields 668,195 total, or 635,721 "tg-negative" SRA runs. This will serve as a "negative control" for non-specific virus-associations.

2. Generating the "Target Virome" and "Off-target Virome"

Using the "Tg-postive" and "Tg-negative" runs, we will query the Serratus SQL tables to retrieve sOTU which were found within those runs. This molecular barcode approximates RNA virus species and can be used to meaningfully group related sequences when looking for an association.

# Serratus postgres SQL Query
SELECT * FROM public.palm_sra2
WHERE qc_pass = 'true' AND
run_id in ('DRR001701', 
'DRR001702', 
'DRR001703', 
'DRR001704',
...
'SRR9997116', 
'SRR9997117', 
'SRR9997118')
ORDER BY run_id ASC

Saved as tg_adj_U_palmsra.csv which returned 121,907 virus-run observations in 15,256 SRA runs, containing 30,828 distinct virus sOTU.

3. Virome summary statistics (to be continued)

ADD COMMENT

Login before adding your answer.

Traffic: 1278 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6