Hi all,
I am very new to bioinformatics. I want to look at the expression of a single gene in transcriptomic data from different organisms, without downloading the SRA data and without doing an assembly. What would be the simplest way to find the expression?
Thanks in advance.
I tried asking an author once: I asked the corresponding author for the TSA assembly (made with CLC genomics) described in their published paper, because I wanted to check for one gene only. In return, the person asked for co-authorship on the paper we were preparing, because it had taken them a long time to collect the RNA.
As the raw reads were available in SRA, I assembled them myself using Trinity instead :/
What if I run SRA BLAST against a particular SRA experiment with the target gene as the query, and then:
Count up the total reads in the sample and divide that number by 1,000,000. This is the "per million" scaling factor.
Divide the read count for the gene by the "per million" scaling factor. This normalizes for sequencing depth, giving reads per million (RPM).
Divide the RPM value by the length of the gene in kilobases, giving RPKM.
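The arithmetic in the steps above can be sketched as a small shell function. This is only an illustration of the RPKM calculation as described; the function name `rpkm` and the numbers in the example call are made up, and it assumes you already have a read count for the gene from somewhere.

```shell
#!/bin/sh
# Hypothetical helper: rpkm <gene_read_count> <total_reads_in_sample> <gene_length_bp>
rpkm() {
    awk -v c="$1" -v n="$2" -v l="$3" 'BEGIN {
        per_million = n / 1000000        # "per million" scaling factor
        rpm = c / per_million            # reads per million (RPM)
        printf "%.2f\n", rpm / (l / 1000) # divide by gene length in kb -> RPKM
    }'
}

# Made-up example: 500 reads hit a 2000 bp gene in a sample of 20 million reads
rpkm 500 20000000 2000   # prints 12.50
```

Note that the result is only as meaningful as the read count that goes in, which is the point the replies in this thread argue about.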
Please use ADD COMMENT or ADD REPLY to reply to earlier comments, so that this thread remains logically structured and easy to follow. I have moved this post now, but as you can see it's not optimal.
Do I understand correctly that you want to compare gene expression quantitatively for single genes across several experiments, very different organisms and different protocols, and all that without an assembly? This can't be done.
Let me explain what I am trying to do. I want to look at the expression of a gene family. I have identified the genes of this superfamily in different plant pathogens, and now I am looking for the expression level of those genes in the corresponding organisms using SRA experiments (pathogen-infected host samples or pathogen transcriptomes) sampled at different time intervals. The time intervals are the same in all experiments and the stress is similar (biotic stress); only the host organism and the pathogen change. I am not interested in showing exact expression values, since those can only be calculated from assembled data. I only want to give an idea of how the expression of these genes varies between pathogens at similar time intervals.
I am sorry, but I do not like the idea of generating some semiquantitative estimate (which is possibly what your use of "idea" implies). You cannot base any conclusion on such a procedure, and it would not convince anyone. Yes, you could use k-mer-based methods instead of alignment, but everything requires a transcriptome. There is a way to solve this problem: generate assemblies using e.g. Trinity and then map the reads back. So why not take it?
SRA BLAST will not help you because it can only do blastn, but to find a gene in distantly related organisms you need at least tblastn. In my opinion there is no good alternative to downloading the data: ideally the draft assemblies, or otherwise the raw reads, which you then assemble yourself using e.g. Trinity on transcriptome shotgun reads. Here is a script that automatically downloads all draft assemblies for a given taxid using eutils and sratools.
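#!/bin/sh
## usage: fetchAllAssembliesByTaxid.sh <taxid>
## saves the query result in <taxid>.esearch.xml for your reference and further processing
## requires NCBI EDirect (esearch, efetch, xtract) and sra-tools (fastq-dump)
set -u
if [ $# -ne 1 ]; then
    echo "usage: $0 <taxid>" >&2
    exit 1
fi
TAX=$1
# find all TSA/WGS master records below the given taxon and keep a copy of the XML
RESULT=$(esearch -db nuccore -query "(txid${TAX}[Organism:exp]) AND (\"tsa master\"[Properties] OR \"wgs master\"[Properties])" | \
    efetch -format xml | tee "${TAX}.esearch.xml")
# extract the names of the master records from the XML
ID=$(echo "$RESULT" | xtract -pattern Seq-entry -element Textseq-id_name)
for I in $ID ; do
    echo "Downloading $I ..."
    if [ -e "$I.fasta" ]
    then
        echo " skipping because file exists."
        continue # skip if the file has been downloaded already
    fi
    # dump the assembled sequences as FASTA, keeping the original sequence names
    fastq-dump --fasta -F "$I"
done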
If you tried the same for raw reads, the data volume and computational requirements would grow considerably. I would therefore recommend doing a de novo assembly from raw reads only for a few hand-picked transcriptomes. Transcriptomes can be heavily contaminated with RNA from symbionts, ingested material, etc. So if you find a hit to your gene of interest, you still need to do a phylogenetic analysis to exclude the possibility that it comes from a contaminant.
The simplest method is to email the various authors and ask them to send that data to you...
Estimating RPKM from SRA BLAST hits will not work: there are too many confounding factors that cannot be corrected for.
https://www.ncbi.nlm.nih.gov/news/11-19-2013-SRA-BLAST/