Question: best pipeline to get gene and transcript counts
0
gravatar for lkianmehr
10 months ago by
lkianmehr30
France
lkianmehr30 wrote:

Hello,

I would like to know your opinion about what I did, I am not sure that was really true . I downloaded Hisat2 prepared index mm10, UCSC genome, aligned RNA-seqs on, Next used Featurecounts to count on GTF Comprehensive gene annotation ALL GRCm38 from Genecode. Now I am suspicious to results, do you confirm all steps I did. because I am interested to look transcripts as well as genes. is this pipeline true for that? is not better to use genome_trans Index instead? or do you recommend using another alignment software such as STAR please let me know any useful suggestion

thanks in advance

ADD COMMENTlink modified 10 months ago by t.blasco9530 • written 10 months ago by lkianmehr30
0
gravatar for t.blasco95
10 months ago by
t.blasco9530
t.blasco9530 wrote:

My advice would be to use Kallisto algorithm which is fast, easy to implement and provides you detailed transcript quantification tables. Only thing you need is just the transcriptome fasta file and your fastq files.

Regarding to the aligner, STAR shows much more benefits than many other aligners in terms of accuracy and speed, but I have not used it for later gene quantification. However, using Kallisto is easy to relate the transcript names with gene names (by BioMart for example) so you can also estimate the gene counts of your samples.

ADD COMMENTlink written 10 months ago by t.blasco9530

thanks for the comment. I know what you mean, actually I did it by Salmon, but I would like to have gene count based on alignment as well. just I am not sure which genome reference and annotation should be used to have all regions including gene and transcripts?

ADD REPLYlink written 10 months ago by lkianmehr30

Did you align against the genome or transcriptome? If you want to use a traditional aligner to provide input to salmon (rather than using its builtin mapping algorithm), then you should either (1) align to the genome with STAR and tell it to project the alignments to the transcriptome (it can do this for you with the right flags and a gtf), or (2) align to the transcriptome using Bowtie2 (you can get mouse transcript sequences directly from e.g. gencode). Ypu can, of course, always use salmon's built in mappong functionality directly (providing a transcript reference), in whi ch case i'd always recommend grabbing the latest version and turning on --validateMappings. Finally. You can get gene level features using tximport (https://bioconductor.org/packages/release/bioc/html/tximport.html). Actually, for mouse, tximeta (https://github.com/mikelove/tximeta) may be even easier and should be able to pull txp to gene mappings down for you with minimal effort.

ADD REPLYlink written 10 months ago by Rob3.4k

I have aligned against genome only, but I wan to have all transcript including long non coding RNA then count it by featurecounts for example not only Salmon

ADD REPLYlink written 10 months ago by lkianmehr30

I see. In that case, perhaps consider aligning with STAR, which can give you both the genome and transcriptome alignments at the same time. The genome alignments can be counted using featureCounts, and the transcriptome alignments can be fed to salmon for transcript-level quantification. Also, it's completely possible to quantify any transcripts you want using salmon directly, just include them in the input transcriptome file. For example, if you take the comprehensive transcript set from gencode, it includes long non-coding RNAs.

ADD REPLYlink written 10 months ago by Rob3.4k

yes, you are right, I did transcript quantification on Salmon directly, then import it to tximport and continued with DE analysis. but I am afraid if need to have transcript or gene quantification based on alignment because in most of the articles report genes based on alignment. also, that would be needed for visualizing.

ADD REPLYlink written 10 months ago by lkianmehr30
1

In this case, as I suggested, the easiest path is to align with STAR and keep both the genome alignments and get the transcriptome alignments. As swbarnes2 mentions below, you can even use STARs --quantMode for the genome-level counts. Note, however, that you won't be able to get transcript level quantifications from featureCounts (or STAR --quantMode). Also, aggregating transcript-level abundances to the gene level is likely to be more accurate (even for gene level analysis) than counting approaches like featureCounts / htseq-count.

ADD REPLYlink written 10 months ago by Rob3.4k

Excuse me, to that purpose I should make STAR index for both genome and transcriptome? if so may I use fasta files for each one, for example, GRCm38.primary_assembly.genome.fa.gz and transcripts.fa.gz and one Comprehensive gene annotation from Gencode should be used. am I right?

ADD REPLYlink modified 10 months ago • written 10 months ago by lkianmehr30
1

No. STAR has the ability to perform a single alignment against genome (and gtf), and to provide 2 separate outputs using the same alignments. One BAM file contains the spliced alignments to the genome. The other BAM file contains the unspliced alignments to the transcriptome (suitable to provide to salmon). You should comsult the STAR documentation to see the command line parameters to feed to STAR to do this. The last thing yoi will need is the transcript sequence corresponding to the gtf you used. You can do this with the genome sequence and the tool gffread.

ADD REPLYlink written 10 months ago by Rob3.4k

Sorry I have one another problem with STAR, since I want to run several reads to be aligned at the same time, faced with this error Possible cause 1: not enough RAM. Check if you have enough RAM 28710272120 bytes Possible cause 2: not enough virtual memory allowed with ulimit. SOLUTION: run ulimit -v 28710272120 meanwhile defined separate --outFileNamePrefix for each one, do you have any idea? beacause takes long time to run one by one!

ADD REPLYlink written 10 months ago by lkianmehr30
1

You can reduce the memory required by star by building the index with the --genomeSAsparseD parameter set to a value greater than 1. This samples the suffix array. If the value is too large, STAR will become slower, but small values (e.g. 2/4) can provide a nice memory savings without slowing things down too much. Beyond that, I'd ask for recommendations on the STAR user group. Alex, the developer and maintainer of STAR, is usually very responsive.

ADD REPLYlink written 10 months ago by Rob3.4k

I did it successfully! thanks

ADD REPLYlink modified 10 months ago • written 10 months ago by lkianmehr30

but changing --genomeSAsparseD to 4 also did not generate any difference, and can not align several at the same time! I have to do it separately!

ADD REPLYlink written 10 months ago by lkianmehr30

Newer versions of STAR will output gene counts for you with the --quantMode option. This should mimic the behavior of htseq-count. https://github.com/alexdobin/STAR/blob/master/RELEASEnotes.md

ADD REPLYlink written 10 months ago by swbarnes26.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1398 users visited in the last hour