Question

best pipeline to get gene and transcript counts

0

Entering edit mode

6.7 years ago

lkianmehr ▴ 100

Hello,

I would like to know your opinion about what I did, I am not sure that was really true . I downloaded Hisat2 prepared index mm10, UCSC genome, aligned RNA-seqs on, Next used Featurecounts to count on GTF Comprehensive gene annotation ALL GRCm38 from Genecode. Now I am suspicious to results, do you confirm all steps I did. because I am interested to look transcripts as well as genes. is this pipeline true for that? is not better to use genome_trans Index instead? or do you recommend using another alignment software such as STAR please let me know any useful suggestion

thanks in advance

RNA-Seq alignment count Hisat2 STAR • 4.6k views

ADD COMMENT • link updated 6.7 years ago by t.blasco95 ▴ 30 • written 6.7 years ago by lkianmehr ▴ 100

score 0 · Answer 1 · 2018-10-22

0

Entering edit mode

6.7 years ago

t.blasco95 ▴ 30

My advice would be to use Kallisto algorithm which is fast, easy to implement and provides you detailed transcript quantification tables. Only thing you need is just the transcriptome fasta file and your fastq files.

Regarding to the aligner, STAR shows much more benefits than many other aligners in terms of accuracy and speed, but I have not used it for later gene quantification. However, using Kallisto is easy to relate the transcript names with gene names (by BioMart for example) so you can also estimate the gene counts of your samples.

ADD COMMENT • link 6.7 years ago by t.blasco95 ▴ 30

0

Entering edit mode

thanks for the comment. I know what you mean, actually I did it by Salmon, but I would like to have gene count based on alignment as well. just I am not sure which genome reference and annotation should be used to have all regions including gene and transcripts?

ADD REPLY • link 6.7 years ago by lkianmehr ▴ 100

0

Entering edit mode

Did you align against the genome or transcriptome? If you want to use a traditional aligner to provide input to salmon (rather than using its builtin mapping algorithm), then you should either (1) align to the genome with STAR and tell it to project the alignments to the transcriptome (it can do this for you with the right flags and a gtf), or (2) align to the transcriptome using Bowtie2 (you can get mouse transcript sequences directly from e.g. gencode). Ypu can, of course, always use salmon's built in mappong functionality directly (providing a transcript reference), in whi ch case i'd always recommend grabbing the latest version and turning on --validateMappings. Finally. You can get gene level features using tximport (https://bioconductor.org/packages/release/bioc/html/tximport.html). Actually, for mouse, tximeta (https://github.com/mikelove/tximeta) may be even easier and should be able to pull txp to gene mappings down for you with minimal effort.

ADD REPLY • link 6.7 years ago by Rob 7.1k

0

Entering edit mode

I have aligned against genome only, but I wan to have all transcript including long non coding RNA then count it by featurecounts for example not only Salmon

ADD REPLY • link 6.7 years ago by lkianmehr ▴ 100

0

Entering edit mode

I see. In that case, perhaps consider aligning with STAR, which can give you both the genome and transcriptome alignments at the same time. The genome alignments can be counted using featureCounts, and the transcriptome alignments can be fed to salmon for transcript-level quantification. Also, it's completely possible to quantify any transcripts you want using salmon directly, just include them in the input transcriptome file. For example, if you take the comprehensive transcript set from gencode, it includes long non-coding RNAs.

ADD REPLY • link 6.7 years ago by Rob 7.1k

0

Entering edit mode

yes, you are right, I did transcript quantification on Salmon directly, then import it to tximport and continued with DE analysis. but I am afraid if need to have transcript or gene quantification based on alignment because in most of the articles report genes based on alignment. also, that would be needed for visualizing.

ADD REPLY • link 6.7 years ago by lkianmehr ▴ 100

1

Entering edit mode

In this case, as I suggested, the easiest path is to align with STAR and keep both the genome alignments and get the transcriptome alignments. As swbarnes2 mentions below, you can even use STARs --quantMode for the genome-level counts. Note, however, that you won't be able to get transcript level quantifications from featureCounts (or STAR --quantMode). Also, aggregating transcript-level abundances to the gene level is likely to be more accurate (even for gene level analysis) than counting approaches like featureCounts / htseq-count.

ADD REPLY • link 6.7 years ago by Rob 7.1k

0

Entering edit mode

Excuse me, to that purpose I should make STAR index for both genome and transcriptome? if so may I use fasta files for each one, for example, GRCm38.primary_assembly.genome.fa.gz and transcripts.fa.gz and one Comprehensive gene annotation from Gencode should be used. am I right?

ADD REPLY • link 6.7 years ago by lkianmehr ▴ 100

1

Entering edit mode

No. STAR has the ability to perform a single alignment against genome (and gtf), and to provide 2 separate outputs using the same alignments. One BAM file contains the spliced alignments to the genome. The other BAM file contains the unspliced alignments to the transcriptome (suitable to provide to salmon). You should comsult the STAR documentation to see the command line parameters to feed to STAR to do this. The last thing yoi will need is the transcript sequence corresponding to the gtf you used. You can do this with the genome sequence and the tool gffread.

ADD REPLY • link 6.7 years ago by Rob 7.1k

0

Entering edit mode

Sorry I have one another problem with STAR, since I want to run several reads to be aligned at the same time, faced with this error Possible cause 1: not enough RAM. Check if you have enough RAM 28710272120 bytes Possible cause 2: not enough virtual memory allowed with ulimit. SOLUTION: run ulimit -v 28710272120 meanwhile defined separate --outFileNamePrefix for each one, do you have any idea? beacause takes long time to run one by one!

ADD REPLY • link 6.7 years ago by lkianmehr ▴ 100

1

Entering edit mode

You can reduce the memory required by star by building the index with the --genomeSAsparseD parameter set to a value greater than 1. This samples the suffix array. If the value is too large, STAR will become slower, but small values (e.g. 2/4) can provide a nice memory savings without slowing things down too much. Beyond that, I'd ask for recommendations on the STAR user group. Alex, the developer and maintainer of STAR, is usually very responsive.

ADD REPLY • link 6.7 years ago by Rob 7.1k

0

Entering edit mode

I did it successfully! thanks

ADD REPLY • link 6.7 years ago by lkianmehr ▴ 100

0

Entering edit mode

but changing --genomeSAsparseD to 4 also did not generate any difference, and can not align several at the same time! I have to do it separately!

ADD REPLY • link 6.7 years ago by lkianmehr ▴ 100

0

Entering edit mode

Newer versions of STAR will output gene counts for you with the --quantMode option. This should mimic the behavior of htseq-count. https://github.com/alexdobin/STAR/blob/master/RELEASEnotes.md

ADD REPLY • link 6.7 years ago by swbarnes2 15k