Question: How to perform strand-aware (RNA-Seq) mapping using STAR?
0
gravatar for c_u
11 months ago by
c_u280
United States
c_u280 wrote:

Hi,

I believe a similar question has been asked before (Question: Is Star Managing Strand-Specific Data ? ), but in the answer to that, it was mentioned that we don't need to do anything extra in STAR alignment if we have strand-aware RNA-seq data. But at another place (Are any additional parameters for STAR to hand the stranded RNA-Seq data?), the author of STAR mentions that presently STAR doesn't use strand information for mapping, and that if we use a certain option (--quantMode), it provides output for 3 kinds of strandedness).

So which is it? Does STAR perform strand-aware mapping, which we can use downstream to get antisense transcripts, or if not, then what is the best way to achieve that? I am working with Tru-Seq Stranded Total RNA library, which apparently maps to the opposite strand of the genome (opposite to what you would normally expect).

I am primarily interested in finding novel non-coding transcripts in my data

rna-seq • 1.9k views
ADD COMMENTlink modified 11 months ago • written 11 months ago by c_u280
2
gravatar for h.mon
11 months ago by
h.mon31k
Brazil
h.mon31k wrote:

strand-aware RNA-seq data

There is no such thing. There are strand-specific library preparations, though, which generate strand-specific (relative to the mRNA) reads, which is probably what you mean.

Does STAR perform strand-aware mapping, which we can use downstream to get antisense transcripts

You can't get transcripts, sense or antisense - you can get mappings and counts from STAR. What are the downstream analyses you intend to perform?

STAR maps reads to an indexed (most commonly annotated) reference genome, and it doesn't care about transcript strandedness. From these mappings, it can output many different things, which can be used for several downstream analyses. The three most commonly used outputs from STAR are:

  1. read mappings to the genome, in BAM format. By default, strandedness has no importance: STAR maps the reads to the best location, regardless of strand, gene location or whatever. However, the parameter --outSAMstrandField intronMotif - which should be used only if you want to analyse downstream with Cufflinks, and have unstranded RNAseq dara - alters the default behaviour and adds XS strand attribute. Read the section Compatibility with Cufflinks/Cuffdiff from STAR's manual. If you have stranded data, you don't have to change anything in STAR.

  2. read mappings to the transcriptome, in BAM format. Those are to be used with RSEM, mostly.

  3. counts of reads mapping to genes, similar to HTSeq or featureCounts. Here STAR outputs three columns of counts, one for unstranded, one for stranded, and one for reverse stranded.

Most of these points were already clearly made by Alex Dobin in the thread Are any additional parameters for STAR to hand the stranded RNA-Seq data?, and the STAR manual.

ADD COMMENTlink written 11 months ago by h.mon31k

Thanks a lot for the detailed answer. I had a question -

What are the downstream analyses you intend to perform?

I basically want to find novel non-coding transcripts in my data. For this, I use STAR for mapping, then use Stringtie for transcript assembly (since it can generate novel transcripts nicely), and then use Tximport that takes the abundance values from Stringtie and creates counts that can be fed to DESeq2 (chose tximport because it provides a way to integrate Stringtie results with DESeq2).

So, now that you mention that STAR mapping itself is strand-agnostic, when/how can I use the strand info so as to be able to do better counting? (Note that my focus is less on genes and more on (novel) non-coding transcripts). Thanks a lot!!

ADD REPLYlink modified 11 months ago • written 11 months ago by c_u280

Use StringTie quantification. Or use featureCounts with the StringTie GTF to quantify know and novel transcripts. Or extract the StringTie transcriptome to fasta and quantify with Salmon or kallisto.

The following thread might be of interest: Unspecific strand.

ADD REPLYlink written 11 months ago by h.mon31k
2
gravatar for swbarnes2
11 months ago by
swbarnes28.9k
United States
swbarnes28.9k wrote:

STAR does not use strand information for mapping and it will output gene counts for 3 kinds of strandedness. There is no contradiction here. Just run it and look for yourself.

ADD COMMENTlink written 11 months ago by swbarnes28.9k

Thanks a lot Mr. Barnes! I am interested in finding novel non-coding transcripts in my data. Given that STAR doesn't use strand info for mapping, is there a way to use that info to better quantify these novel non-coding transcripts? I just thought using the strand info might improve this quantification

ADD REPLYlink modified 11 months ago • written 11 months ago by c_u280

If STAR doesn't use stand information for mapping, then am I right to assume that it somehow stores the information about the strand of the mapped read, somewhere in the bam file, which can then pass on this information to the quantifier (like Stringtie/featurecounts), which can then do the quantification based on strand information?

ADD REPLYlink modified 11 months ago • written 11 months ago by c_u280

Have you really not read the sam format specifications?

ADD REPLYlink written 11 months ago by swbarnes28.9k

Just did, found that strand information of the read is stored in one of the bitwise flags. Thank you for that! The reason I was asking is that some mappers do map in a strand-aware fashion (tophat2 for example), whereas STAR doesn't. So I was wondering that if STAR doesn't care about the strand of the read, how does it still report the correct strand in the BAM file (I can understand how an aligner that is strand-aware would be able to do that). But maybe I can ask another question for this, I don't want to unnecessarily trouble you with too many comment questions!

ADD REPLYlink written 11 months ago by c_u280

Do you have clear in your mind the difference between the strand STAR reports and the strand from strand-specific library preparations?

STAR is reporting the strand the read is mapped onto the reference genome.

Stranded library preparations select a specific strand from the mRNA to be sequenced (and, if performing paired-end sequencing, the second read will be on the opposite strand from the first read).

One is different from the other. One can use the strand STAR reports together with the information from which strand (in relation to the reference genome) the gene is transcribed to quantify strand-specific expression - this is exactly what featureCounts and HTSeq do.

edit: I think the post Antisense transcription from strand specific RNA-seq may be very pertinent to you.

ADD REPLYlink modified 11 months ago • written 11 months ago by h.mon31k
2
gravatar for benformatics
11 months ago by
benformatics2.0k
ETH Zurich
benformatics2.0k wrote:

The mapping is independent of the quantification.

The quantification is stranded.

Counting number of reads per gene. With --quantMode GeneCounts option STAR will count number reads per gene while mapping. A read is counted if it overlaps (1nt or more) one and only one gene. Both ends of the pairedend read are checked for overlaps. The counts coincide with those produced by htseq-count with default parameters. This option requires annotations (GTF or GFF with –sjdbGTFfile option) used at the genome generation step, or at the mapping step. STAR outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options:

column 1: gene ID

column 2: counts for unstranded RNA-seq

column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)

column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)

Select the output according to the strandedness of your data. Note, that if you have stranded data and choose one of the columns 3 or 4, the other column (4 or 3) will give you the count of antisense reads. With --quantMode TranscriptomeSAM GeneCounts, and get both the Aligned.toTranscriptome.out.bam and ReadsPerGene.out.tab outputs.

You should look at column 4 for your counts (i.e. anti-sense reads to the given genomic feature).

ADD COMMENTlink written 11 months ago by benformatics2.0k

Thanks a lot! I am interested in finding novel non-coding transcripts. In that case, is it still column 4 that has all the relevant counts? I plan to use Stringtie with the BAM files to do transcript assembly, followed by tximport for generating counts for DESeq2. In that case, where, in this pipeline do I need to do anything to be able to use the extra info that a strand-specific library provides? Thanks a lot!

ADD REPLYlink written 11 months ago by c_u280
1

If you are searching for novel non-coding transcripts than doing this type analysis is not the right way to do it.

You would need to assemble a transcriptome. If you want to search for novel anti-sense transcripts then you could take a look at the non-reverse strand reads (column 3). If you are simply searching for non-coding RNA that are already annotated than you should stick with column 4.

ADD REPLYlink written 11 months ago by benformatics2.0k

That was very helpful, thank you! Why do you think this type of analysis is not the right way to go about it? You mention assembling a transcriptome, and I am basically using Stringtie to do exactly that. Then, once we have a unified transcriptome that represents all the samples, Stringtie gives abundance values for all those transcripts in each sample (https://www.nature.com/articles/nprot.2016.095#procedure), and then I use tximport to export those abundance values to counts that DESeq2 can use.

I specifically used Strigtie because it can assemble novel transcripts too while doing the transcript reconstruction. Is there anything wrong in this method?

ADD REPLYlink written 11 months ago by c_u280

I was assuming your reference was a standard one (e.g. UCSC/Ensembl) but if you did the assembly correctly then it should work.

ADD REPLYlink written 11 months ago by benformatics2.0k

One quick question - If STAR doesn't use stand information for mapping, then am I right to assume that it somehow stores the information about the strand of the mapped read, somewhere in the bam file, which can then pass on this information to the quantifier (like Stringtie/featurecounts), which can then do the quantification based on strand information?

ADD REPLYlink modified 11 months ago • written 11 months ago by c_u280
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2100 users visited in the last hour