Question: Transcriptome Assembly Questions
tawares07 (Brazil) wrote, 19 months ago:

Hi guys!

I have been using genome-guided transcriptome assembly to identify novel alternatively spliced transcripts in the human transcriptome. After running the mapping and assembly steps, I have some questions whose answers might improve my results or reduce some of the "noise" in them:

1) For mapping, do you use scaffolds and chrM or only chr1-chr22,chrX and chrY?

2) In my GENCODE GTF file I have annotations from both mRNAs and non-coding RNAs. Do you remove annotations from non-coding RNAs?

3) For transcriptome assembly (in my case, StringTie), what is the minimum coverage or depth to consider a transcriptome assembled?

I would be glad if you could share your experience and help me improve my research.

Best, Raphael

Kevin Blighe wrote, 19 months ago:

Hello Raphael, good afternoon (I speak Portuguese fluently)

1) For mapping, do you use scaffolds and chrM or only chr1-chr22,chrX and chrY?

If you have no intention of researching chrM, other scaffolds, or the sex chromosomes, then you can justify removing them - it depends on what your aims are. However, won't StringTie then try to assemble them anyway (if reads from these chromosomes are in your data)? It depends on the behaviour of StringTie when you use a genome-guided assembly.

I note that StringTie, if you supply a reference GTF file, will report normalised abundances (FPKM and TPM) for the GTF transcripts. These normalised values will be influenced by the presence of chrM, chrX, chrY, etc., but only slightly. For raw coverage (raw counts), it makes no difference, as that is just counting reads over each position without normalising them.
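
To get a feel for how much the excluded chromosomes can shift normalised values, here is a minimal sketch using the textbook FPKM definition (fragments per kilobase of transcript per million mapped fragments). It is not StringTie's internal code, and the library size, chrM fraction and transcript are made-up numbers for illustration only.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

# Hypothetical library: 20 million mapped fragments, 5% of them on chrM.
total_fragments = 20_000_000
chrm_fragments = 1_000_000

# Hypothetical nuclear transcript: 2,000 bp long, covered by 400 fragments.
with_chrm = fpkm(400, 2_000, total_fragments)
without_chrm = fpkm(400, 2_000, total_fragments - chrm_fragments)

print(f"FPKM with chrM in the total:    {with_chrm:.2f}")
print(f"FPKM without chrM in the total: {without_chrm:.2f}")
print(f"Relative change: {without_chrm / with_chrm - 1:.1%}")
```

The raw fragment count for the transcript (400) is identical in both cases; only the denominator changes, which is why raw counts are unaffected while normalised values shift.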

Take a close look at the -x parameter of StringTie:

-x <seqid_list>: Ignore all read alignments (and thus do not attempt to perform transcript assembly) on the specified reference sequences. Parameter <seqid_list> can be a single reference sequence name (e.g. -x chrM) or a comma-delimited list of sequence names (e.g. -x 'chrM,chrX,chrY'). This can speed up StringTie especially in the case of excluding the mitochondrial genome, whose genes may have very high coverage in some cases, even though they may be of no interest for a particular RNA-Seq analysis. The reference sequence names are case sensitive, they must match identically the names of chromosomes/contigs of the target genome against which the RNA-Seq reads were aligned in the first place.

Source: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual
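
Because the names passed to -x must match the reference exactly, one option is to derive the list from the alignment header instead of typing it by hand. The sketch below is only an illustration: `sequences_to_exclude` and `PRIMARY` are hypothetical names, the header lines are a toy example of what `samtools view -H` prints, and the keep set follows the chr1-chr22, chrX, chrY choice from the question.

```python
# Sequences to keep: chr1-chr22 plus the sex chromosomes (an assumption
# taken from the question; adjust to your own aims).
PRIMARY = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}

def sequences_to_exclude(sam_header_lines, keep=PRIMARY):
    """Return a comma-joined list of @SQ sequence names that are not in `keep`."""
    names = []
    for line in sam_header_lines:
        if not line.startswith("@SQ"):
            continue  # only @SQ lines describe reference sequences
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return ",".join(name for name in names if name not in keep)

# Toy header, in the format printed by `samtools view -H aligned.bam`.
header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@SQ\tSN:chrM\tLN:16569",
    "@SQ\tSN:chrUn_KI270742v1\tLN:186739",
]
print(sequences_to_exclude(header))
```

The printed string can be passed as the value of -x. Names are compared case-sensitively, in line with the manual's note.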

2) In my GENCODE GTF file I have annotations from both mRNAs and non-coding RNAs. Do you remove annotations from non-coding RNAs?

It is no problem keeping the ncRNAs. They are genes like every other gene. Many small ncRNAs (e.g. miRNAs and snoRNAs) are annotated with a single exon, whereas lncRNAs are often multi-exonic and spliced, just like mRNAs. Any good transcriptome assembler will be able to distinguish the boundary between one gene and another.
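
If you nevertheless want to compare against a coding-only annotation, a minimal sketch could look like the following. It assumes GENCODE-style records that carry a `gene_type` attribute (Ensembl GTFs use `gene_biotype` instead); `filter_gtf` and the toy records are hypothetical and only for illustration.

```python
import re

# GENCODE GTFs store the biotype as: gene_type "protein_coding";
GENE_TYPE = re.compile(r'gene_type "([^"]+)"')

def filter_gtf(lines, keep_types=("protein_coding",)):
    """Yield comment lines and records whose gene_type is in keep_types."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed lines
        match = GENE_TYPE.search(fields[8])
        if match and match.group(1) in keep_types:
            yield line

# Toy records (not real annotations).
gtf = [
    "##description: toy example\n",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";\n',
    'chr1\tHAVANA\tgene\t2000\t2500\t.\t-\t.\tgene_id "G2"; gene_type "lncRNA";\n',
]
kept = list(filter_gtf(gtf))
print(len(kept), "lines kept")
```

For a real file you would pass an open file handle instead of the list and write the yielded lines to a new GTF.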

3) For transcriptome assembly (in my case, StringTie), what is the minimum coverage or depth to consider a transcriptome assembled?

Do you mean average coverage across an entire transcriptome or coverage over an individual transcript? Genome-guided assembly with TopHat/Cufflinks or StringTie is different from de novo assembly with tools like Velvet/Oases, because you typically use a reference genome FASTA and GTF with the genome-guided tools. The key StringTie parameters are -j (minimum junction coverage), -c (minimum read coverage for predicted transcripts) and -B (output of Ballgown table files).
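
As a rough illustration of how these options fit together, the sketch below builds (but does not run) a StringTie command line. The helper `stringtie_command`, the file names and the threshold values 3 and 5 are placeholders I made up, not recommendations; sensible cut-offs depend on your sequencing depth and StringTie version.

```python
import shlex

def stringtie_command(bam, reference_gtf, out_gtf,
                      min_junction_cov=None, min_read_cov=None,
                      ballgown=False, exclude=None):
    """Build a StringTie argument list without executing it."""
    cmd = ["stringtie", bam, "-G", reference_gtf, "-o", out_gtf]
    if min_junction_cov is not None:
        cmd += ["-j", str(min_junction_cov)]  # minimum junction coverage
    if min_read_cov is not None:
        cmd += ["-c", str(min_read_cov)]      # minimum read coverage per transcript
    if ballgown:
        cmd.append("-B")                      # write Ballgown table files
    if exclude:
        cmd += ["-x", ",".join(exclude)]      # reference sequences to ignore
    return cmd

cmd = stringtie_command("sample.bam", "gencode.gtf", "sample.stringtie.gtf",
                        min_junction_cov=3, min_read_cov=5,
                        ballgown=True, exclude=["chrM"])
print(shlex.join(cmd))
```

Running the same sample with a few different -j and -c values and comparing the number of novel transcripts reported is one way to see how sensitive your results are to these thresholds.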

Good luck, man!

Best wishes, Kevin
