Question: Question about multi FASTA input for eXpress
Asked 4.1 years ago by bharata1803 (Japan):

Hello,

What does point 1 of this explanation from the eXpress website mean?

eXpress requires two input files:

  1. A multi-FASTA file containing the transcript sequences. If the transcriptome of your organism is not annotated, you can generate this file from your sequencing reads using a de novo transcriptome assembler such as Trinity, Oases, or Trans-ABySS. If your organism has a reference genome you can assemble transcripts directly from mapped reads using Cufflinks. If your genome is already annotated (in GTF/GFF), you can generate a multi-FASTA file using the UCSC Genome Browser by uploading your annotation as a track and downloading the sequences under the "Tables" tab.

  2. Read alignments to the multi-FASTA file in SAM or BAM format. These can either be stored in a file or streamed directly from an aligner. It is important that you allow as many multi-mappings as possible. You can also allow many mismatches during mapping since eXpress builds an error model to probabilistically assign the reads, although this will increase mapping time. If you are combining reads from several library preparations or from sequencing runs using different read lengths, please see the Manual for important details on how the alignments should be input.

I want to use the human genome hg38. I downloaded the genome sequence in FASTA format and the gene annotation file in GTF format, but I don't know what to do next because using these two files directly does not seem to work with eXpress. Point 1 also says I can upload the annotation to the UCSC Genome Browser and download the sequences, but I don't know how to do that. So which file should I use as the multi-FASTA file for point 1? Also, can someone explain the difference between a transcriptome sequence and a genome sequence? Thank you in advance.

rna-seq express • 1.8k views

You will need the FASTA file for the transcripts, not the genome. You can get the cDNA (transcript) FASTA from Ensembl.

The main difference is that the genome sequence file holds the genome as it is, with one entry per chromosome or contig, whereas in a transcriptome file each entry is an individual transcript derived from the genome.

For example, if we have the following in the genome:

Gene A                    |Exon1|--------|Exon2|--------|Exon3|

The transcript sequence file might then contain entries like:

Transcript 1 (of gene A)  |Exon1||Exon2|

Transcript 2 (of gene A)  |Exon1||Exon3|

Transcript 3 (of gene A)  |Exon2||Exon3|

Transcript 4 (of gene A)  |Exon1||Exon2||Exon3|

The genome sequence file, by contrast, just contains the whole sequence, including the introns.
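
To make the diagram concrete, here is a minimal Python sketch of how transcript entries relate to a genomic region. Everything in it is made up for illustration: the sequence, the exon coordinates, and the names `gene_region`, `splice` and `to_multi_fasta` are hypothetical. It also ignores strand, so a real gene on the reverse strand would additionally need reverse-complementing.

```python
# Toy data: the sequence and exon coordinates are invented for illustration.
# Exons are upper case and introns lower case so the difference is visible.
gene_region = "ATGGCC" + "gtaagt" + "TTCGAC" + "gtcagt" + "GGATAA"

# 0-based, half-open (start, end) coordinates of each exon in gene_region.
exons = {"Exon1": (0, 6), "Exon2": (12, 18), "Exon3": (24, 30)}

# Which exons each transcript of Gene A uses, as in the diagram above.
transcripts = {
    "transcript_1": ["Exon1", "Exon2"],
    "transcript_2": ["Exon1", "Exon3"],
    "transcript_3": ["Exon2", "Exon3"],
    "transcript_4": ["Exon1", "Exon2", "Exon3"],
}


def splice(sequence, exon_coords, exon_names):
    """Concatenate the listed exons in order, skipping the introns."""
    return "".join(
        sequence[start:end]
        for start, end in (exon_coords[name] for name in exon_names)
    )


def to_multi_fasta(records):
    """Format a {name: sequence} mapping as multi-FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


spliced = {name: splice(gene_region, exons, names)
           for name, names in transcripts.items()}
multi_fasta = to_multi_fasta(spliced)
print(multi_fasta)
```

The printed text has one record per transcript and no intron (lower-case) bases, which is the kind of file eXpress expects as its first input. In practice you would download or generate this file rather than build it by hand.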

written 4.1 years ago by Sam

Thank you for your reply and explanation. It helped me a lot. By the way, I noticed that the gene annotation (GTF) in the UCSC Genome Browser can be downloaded as sequence in FASTA format. Is that also usable as eXpress input?

modified 4.1 years ago • written 4.1 years ago by bharata1803

You just have to make sure that the FASTA file looks something like

>transcript_1

ACTGATCG

>transcript_2

ACTAG

instead of something like

>chr1

ACTGACTG.........

>chr2

AATCACA............

I usually use the Ensembl reference, so I am not sure what your FASTA looks like. But as long as you know that the sequences are transcript sequences, it should be fine.
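
If you want a quick sanity check, a rough sketch like the following lists the record IDs of a FASTA file and flags the case where every ID looks like a chromosome name. The pattern `CHROM_LIKE` and the function names are my own assumptions, not part of eXpress or any library, and the check is only a heuristic: real transcript IDs vary by source (Ensembl, UCSC, assemblers), so treat a "looks fine" result as a hint rather than proof.

```python
import re

# Heuristic, assumed pattern for chromosome-style names such as
# chr1, chrX, 12, MT or chrUn_... ; adjust it for your own reference.
CHROM_LIKE = re.compile(r"^(chr)?(\d+|[XYM]|MT|Un.*)$", re.IGNORECASE)


def fasta_ids(lines):
    """Yield the ID (first word after '>') of every FASTA record."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def looks_like_genome(ids):
    """Return True if there are IDs and all of them look like chromosomes."""
    ids = list(ids)
    return bool(ids) and all(CHROM_LIKE.match(i) for i in ids)


# The two small examples from this thread, as in-memory lines.
transcript_fa = [">transcript_1", "ACTGATCG", ">transcript_2", "ACTAG"]
genome_fa = [">chr1", "ACTGACTG", ">chr2", "AATCACA"]

print(looks_like_genome(fasta_ids(transcript_fa)))
print(looks_like_genome(fasta_ids(genome_fa)))
```

For a real file you would pass an open file handle instead of a list, for example `with open("transcripts.fa") as fh: print(looks_like_genome(fasta_ids(fh)))`, where `transcripts.fa` is a placeholder name.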

written 4.1 years ago by Sam