Question: Which genome FASTA file and GTF file should be used for RNA-Seq analysis?
wangdp123 (Oxford) wrote, 12 days ago:

Hi there,

There are two types of human genome FASTA files in the GENCODE database (https://www.gencodegenes.org/human/):

1) Genome sequence (GRCh38.p12) - ALL: Nucleotide sequence of the GRCh38.p12 genome assembly version on all regions, including reference chromosomes, scaffolds, assembly patches and haplotypes

2) Genome sequence, primary assembly (GRCh38) - PRI: Nucleotide sequence of the GRCh38 primary genome assembly (chromosomes and scaffolds)

Also, there are five types of GTF files:

1) Comprehensive gene annotation - CHR: It contains the comprehensive gene annotation on the reference chromosomes only

2) Comprehensive gene annotation - ALL: It contains the comprehensive gene annotation on the reference chromosomes, scaffolds, assembly patches and alternate loci (haplotypes)

3) Comprehensive gene annotation - PRI: It contains the comprehensive gene annotation on the primary assembly (chromosomes and scaffolds) sequence regions

4) Basic gene annotation - CHR: It contains the basic gene annotation on the reference chromosomes only

5) Basic gene annotation - ALL: It contains the basic gene annotation on the reference chromosomes, scaffolds, assembly patches and alternate loci (haplotypes)

The purpose of my study is to identify the differential expression of human protein-coding genes and long non-coding RNA genes.

I wonder which genome file and which GTF file should be used for this aim?

Many thanks,

Tom

Tags: gencode, rna-seq
Kevin Blighe wrote, 12 days ago:

Hi Tom,

GTF

For lncRNAs and protein-coding mRNAs, you can use the first choice, i.e., 'Comprehensive gene annotation - CHR'. However, be aware that this file also contains transcripts of various other biotypes. Information on biotypes can be found here: https://www.gencodegenes.org/pages/biotypes.html

[Edit: the 'Basic gene annotation' also contains lncRNAs, but fewer isoforms - see comment trail]

The 'ALL' and 'PRI' equivalents should also contain lncRNAs and protein-coding mRNAs; however, you likely do not require these files.

The Description field on the GENCODE website actually does a good job of explaining the contents of the files.
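
If the other biotypes are not wanted downstream, the annotation can be subset before counting. The following is a minimal sketch, not a required step: it assumes GENCODE-style attributes in column 9 and the biotype names 'protein_coding' and 'lncRNA' as used in release 31 (older releases name lncRNA biotypes differently). The function name and the file names in the usage comment are hypothetical.

```shell
# Keep header comments plus every record whose gene_type is
# protein_coding or lncRNA. Reads an uncompressed GTF on stdin
# and writes the filtered GTF to stdout.
filter_pc_lnc() {
    awk -F '\t' '
        /^#/ { print; next }    # keep the "##" header lines
        $9 ~ /gene_type "(protein_coding|lncRNA)";/ { print }
    '
}

# Hypothetical usage (file names are assumptions):
#   zcat gencode.v31.annotation.gtf.gz | filter_pc_lnc > gencode.v31.pc_lnc.gtf
```

Filtering on gene_type keeps all features (gene, transcript, exon, etc.) that belong to the selected genes, so the result should remain usable as a GTF for counting tools.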


Reference FASTA

If your aim is to use a 'pseudo' aligner like kallisto or Salmon, then you just need the 'Transcript sequences - CHR' FASTA. If, however, your workflow requires a genome FASTA, e.g., alignment with TopHat or HISAT followed by read counting with HTSeq, then the best choice for most cases is 'Genome sequence, primary assembly (GRCh38) - PRI'.

Further information on the genomes here: http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use
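
For the transcript FASTA route, gene-level differential expression usually needs a transcript-to-gene table. A rough sketch of building one from the FASTA headers follows. It assumes the pipe-delimited GENCODE header layout (transcript ID, gene ID, two HAVANA IDs, transcript name, gene name, length, transcript type); please check the headers of your own file first. The function name and file names are hypothetical.

```shell
# Build a tab-separated table (transcript_id, gene_id, gene_name,
# transcript_type) from a GENCODE-style transcript FASTA on stdin.
# Assumed header layout:
#   >transcript_id|gene_id|havana_gene|havana_transcript|transcript_name|gene_name|length|transcript_type|
tx2gene_from_fasta() {
    awk -F '|' '
        /^>/ {
            sub(/^>/, "", $1)    # drop the leading ">" from the transcript ID
            print $1 "\t" $2 "\t" $6 "\t" $8
        }
    '
}

# Hypothetical usage (file names are assumptions):
#   zcat gencode.v31.transcripts.fa.gz | tx2gene_from_fasta > tx2gene.tsv
```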

Kevin


Thanks for the answer.

I have two further questions about this.

1) Should we keep the GTF file and reference FASTA file consistent? For example, if we choose "Genome sequence, primary assembly (GRCh38) - PRI" for the FASTA file, do we have to choose "Comprehensive gene annotation - PRI" for the GTF file?

2) Is "Basic gene annotation - CHR" or "Basic gene annotation - ALL" better for the GTF file, as it only contains transcripts tagged as "basic"? (Based on my understanding, transcripts without the "basic" tag are not high-quality annotations.)

(comment by wangdp123, written 10 days ago)

In answer to your '1)': it is not critical, but it is good practice to keep them matched as closely as possible. Some programs do not even require the GTF and need only the transcriptome FASTA, which has annotation information encoded in the FASTA headers.
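
One way to check how well a GTF and a genome FASTA are matched is to list the sequence names that the GTF uses but the FASTA lacks. This is only a sketch under simple assumptions: both files are uncompressed, and the sequence name is the first whitespace-delimited word of each FASTA header. The function name is hypothetical.

```shell
# Print sequence names that occur in the GTF (argument 1) but not in
# the genome FASTA (argument 2). Empty output means every annotated
# contig is present in the FASTA.
gtf_seqnames_missing_from_fasta() {
    gtf=$1
    fasta=$2
    awk '
        NR == FNR {                 # first file: the FASTA
            if (/^>/) { name = substr($1, 2); seen[name] = 1 }
            next
        }
        /^#/ { next }               # skip GTF header comments
        !($1 in seen) && !reported[$1]++ { print $1 }   # report each missing name once
    ' "$fasta" "$gtf"
}

# Hypothetical usage (file names are assumptions):
#   gtf_seqnames_missing_from_fasta annotation.gtf genome.fa
```

Features on sequences that are missing from the FASTA cannot receive aligned reads, which is the practical reason for keeping the two files matched.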

For '2)', the basic gene annotation contains transcripts with the tag 'basic'. I do not know how many other tags exist, but there are probably varying levels of evidence behind each tag.

The reality is that the Comprehensive and Basic annotations contain the same number of genes per gene type (GENCODE v31; Basic first, then Comprehensive):

zcat gencode.v31.basic.annotation.gtf.gz | cut -f2 -d';' | grep -e "gene_type" | sort | uniq -c 
     14  gene_type "IG_C_gene"
      9  gene_type "IG_C_pseudogene"
     37  gene_type "IG_D_gene"
     18  gene_type "IG_J_gene"
      3  gene_type "IG_J_pseudogene"
      1  gene_type "IG_pseudogene"
    144  gene_type "IG_V_gene"
    188  gene_type "IG_V_pseudogene"
  16840  gene_type "lncRNA"
   1881  gene_type "miRNA"
   2212  gene_type "misc_RNA"
      2  gene_type "Mt_rRNA"
     22  gene_type "Mt_tRNA"
     42  gene_type "polymorphic_pseudogene"
  10175  gene_type "processed_pseudogene"
  19975  gene_type "protein_coding"
     18  gene_type "pseudogene"
      8  gene_type "ribozyme"
     52  gene_type "rRNA"
    500  gene_type "rRNA_pseudogene"
     49  gene_type "scaRNA"
      1  gene_type "scRNA"
    942  gene_type "snoRNA"
   1901  gene_type "snRNA"
      5  gene_type "sRNA"
   1064  gene_type "TEC"
    491  gene_type "transcribed_processed_pseudogene"
    129  gene_type "transcribed_unitary_pseudogene"
    918  gene_type "transcribed_unprocessed_pseudogene"
      2  gene_type "translated_processed_pseudogene"
      2  gene_type "translated_unprocessed_pseudogene"
      6  gene_type "TR_C_gene"
      4  gene_type "TR_D_gene"
     79  gene_type "TR_J_gene"
      4  gene_type "TR_J_pseudogene"
    106  gene_type "TR_V_gene"
     33  gene_type "TR_V_pseudogene"
     97  gene_type "unitary_pseudogene"
   2628  gene_type "unprocessed_pseudogene"
      1  gene_type "vaultRNA"

zcat gencode.v31.annotation.gtf.gz | cut -f2 -d';' | grep -e "gene_type" | sort | uniq -c 
     14  gene_type "IG_C_gene"
      9  gene_type "IG_C_pseudogene"
     37  gene_type "IG_D_gene"
     18  gene_type "IG_J_gene"
      3  gene_type "IG_J_pseudogene"
      1  gene_type "IG_pseudogene"
    144  gene_type "IG_V_gene"
    188  gene_type "IG_V_pseudogene"
  16840  gene_type "lncRNA"
   1881  gene_type "miRNA"
   2212  gene_type "misc_RNA"
      2  gene_type "Mt_rRNA"
     22  gene_type "Mt_tRNA"
     42  gene_type "polymorphic_pseudogene"
  10175  gene_type "processed_pseudogene"
  19975  gene_type "protein_coding"
     18  gene_type "pseudogene"
      8  gene_type "ribozyme"
     52  gene_type "rRNA"
    500  gene_type "rRNA_pseudogene"
     49  gene_type "scaRNA"
      1  gene_type "scRNA"
    942  gene_type "snoRNA"
   1901  gene_type "snRNA"
      5  gene_type "sRNA"
   1064  gene_type "TEC"
    491  gene_type "transcribed_processed_pseudogene"
    129  gene_type "transcribed_unitary_pseudogene"
    918  gene_type "transcribed_unprocessed_pseudogene"
      2  gene_type "translated_processed_pseudogene"
      2  gene_type "translated_unprocessed_pseudogene"
      6  gene_type "TR_C_gene"
      4  gene_type "TR_D_gene"
     79  gene_type "TR_J_gene"
      4  gene_type "TR_J_pseudogene"
    106  gene_type "TR_V_gene"
     33  gene_type "TR_V_pseudogene"
     97  gene_type "unitary_pseudogene"
   2628  gene_type "unprocessed_pseudogene"
      1  gene_type "vaultRNA"
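
Since the gene counts are identical, any difference between the two files has to be at the transcript level. A quick sketch for checking this counts the transcript records that carry the 'basic' tag against all transcript records in a file. It assumes the tag appears as tag "basic"; in column 9; the function name is hypothetical.

```shell
# Count transcript records tagged "basic" and all transcript records.
# Reads an uncompressed GTF on stdin; prints "<basic> <total>".
count_basic_transcripts() {
    awk -F '\t' '
        $3 == "transcript" {
            total++
            if ($9 ~ /tag "basic";/) basic++
        }
        END { print basic + 0, total + 0 }   # "+ 0" prints 0 instead of an empty string
    '
}

# Hypothetical usage:
#   zcat gencode.v31.annotation.gtf.gz | count_basic_transcripts
```

Run on the Comprehensive file, the two numbers should differ; run on the Basic file, they would be expected to match.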

Ultimately, it is your task to decide which is best for your experiment, and you can do that by looking inside each GTF and FASTA reference sequence.

(reply by Kevin Blighe, written 10 days ago)