Which genome fasta file and GTF file to be used in the RNA-Seq analysis
4.8 years ago
wangdp123 ▴ 340

Hi there,

There are two types of genome FASTA files for human in the GENCODE database (https://www.gencodegenes.org/human/):

  1. Genome sequence (GRCh38.p12) - ALL: Nucleotide sequence of the GRCh38.p12 genome assembly version on all regions, including reference chromosomes, scaffolds, assembly patches and haplotypes
  2. Genome sequence, primary assembly (GRCh38) - PRI: Nucleotide sequence of the GRCh38 primary genome assembly (chromosomes and scaffolds)

Also, there are five types of GTF files:

  1. Comprehensive gene annotation - CHR: It contains the comprehensive gene annotation on the reference chromosomes only
  2. Comprehensive gene annotation - ALL: It contains the comprehensive gene annotation on the reference chromosomes, scaffolds, assembly patches and alternate loci (haplotypes)
  3. Comprehensive gene annotation - PRI: It contains the comprehensive gene annotation on the primary assembly (chromosomes and scaffolds) sequence regions
  4. Basic gene annotation - CHR: It contains the basic gene annotation on the reference chromosomes only
  5. Basic gene annotation - ALL: It contains the basic gene annotation on the reference chromosomes, scaffolds, assembly patches and alternate loci (haplotypes)

The purpose of my study is to identify the differential expression of human protein-coding genes and long non-coding RNA genes.

I wonder which genome file and which GTF file should be used for this aim?

Many thanks,
Tom

RNA-Seq gencode • 8.3k views

I want to analyze tRNA fragments from small RNA sequencing samples. Which type of annotation file should I use? GENCODE includes tRNA.gtf, as well. Should I use it?


Why was this added as an answer to a 5-year-old question? I'm moving it to a comment. Please open a new question, and in the future add answers only when you're actually answering the top-level question.

4.8 years ago

Hi Tom,

GTF

For lncRNAs and protein-coding mRNAs, you can use the first choice, i.e., 'Comprehensive gene annotation - CHR'. However, be aware that this file also contains transcripts of various other biotypes. Information on biotypes can be found here: https://www.gencodegenes.org/pages/biotypes.html

[Edit: the 'Basic gene annotation' also contains lncRNAs, but fewer isoforms - see comment trail]

The 'ALL' and 'PRI' equivalents of this file should also contain lncRNAs and protein-coding mRNAs; however, you likely do not require these files.

The Description field on the GENCODE website actually does a good job of explaining the contents of the files.
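If the other biotypes get in the way, one option is to subset the GTF before counting. The following is only a sketch: the function name `filter_gtf_biotypes` is made up for illustration, and it assumes the `gene_type` values are spelled `protein_coding` and `lncRNA`, as they appear in the v31 counts further down this thread (older releases may label lncRNA classes differently).

```shell
# Hypothetical helper: keep GTF header lines plus every record whose
# gene_type attribute is protein_coding or lncRNA.
# Usage: filter_gtf_biotypes gencode.v31.annotation.gtf.gz > subset.gtf
filter_gtf_biotypes() {
    gzip -cd "$1" | awk -F'\t' '
        /^#/ { print; next }                                  # keep header comments
        $9 ~ /gene_type "(protein_coding|lncRNA)"/ { print }  # column 9 holds the attributes
    '
}
```

Whether to filter before or after quantification is a design choice; filtering afterwards keeps the full annotation available for checking where reads went.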


Reference FASTA

If your aim is to use a 'pseudo' aligner like Kallisto or Salmon, then you just need the 'Transcript sequences - CHR' FASTA. If, however, your workflow requires a genome FASTA, e.g., alignment with TopHat or HISAT followed by read counting with HTSeq, then the best choice for most cases is 'Genome sequence, primary assembly (GRCh38) - PRI'.

Further information on the genomes here: http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use
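
For the pseudo-aligner route, a transcript-to-gene table is usually needed later to summarise transcript estimates per gene. A rough sketch of building one from the transcript FASTA is below. It assumes the headers are pipe-delimited with the transcript ID first and the gene ID second, so check a few headers in your own download before relying on it. The function name `tx2gene_from_fasta` is hypothetical.

```shell
# Hypothetical helper: print "transcript_id<TAB>gene_id" for every record
# in a gzipped transcript FASTA.
# Assumption: headers look like ">transcript_id|gene_id|...".
# Usage: tx2gene_from_fasta gencode.v31.transcripts.fa.gz > tx2gene.tsv
tx2gene_from_fasta() {
    gzip -cd "$1" | awk -F'|' '
        /^>/ {
            sub(/^>/, "", $1)    # drop the leading ">" from the first field
            print $1 "\t" $2
        }
    '
}
```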

Kevin


Thanks for the answer.

I have two further questions about this.

1) Should we keep the GTF file and reference FASTA file consistent? For example, if we choose "Genome sequence, primary assembly (GRCh38) - PRI" for the FASTA file, do we have to choose "Comprehensive gene annotation - PRI" for the GTF file?

2) Is "Basic gene annotation - CHR" or "Basic gene annotation - ALL" a better choice for the GTF file, as it only contains transcripts tagged "basic"? (Based on my understanding, transcripts without the "basic" tag are not of high-quality annotation.)


In answer to your '1)', it is not critical; however, it is good practice to keep them matched as closely as possible. Some programs do not even require the GTF, only the transcriptome FASTA, which has annotation information encoded in the FASTA headers.
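
A quick way to check how well a FASTA and a GTF match is to compare their sequence names. The sketch below uses a hypothetical function name, `gtf_seqnames_missing_from_fasta`, and assumes both files are gzipped. It lists names that occur in the GTF but have no record in the genome FASTA; empty output means every annotated sequence is present.

```shell
# Hypothetical helper: list sequence names used in the GTF that are absent
# from the genome FASTA.
# Usage: gtf_seqnames_missing_from_fasta genome.fa.gz annotation.gtf.gz
gtf_seqnames_missing_from_fasta() {
    fasta_names=$(mktemp)
    # FASTA sequence name = first whitespace-delimited token after ">"
    gzip -cd "$1" | awk '/^>/ { sub(/^>/, "", $1); print $1 }' | sort -u > "$fasta_names"
    # GTF sequence name = column 1 of every non-comment line;
    # comm -23 keeps the names that occur only in the GTF list
    gzip -cd "$2" | awk -F'\t' '!/^#/ { print $1 }' | sort -u | comm -23 - "$fasta_names"
    rm -f "$fasta_names"
}
```

Features on sequences missing from the FASTA can never receive reads, so a non-empty list is worth a look even if the tools run without complaint.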

For '2)', the Basic gene annotation contains only transcripts with the tag 'basic'. I do not know how many other tags exist, but there are probably varying levels of evidence behind each tag.

The reality is that the Comprehensive and Basic annotations contain the same number of genes for every gene type:

zcat gencode.v31.basic.annotation.gtf.gz | cut -f2 -d';' | grep -e "gene_type" | sort | uniq -c 
     14  gene_type "IG_C_gene"
      9  gene_type "IG_C_pseudogene"
     37  gene_type "IG_D_gene"
     18  gene_type "IG_J_gene"
      3  gene_type "IG_J_pseudogene"
      1  gene_type "IG_pseudogene"
    144  gene_type "IG_V_gene"
    188  gene_type "IG_V_pseudogene"
  16840  gene_type "lncRNA"
   1881  gene_type "miRNA"
   2212  gene_type "misc_RNA"
      2  gene_type "Mt_rRNA"
     22  gene_type "Mt_tRNA"
     42  gene_type "polymorphic_pseudogene"
  10175  gene_type "processed_pseudogene"
  19975  gene_type "protein_coding"
     18  gene_type "pseudogene"
      8  gene_type "ribozyme"
     52  gene_type "rRNA"
    500  gene_type "rRNA_pseudogene"
     49  gene_type "scaRNA"
      1  gene_type "scRNA"
    942  gene_type "snoRNA"
   1901  gene_type "snRNA"
      5  gene_type "sRNA"
   1064  gene_type "TEC"
    491  gene_type "transcribed_processed_pseudogene"
    129  gene_type "transcribed_unitary_pseudogene"
    918  gene_type "transcribed_unprocessed_pseudogene"
      2  gene_type "translated_processed_pseudogene"
      2  gene_type "translated_unprocessed_pseudogene"
      6  gene_type "TR_C_gene"
      4  gene_type "TR_D_gene"
     79  gene_type "TR_J_gene"
      4  gene_type "TR_J_pseudogene"
    106  gene_type "TR_V_gene"
     33  gene_type "TR_V_pseudogene"
     97  gene_type "unitary_pseudogene"
   2628  gene_type "unprocessed_pseudogene"
      1  gene_type "vaultRNA"

zcat gencode.v31.annotation.gtf.gz | cut -f2 -d';' | grep -e "gene_type" | sort | uniq -c 
     14  gene_type "IG_C_gene"
      9  gene_type "IG_C_pseudogene"
     37  gene_type "IG_D_gene"
     18  gene_type "IG_J_gene"
      3  gene_type "IG_J_pseudogene"
      1  gene_type "IG_pseudogene"
    144  gene_type "IG_V_gene"
    188  gene_type "IG_V_pseudogene"
  16840  gene_type "lncRNA"
   1881  gene_type "miRNA"
   2212  gene_type "misc_RNA"
      2  gene_type "Mt_rRNA"
     22  gene_type "Mt_tRNA"
     42  gene_type "polymorphic_pseudogene"
  10175  gene_type "processed_pseudogene"
  19975  gene_type "protein_coding"
     18  gene_type "pseudogene"
      8  gene_type "ribozyme"
     52  gene_type "rRNA"
    500  gene_type "rRNA_pseudogene"
     49  gene_type "scaRNA"
      1  gene_type "scRNA"
    942  gene_type "snoRNA"
   1901  gene_type "snRNA"
      5  gene_type "sRNA"
   1064  gene_type "TEC"
    491  gene_type "transcribed_processed_pseudogene"
    129  gene_type "transcribed_unitary_pseudogene"
    918  gene_type "transcribed_unprocessed_pseudogene"
      2  gene_type "translated_processed_pseudogene"
      2  gene_type "translated_unprocessed_pseudogene"
      6  gene_type "TR_C_gene"
      4  gene_type "TR_D_gene"
     79  gene_type "TR_J_gene"
      4  gene_type "TR_J_pseudogene"
    106  gene_type "TR_V_gene"
     33  gene_type "TR_V_pseudogene"
     97  gene_type "unitary_pseudogene"
   2628  gene_type "unprocessed_pseudogene"
      1  gene_type "vaultRNA"

Ultimately, it is your task to decide which is best for your experiment, and you can do that by looking inside each GTF and FASTA reference sequence.
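
As one way of looking inside a GTF, the sketch below counts transcript records and how many of them carry the 'basic' tag. It assumes the tag is written as `tag "basic";` in the attribute column, and the function name `count_basic_tagged_transcripts` is made up for this example.

```shell
# Hypothetical helper: report the number of transcript records in a gzipped
# GTF and how many of them carry the attribute tag "basic".
# Usage: count_basic_tagged_transcripts gencode.v31.annotation.gtf.gz
count_basic_tagged_transcripts() {
    gzip -cd "$1" | awk -F'\t' '
        $3 == "transcript" {                  # column 3 is the feature type
            total++
            if ($9 ~ /tag "basic"/) basic++
        }
        END { printf "transcripts=%d basic=%d\n", total, basic }
    '
}
```

Running it on both the Comprehensive and the Basic file should show where the two differ, given that the gene counts above are identical.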
