Convert fasta headers to GFF
2
1
Entering edit mode
4.9 years ago

I have Illumina mapped paired end reads (RNA seq) on human transcriptome (ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz) with tophat2 Now I have to analyze the results with HTseq. But I can't find the correponding GFF of the Homo_sapiens.GRCh38.cds.all.fa The headers of the fasta sequences seems to include all usefull informations :

>ENST00000631427.1 cds chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142362570:142363134:1 gene:ENSG00000282543.1 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:AC234635.3 description:T cell receptor beta variable 4-3  [Source:UniProtKB/Swiss-Prot;Acc:A0A589]
>ENST00000632148.1 cds chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142370836:142371348:1 gene:ENSG00000282353.1 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:AC234635.1 description:T cell receptor beta variable 6-3  [Source:UniProtKB/Swiss-Prot;Acc:P0DPF7]
>ENST00000631392.1 cds chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142374511:142375050:1 gene:ENSG00000282506.1 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:TRBV7-2 description:T cell receptor beta variable 7-2 [Source:HGNC Symbol;Acc:HGNC:12236]
>ENST00000455382.2 cds chromosome:GRCh38:7:142300924:142301432:1 gene:ENSG00000226660.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:TRBV2 description:T cell receptor beta variable 2 [Source:HGNC Symbol;Acc:HGNC:12195]
>ENST00000390387.3 cds chromosome:GRCh38:7:142308542:142309048:1 gene:ENSG00000237702.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:TRBV3-1 description:T cell receptor beta variable 3-1 [Source:HGNC Symbol;Acc:HGNC:12212]
>ENST00000390357.3 cds chromosome:GRCh38:7:142313184:142313666:1 gene:ENSG00000211710.3 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:TRBV4-1 description:T cell receptor beta variable 4-1 [Source:HGNC Symbol;Acc:HGNC:12215]

Any advices on how to parse these fasta headers to buikd the GFF ? Existing tool ?

Thanks !

RNA-Seq gff parsing tophat HTseq • 2.4k views
ADD COMMENT
1
Entering edit mode

That sounds like a very cumbersome approach. Why not simply look for the correct GFF file, that must be available for human. You might not have a GFF file for (only) the CDS though, but you don't need that, one that covers everything on the genome will do fine

ADD REPLY
0
Entering edit mode

Actually, I have serched, but If I look the Homo_sapiens.GRCh38.96.chr.gff3 some scaffold identifiers are missing : for example "CHR_HSCHR7_2_CTG6" If I look the Homo_sapiens.GRCh38.96.abinitio.gff3 thses identifiers are present but the CDS identifiers are diffrent: For example "GENSCAN00000017672" instead of "ENST00000631427"

:/

ADD REPLY
0
Entering edit mode

Human is not really my field of research but have you looked here as well? :

https://www.gencodegenes.org/human/

ADD REPLY
0
Entering edit mode

Thanks i.sudbery and Devon Ryan. :) I thinks I'm going to start over the mapping on the primary assembly of the human genome. And use STAR or Salmon

ADD REPLY
0
Entering edit mode

Use salmon followed by tximport. Computationally inexpensive while using current GC bias correction models and accounting for different transcript lengths when aggregating tx counts to the gene level.

ADD REPLY
0
Entering edit mode

use STAR for genome and Salmon for transcriptome and not mix them up ;)

ADD REPLY
3
Entering edit mode
4.9 years ago
  1. Use a GTF file instead, GFF is an awkward poorly supported "format".
  2. CHR_HSCHR7_2_CTG6 shouldn't be in your fasta file, you've incorrectly used the "top level" fasta file, when you should be using the primary assembly instead.
  3. Unless this is the continuation of a long-running project, use a more modern tool than tophat2 (e.g., STAR or salmon).
  4. HTseq is very slow, use featureCounts from the subRead package instead.
  5. You've aligned against the transcriptome, you can't use the BAM file with htseq-count or featureCounts (you can with RSEM or salmon, though).
ADD COMMENT
1
Entering edit mode

Hi @Devon, my apology for hijacking this thread. I'd tend to disagree with your opinion on GTF vs. GFF. While I have no particular affection for GFF, at least it comes with comprehensive format specifications - which I haven't seen for GTF, yet. Of course specs are only as good as their implementation, but fmpov GTF is the awkward "format". Is the human annotation better in its GTF version?

ADD REPLY
1
Entering edit mode

GFF might be a more formally complete format, but that doesn't change the fact that it isn't as widely supported by tools as GTF. For example, in the HTSeq docs, the usage statement reffers to GFF, but the FAQs refer to GTF, and the links, even in usage reffer to the definition of GFF2, not GFF3. . featureCounts requires GTF not GFF.

ADD REPLY
1
Entering edit mode

That spec shows how poorly defined GFF is for holding gene information. GFF is very flexible, which is the exact opposite of what's needed. Where should one find gene IDs and names? Are they the IDs (presumably)? Half the time with GFF files encountered in the wild those are just meaningless running numbers with actual IDs elsewhere in a custom field. That's all clear in a GTF file.

Is the human annotation better in its GTF version?

In that it's clear where to find information and tools actually support it, yes.

ADD REPLY
0
Entering edit mode

Thank you both. I know the GFF3 flexibility is a nightmare. Especially because too many people don't seem to care much about the (very) long specs.

Part of my job is integration of genomes and their annotation into structured databases, and from that perspective GTF isn't any better (for example resolving gene->transcript->protein->exon structures with multiple exons, splice variants and sequences)

ADD REPLY
2
Entering edit mode
4.9 years ago

Its not true that all the information necessary to build the gff is present in the fasta header. The fasta header only contains the start/end coordinates of the transcript, but not the start/end of each exon, which is necessary to build the gff. It simply can't be done.

HOWEVER, if you have mapped to the transcriptome, then you shouldn't use HTSeq to analyse the results. When you map against the transcriptome, the mapper treats each transcript as a separate contig. Thus a read will be annotated as coming from

Contig              start
---------           -----
ENST00000455382     100

HTSeq uses a GFF file which says that where a gene is on the genome. HTSeq takes reads mapped to the genome (not transcriptome) and calculates if they over lap with any genes. So for example, when mapped to the genome, the above read would be

Contig              start
---------           -----
7                   142301024

HTSeq would look at this read, see that ENST00000455382 spans 142300924 to 142301432, figure out that 142301024 is between these limits and add one to the count of ENST00000455382 (actaully this is a simplifcation, as it would only do so if the read only overlapped the exonic regions of ENST00000455382 ).

Your BAM file from mapping to the trancriptome has no concept of genome position, and so the excerise is pointless.

To get counts from a transcriptome mapping you have two options.

You can extract the number of reads that map to each contig (in this case transcript) using samtools idxstats. However the problem with this is that almost all genes have multiple transcripts, thus almost all your reads will map to more than one place in your transcriptome. You can iether allow this, and double count these reads, or not allow it, and only count a small fraction of your reads.

Alternatively, use a tool designed for quantifying against the transcriptome. The obvious choice here would be either salmon in alignment processing mode, or RSEM.

ADD COMMENT

Login before adding your answer.

Traffic: 1936 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6