Entering edit mode
5.9 years ago
cgias
▴
20
Hello,
So when I finally made all the other threads I posted here work, I was told to change my genome & annotation file.... as an intern that I am, here we go again!
I am having some problems on creating the index with my files from NCBI.
The command I am using is
STAR --runMode genomeGenerate --runThreadN 6 --genomeDir /home/carolgs/Cpapaya/STARindex2 --genomeFastaFiles GCA_000150535.1_Papaya1.0_genomic.fna --sjdbGTFfile ref_Papaya1.0_top_level.gff3 --sjdbGTFtagExonParentTranscript Parent --sjdbOverhang 49
I've tried it after converting to gtf
STAR --runMode genomeGenerate --runThreadN 6 --genomeDir /home/carolgs/Cpapaya/STARindex2 --genomeFastaFiles GCA_000150535.1_Papaya1.0_genomic.fna --sjdbGTFfile ref_Papaya1.0_top_level.gtf --sjdbOverhang 49
But I'm still getting the same Log:
[...]
May 09 12:11:42 ... completed Suffix Array index
May 09 12:11:42 ..... processing annotations GTF
Fatal INPUT FILE error, no valid exon lines in the GTF file: ref_Papaya1.0_top_level.gtf
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.
May 09 12:11:44 ...... FATAL ERROR, exiting
Here is the first lines from my gff file:
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build Papaya1.0
#!genome-build-accession NCBI_Assembly:GCF_000150535.2
#!annotation-date 4 August 2017
#!annotation-source NCBI Carica papaya Annotation Release 100
##sequence-region NW_019011177.1 1 833
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3649
NW_019011177.1 RefSeq region 1 833 . + . ID=id0;Dbxref=taxon:3649;Name=LG1;chromosome=LG1;cultivar=SunUp;gbkey=Src;genome=genomic;map=unlocalized;mol_type=genomic DNA;note=transgenic for the papaya ringspot virus (PRSV) coat protein%2C which confers resistance to PRSV infection
##sequence-region NW_019011178.1 1 1200
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3649
NW_019011178.1 RefSeq region 1 1200 . + . ID=id1;Dbxref=taxon:3649;Name=LG1;chromosome=LG1;cultivar=SunUp;gbkey=Src;genome=genomic;map=unlocalized;mol_type=genomic DNA;note=transgenic for the papaya ringspot virus (PRSV) coat protein%2C which confers resistance to PRSV infection
##sequence-region NW_019011179.1 1 1384
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3649
NW_019011179.1 RefSeq region 1 1384 . + . ID=id2;Dbxref=taxon:3649;Name=LG1;chromosome=LG1;cultivar=SunUp;gbkey=Src;genome=genomic;map=unlocalized;mol_type=genomic DNA;note=transgenic for the papaya ringspot virus (PRSV) coat protein%2C which confers resistance to PRSV infection
NW_019011179.1 Gnomon gene 5 1378 . - . ID=gene0;Dbxref=GeneID:110806311;Name=LOC110806311;gbkey=Gene;gene=LOC110806311;gene_biotype=protein_coding;partial=true;start_range=.,5
NW_019011179.1 Gnomon mRNA 5 1378 . - . ID=rna0;Parent=gene0;Dbxref=GeneID:110806311,Genbank:XM_022034873.1;Name=XM_022034873.1;gbkey=mRNA;gene=LOC110806311;model_evidence=Supporting evidence includes similarity to: 9 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 6 samples with support for all annotated introns;partial=true;product=IAA-amino acid hydrolase ILR1-like;start_range=.,5;transcript_id=XM_022034873.1
NW_019011179.1 Gnomon exon 1314 1378 . - . ID=id3;Parent=rna0;Dbxref=GeneID:110806311,Genbank:XM_022034873.1;gbkey=mRNA;gene=LOC110806311;partial=true;product=IAA-amino acid hydrolase ILR1-like;transcript_id=XM_022034873.1
NW_019011179.1 Gnomon exon 741 1055 . - . ID=id4;Parent=rna0;Dbxref=GeneID:110806311,Genbank:XM_022034873.1;gbkey=mRNA;gene=LOC110806311;partial=true;product=IAA-amino acid hydrolase ILR1-like;transcript_id=XM_022034873.1
NW_019011179.1 Gnomon exon 449 577 . - . ID=id5;Parent=rna0;Dbxref=GeneID:110806311,Genbank:XM_022034873.1;gbkey=mRNA;gene=LOC110806311;partial=true;product=IAA-amino acid hydrolase ILR1-like;transcript_id=XM_022034873.1
NW_019011179.1 Gnomon exon 5 328 . - . ID=id6;Parent=rna0;Dbxref=GeneID:110806311,Genbank:XM_022034873.1;gbkey=mRNA;gene=LOC110806311;partial=true;product=IAA-amino acid hydrolase ILR1-like;start_range=.,5;transcript_id=XM_022034873.1
NW_019011179.1 Gnomon CDS 1314 1364 . - 0 ID=cds0;Parent=rna0;Dbxref=GeneID:110806311,Genbank:XP_021890565.1;Name=XP_021890565.1;gbkey=CDS;gene=LOC110806311;partial=true;product=IAA-amino acid hydrolase ILR1-like;protein_id=XP_021890565.1
NW_019011179.1 Gnomon CDS 741 1055 . - 0 ID=cds0;Parent=rna0;Dbxref=GeneID:110806311,Genbank:XP_021890565.1;Name=XP_021890565.1;gbkey=CDS;gene=LOC110806311;partial=true;product=IAA-amino acid hydrolase ILR1-like;protein_id=XP_021890565.1
NW_019011179.1 Gnomon CDS 449 577 . - 0 ID=cds0;Parent=rna0;Dbxref=GeneID:110806311,Genbank:XP_021890565.1;Name=XP_021890565.1;gbkey=CDS;gene=LOC110806311;partial=true;product=IAA-amino acid hydrolase ILR1-like;protein_id=XP_021890565.1
NW_019011179.1 Gnomon CDS 5 328 . - 0 ID=cds0;Parent=rna0;Dbxref=GeneID:110806311,Genbank:XP_021890565.1;Name=XP_021890565.1;gbkey=CDS;gene=LOC110806311;partial=true;product=IAA-amino acid hydrolase ILR1-like;protein_id=XP_021890565.1;start_range=.,5
##sequence-region NW_019011180.1 1 1405
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3649
NW_019011180.1 RefSeq region 1 1405 . + . ID=id7;Dbxref=taxon:3649;Name=LG1;chromosome=LG1;cultivar=SunUp;gbkey=Src;genome=genomic;map=unlocalized;mol_type=genomic DNA;note=transgenic for the papaya ringspot virus (PRSV) coat protein%2C which confers resistance to PRSV infection
##sequence-region NW_019011181.1 1 1505
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3649
NW_019011181.1 RefSeq region 1 1505 . + . ID=id8;Dbxref=taxon:3649;Name=LG1;chromosome=LG1;cultivar=SunUp;gbkey=Src;genome=genomic;map=unlocalized;mol_type=genomic DNA;note=transgenic for the papaya ringspot virus (PRSV) coat protein%2C which confers resistance to PRSV infection
##sequence-region NW_019011182.1 1 1521
And the converted to gtf file:
NC_010323.1 RefSeq exon 63 137 . - . transcript_id "rna30121"; gene_id "gene20201"; gene_name "trnH-GUG";
NC_010323.1 RefSeq CDS 592 1653 . - 0 transcript_id "gene20202"; gene_id "gene20202"; gene_name "psbA";
NC_010323.1 RefSeq exon 1924 1959 . - . transcript_id "gene20203"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq exon 4534 4569 . - . transcript_id "gene20203"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq exon 1924 1958 . - . transcript_id "rna30122"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq exon 4533 4569 . - . transcript_id "rna30122"; gene_id "gene20203"; gene_name "trnK-UUU";
NC_010323.1 RefSeq CDS 2266 3786 . - 0 transcript_id "gene20204"; gene_id "gene20204"; gene_name "matK";
NC_010323.1 RefSeq exon 5529 5755 . - . transcript_id "gene20205"; gene_id "gene20205"; gene_name "rps16";
NC_010323.1 RefSeq exon 6639 6678 . - . transcript_id "gene20205"; gene_id "gene20205"; gene_name "rps16";
NC_010323.1 RefSeq CDS 5529 5755 . - 2 transcript_id "gene20205"; gene_id "gene20205"; gene_name "rps16";
NC_010323.1 RefSeq CDS 6639 6678 . - 0 transcript_id "gene20205"; gene_id "gene20205"; gene_name "rps16";
NC_010323.1 RefSeq exon 7344 7415 . - . transcript_id "rna30123"; gene_id "gene20206"; gene_name "trnQ-UUG";
NC_010323.1 RefSeq CDS 7781 7966 . + 0 transcript_id "gene20207"; gene_id "gene20207"; gene_name "psbK";
NC_010323.1 RefSeq CDS 8408 8518 . + 0 transcript_id "gene20208"; gene_id "gene20208"; gene_name "psbI";
NC_010323.1 RefSeq exon 8617 8704 . - . transcript_id "rna30124"; gene_id "gene20209"; gene_name "trnS-GCU";
NC_010323.1 RefSeq exon 9669 9691 . + . transcript_id "gene20210"; gene_id "gene20210"; gene_name "trnG-UCC";
NC_010323.1 RefSeq exon 10427 10475 . + . transcript_id "gene20210"; gene_id "gene20210"; gene_name "trnG-UCC";
NC_010323.1 RefSeq exon 9669 9691 . + . transcript_id "rna30125"; gene_id "gene20210"; gene_name "trnG-UCC";
NC_010323.1 RefSeq exon 10427 10475 . + . transcript_id "rna30125"; gene_id "gene20210"; gene_name "trnG-UCC";
NC_010323.1 RefSeq exon 10658 10729 . + . transcript_id "rna30126"; gene_id "gene20211"; gene_name "trnR-UCU";
NC_010323.1 RefSeq CDS 11031 12548 . - 0 transcript_id "gene20212"; gene_id "gene20212"; gene_name "atpA";
NC_010323.1 RefSeq exon 12614 13023 . - . transcript_id "gene20213"; gene_id "gene20213"; gene_name "atpF";
NC_010323.1 RefSeq exon 13790 13934 . - . transcript_id "gene20213"; gene_id "gene20213"; gene_name "atpF";
NC_010323.1 RefSeq CDS 12614 13023 . - 2 transcript_id "gene20213"; gene_id "gene20213"; gene_name "atpF";
NC_010323.1 RefSeq CDS 13790 13934 . - 0 transcript_id "gene20213"; gene_id "gene20213"; gene_name "atpF";
NC_010323.1 RefSeq CDS 14500 14745 . - 0 transcript_id "gene20214"; gene_id "gene20214"; gene_name "atpH";
NC_010323.1 RefSeq CDS 15945 16688 . - 0 transcript_id "gene20215"; gene_id "gene20215"; gene_name "atpI";
NC_010323.1 RefSeq CDS 16922 17632 . - 0 transcript_id "gene20216"; gene_id "gene20216"; gene_name "rps2";
So clearly I do have exons lines in both files. I can't see where is the problem.
Thanks in advance,