RNA-seq using rasflow - issues with annotation file
0
0
Entering edit mode
18 months ago
cdeantoneo31 ▴ 20

Hi everyone!

I have used rasflow in the past to process and analyze RNA-sequencing data in human and mouse cells, but now I am trying to do the same with a viral genome and seem to be running into issues with the annotation file.

the genome and annotation I'm using is available here, and the gff looks as below

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM629838v1
#!genome-build-accession NCBI_Assembly:GCA_006298385.1
##sequence-region HM159613.1 1 3860
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10886
HM159613.1  Genbank region  1   3860    .   +   .   ID=HM159613.1:1..3860;Dbxref=taxon:10886;Name=L1;collected-by=Bernard N. Fields;collection-date=Jan-2002;country=USA;gbkey=Src;isolate=Dearing;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Mammalian Orthoreovirus 3 Strain Dearing;segment=L1;serotype=3;strain=T3D
HM159613.1  Genbank gene    25  3828    .   +   .   ID=gene-L1;Name=L1;gbkey=Gene;gene=L1;gene_biotype=protein_coding
HM159613.1  Genbank CDS 25  3828    .   +   0   ID=cds-ADJ00316.1;Parent=gene-L1;Dbxref=NCBI_GP:ADJ00316.1;Name=ADJ00316.1;gbkey=CDS;gene=L1;product=lambda-3 protein;protein_id=ADJ00316.1
##sequence-region HM159614.1 1 3916
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10886
HM159614.1  Genbank region  1   3916    .   +   .   ID=HM159614.1:1..3916;Dbxref=taxon:10886;Name=L2;collected-by=Bernard N. Fields;collection-date=Jan-2002;country=USA;gbkey=Src;isolate=Dearing;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Mammalian Orthoreovirus 3 Strain Dearing;segment=L2;serotype=3;strain=T3D
HM159614.1  Genbank gene    14  3883    .   +   .   ID=gene-L2;Name=L2;gbkey=Gene;gene=L2;gene_biotype=protein_coding
HM159614.1  Genbank CDS 14  3883    .   +   0   ID=cds-ADJ00317.1;Parent=gene-L2;Dbxref=NCBI_GP:ADJ00317.1;Name=ADJ00317.1;gbkey=CDS;gene=L2;product=lambda-2 protein;protein_id=ADJ00317.1
##sequence-region HM159615.1 1 3901
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10886
HM159615.1  Genbank region  1   3901    .   +   .   ID=HM159615.1:1..3901;Dbxref=taxon:10886;Name=L3;collected-by=Bernard N. Fields;collection-date=Jan-2002;country=USA;gbkey=Src;isolate=Dearing;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Mammalian Orthoreovirus 3 Strain Dearing;segment=L3;serotype=3;strain=T3D
HM159615.1  Genbank gene    14  3841    .   +   .   ID=gene-L3;Name=L3;gbkey=Gene;gene=L3;gene_biotype=protein_coding
HM159615.1  Genbank CDS 14  3841    .   +   0   ID=cds-ADJ00318.1;Parent=gene-L3;Dbxref=NCBI_GP:ADJ00318.1;Name=ADJ00318.1;gbkey=CDS;gene=L3;product=lambda-1 protein;protein_id=ADJ00318.1

When I run rasflow, I get the following error

       ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
      v1.6.4

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           P D2.sort.bam                                    ||
||                                                                            ||
||             Output file : D2_count.tsv                                     ||
||                 Summary : D2_count.tsv.summary                             ||
||              Annotation : GCA_006298385.1_ASM629838v1_genomic.gff (GTF)    ||
||      Dir for temp files : output/input_v_IP/genome/countFile               ||
||                                                                            ||
||                 Threads : 24                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file GCA_006298385.1_ASM629838v1_genomic.gff ...           ||
ERROR: no features were loaded in format GTF. The annotation format can be specified by the '-F' option, and the required feature type can be specified by the '-t' option..
The porgram has to terminate.

[Thu Oct  6 17:30:51 2022]
Error in rule featureCount:
    jobid: 0
    output: output/input_v_IP/genome/countFile/D2_count.tsv, output/input_v_IP/genome/countFile/D2_count.tsv.summary

RuleException:
CalledProcessError in line 109 of /home/carly/Desktop/RASflow/workflow/align_count_genome.rules:
Command ' set -euo pipefail;  featureCounts -p -T 24 -t exon -g region -a ref/annotation/GCA_006298385.1_ASM629838v1_genomic.gff -o output/input_v_IP/genome/countFile/D2_count.tsv data/output/input_v_IP/genome/bamFileSort/D2.sort.bam && tail -n +3 output/input_v_IP/genome/countFile/D2_count.tsv | cut -f1,7 > temp.D2 && mv temp.D2 output/input_v_IP/genome/countFile/D2_count.tsv ' returned non-zero exit status 255.
  File "/home/carly/Desktop/RASflow/workflow/align_count_genome.rules", line 109, in __rule_featureCount
  File "/home/carly/miniconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message

I've tried using gff read to convert it to a gtf, and get the same error. This is what the converted file looks like

HM159613.1  Genbank transcript  25  3828    .   +   .   transcript_id "gene-L1"; gene_id "gene-L1"; gene_name "L1"
HM159613.1  Genbank CDS 25  3828    .   +   0   transcript_id "gene-L1"; gene_name "L1";
HM159614.1  Genbank transcript  14  3883    .   +   .   transcript_id "gene-L2"; gene_id "gene-L2"; gene_name "L2"
HM159614.1  Genbank CDS 14  3883    .   +   0   transcript_id "gene-L2"; gene_name "L2";
HM159615.1  Genbank transcript  14  3841    .   +   .   transcript_id "gene-L3"; gene_id "gene-L3"; gene_name "L3"
HM159615.1  Genbank CDS 14  3841    .   +   0   transcript_id "gene-L3"; gene_name "L3";
HM159616.1  Genbank transcript  14  2224    .   +   .   transcript_id "gene-M1"; gene_id "gene-M1"; gene_name "M1"
HM159616.1  Genbank CDS 14  2224    .   +   0   transcript_id "gene-M1"; gene_name "M1";
HM159617.1  Genbank transcript  31  2160    .   +   .   transcript_id "gene-M2"; gene_id "gene-M2"; gene_name "M2"
HM159617.1  Genbank CDS 31  2160    .   +   0   transcript_id "gene-M2"; gene_name "M2";
HM159618.1  Genbank transcript  19  2184    .   +   .   transcript_id "gene-M3"; gene_id "gene-M3"; gene_name "M3"
HM159618.1  Genbank CDS 19  2184    .   +   0   transcript_id "gene-M3"; gene_name "M3";
HM159619.1  Genbank transcript  13  1380    .   +   .   transcript_id "gene-S1"; gene_id "gene-S1"; gene_name "S1"
HM159619.1  Genbank CDS 13  1380    .   +   0   transcript_id "gene-S1"; gene_name "S1";
HM159620.1  Genbank transcript  19  1275    .   +   .   transcript_id "gene-S2"; gene_id "gene-S2"; gene_name "S2"
HM159620.1  Genbank CDS 19  1275    .   +   0   transcript_id "gene-S2"; gene_name "S2";
HM159621.1  Genbank transcript  28  1128    .   +   .   transcript_id "gene-S3"; gene_id "gene-S3"; gene_name "S3"
HM159621.1  Genbank CDS 28  1128    .   +   0   transcript_id "gene-S3"; gene_name "S3";
HM159622.1  Genbank transcript  33  1130    .   +   .   transcript_id "gene-S4"; gene_id "gene-S4"; gene_name "S4"
HM159622.1  Genbank CDS 33  1130    .   +   0   transcript_id "gene-S4"; gene_name "S4";

does anyone have a solution to this?

gtf rasflow gff RNA-seq • 437 views
ADD COMMENT
0
Entering edit mode

I've figured out how to fix this, and I want to put the solution down here for future readers who may have a similar issue.

I've gone into the align_count_genomes.rules file in workflow, and changed "exon" to CDS in the featurecounts area, so it now reads

rule featureCount:
    input:
        sort = intermediate_path + "/bamFileSort/{sample}.sort.bam",
        annotation = config["ANNOTATION"]
    output:
        count = final_path + "/countFile/{sample}_count.tsv",
        count_summary = final_path + "/countFile/{sample}_count.tsv.summary"
    run:
        if config["COUNTER"]=="featureCounts":
            if config["END"]=="pair":
                shell("featureCounts -p -T {config[NCORE]} -t CDS -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
            else:
                shell("featureCounts  -T {config[NCORE]} -t CDS -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
        elif config["COUNTER"]=="htseq-count":
            shell("htseq-count -f bam -i {config[ATTRIBUTE]} -s no -t CDS {input.sort} {input.annotation} | sed '/^__/ d' > {output.count}")

Based on the mouse gtf file ive used in the past, I am guessing the original was giving an error because the third column of the mouse gtf has "exon" while this one has only CDS or transcript. I think changing the code in the rules file fixed this. I was also able to just use the original gff file

ADD REPLY

Login before adding your answer.

Traffic: 2701 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6