I Need A Script For Storing The Output Obtained By Gene Prediction Tool Getorf Into Excel Sheet
4
0
Entering edit mode
10.6 years ago
genomelover ▴ 50

Hi to all, I have a output obtained by geneprediction tool getORF which contains the protein sequences along with the FASTA header and I have many such files around 2GB and I need to compile the output of these files into an excel sheet containing two columns one with contain FASTA header information and the other containing the protein sequences.The output of these getorf is pasted below.

>Mid_Vagina_WUGI_1_1 [1 - 69] 
LLFSVIPQTLCTFLYYLSTKPQK
>Mid_Vagina_WUGI_1_2 [3 - 101] 
AILSYSPNPLYISVLPQYQTSKIIHFYYLQHMS
>Mid_Vagina_WUGI_1_3 [2 - 124] 
CYSQLFPKPSVHFCTTSVPNLKNNPLLLSTTYVLEAFLCFH
>Mid_Vagina_WUGI_1_4 [105 - 149] 
RPFCVFINVNSNKDS
>Mid_Vagina_WUGI_1_5 [73 - 159] 
STSIIYNICLRGLFVFSLMLIATKTADFY
>Mid_Vagina_WUGI_1_6 [137 - 187] 
QQRQLTFIECLLCVLVF
>Mid_Vagina_WUGI_1_7 [163 - 195] 
VPTMCSCVLSF
>Mid_Vagina_WUGI_1_8 [153 - 212] 
LLLSAYYVFLCSEFLMNKLG
>Mid_Vagina_WUGI_1_9 [216 - 263]

The output by getORF is pasted here,I need a script which creates an excel table containing two columns

>Mid_Vagina_WUGI_1_1 [1 - 69] 
LLFSVIPQTLCTFLYYLSTKPQK
>Mid_Vagina_WUGI_1_2 [3 - 101] 
AILSYSPNPLYISVLPQYQTSKIIHFYYLQHMS
>Mid_Vagina_WUGI_1_3 [2 - 124] 
CYSQLFPKPSVHFCTTSVPNLKNNPLLLSTTYVLEAFLCFH
>Mid_Vagina_WUGI_1_4 [105 - 149] 
RPFCVFINVNSNKDS
>Mid_Vagina_WUGI_1_5 [73 - 159] 
STSIIYNICLRGLFVFSLMLIATKTADFY
>Mid_Vagina_WUGI_1_6 [137 - 187] 
QQRQLTFIECLLCVLVF
>Mid_Vagina_WUGI_1_7 [163 - 195] 
VPTMCSCVLSF
>Mid_Vagina_WUGI_1_8 [153 - 212] 
LLLSAYYVFLCSEFLMNKLG
>Mid_Vagina_WUGI_1_9 [216 - 263] 
FLQKKLRTIILLILQV
>Mid_Vagina_WUGI_1_10 [206 - 277] 
TWLILTKKATYNYSSHFTSVEKEV
>Mid_Vagina_WUGI_1_11 [202 - 288] 
INLVNSYKKSYVQLFFSFYKCRKGGMMLT
>Mid_Vagina_WUGI_1_12 [267 - 359] 
KRRYDAYLSSLIKLLLPSHTTKKVVSHQWYQ
>Mid_Vagina_WUGI_1_13 [332 - 388] 
KGGFPSMVSVRRKARVFIS
>Mid_Vagina_WUGI_1_14 [363 - 392] 
GGKQEFSFHN
>Mid_Vagina_WUGI_1_15 [396 - 425] 
VLFCGTFSQK
>Mid_Vagina_WUGI_1_16 [392 - 487] 
LSPVLWHIFSEITWRYFLHTNQTWGKDSWKIY
>Mid_Vagina_WUGI_1_17 [301 - 501] 
LSCSSQAIPLKRWFPINGISEEESKSFHFITKSCFVAHFLRNNLEIFSTYQPNVGKGLME
NLLDFLI
>Mid_Vagina_WUGI_1_18 [429 - 509] 
LGDIFYIPTKRGERTHGKFIRFPDLIN
>Mid_Vagina_WUGI_1_19 [508 - 537] 
IKTSNCNYCQ
>Mid_Vagina_WUGI_1_20 [500 - 550] 
SDKLKQAIVITANDRTS
>Mid_Vagina_WUGI_1_21 [528 - 575] 
LLPMTELPDQNQYLSQ
>Mid_Vagina_WUGI_1_22 [541 - 579] 
QNFLTKINICLNE
>Mid_Vagina_WUGI_1_23 [554 - 601] 
PKSISVSMSSVWLFSA
>Mid_Vagina_WUGI_1_24 [579 - 626] 
VVFGYSQPNQDKQKPG
>Mid_Vagina_WUGI_1_25 [583 - 708] 
CLAILSLIKTSRNLDDISSGRSGSKKPSASSGPRDSYNLVHL
>Mid_Vagina_WUGI_1_26 [605 - 751] 
SRQAETWMTSAVVGQGQKSLQLPQAPETPTIWFTFKEKSPKSINCVINC
>Mid_Vagina_WUGI_1_27 [755 - 793] 
RKFCDCFSHFLCA
>Mid_Vagina_WUGI_1_28 [769 - 819] 
LFLSFSVCVDISSLMLR
>Mid_Vagina_WUGI_1_29 [823 - 855] 
SIIKVAKYNGK
>Mid_Vagina_WUGI_1_30 [818 - 862] 
DNQSSKLQNTMGNRS
>Mid_Vagina_WUGI_1_31 [859 - 897] 
KLIFFLSSIMLGY
>Mid_Vagina_WUGI_1_32 [630 - 944] 
HQQWSVRVKKAFSFLRPQRLLQFGSPLKKRAPNPLTVLLTARESSVTVSLIFCVRRYFIF
DAEIINHQSCKIQWEIEVDFFFKQHHAGILRPPQFTGFAFQSKVA
>Mid_Vagina_WUGI_1_33 [878 - 979] 
AASCWDTEAPSVHRFCFSIKSGLSHHQAIYLLLF
>Mid_Vagina_WUGI_1_34 [901 - 1023] 
GPLSSQVLLFNQKWPESSPSNLSTFILIKSSNISSPLPKPA
>Mid_Vagina_WUGI_1_35 [986 - 1024] 
NPQIYRLLYPSQH
>Mid_Vagina_WUGI_1_36 [948 - 1025] 
VITKQFIYFYFNKILKYIVSSTQAST
>Mid_Vagina_WUGI_1_37 [1024 - 977] (REVERSE SENSE) 
VLAWVEETIYLRILLK
>Mid_Vagina_WUGI_1_38 [1025 - 936] (REVERSE SENSE) 
GAGLGRGDDIFEDFIKIKVDKLLGDDSGHF
>Mid_Vagina_WUGI_1_39 [967 - 932] (REVERSE SENSE) 
INCLVMTQATFD
>Mid_Vagina_WUGI_1_40 [948 - 916] (REVERSE SENSE) 
LRPLLIEKQNL
>Mid_Vagina_WUGI_1_41 [912 - 877] (REVERSE SENSE) 
TEGASVSQHDAA
>Mid_Vagina_WUGI_1_42 [883 - 833] (REVERSE SENSE) 
CCLKKKSTSISHCILQL
>Mid_Vagina_WUGI_1_43 [873 - 826] (REVERSE SENSE) 
KKNQLLFPIVFCNFDD
>Mid_Vagina_WUGI_1_44 [932 - 747] (REVERSE SENSE) 
LKSKTCELRGPQYPSMMLLKKKINFYFPLYFATLMIDYLSIKDEISTHTENERNSHRTFS
SS
>Mid_Vagina_WUGI_1_45 [750 - 676] (REVERSE SENSE) 
QLITQLMDLGLFSLKVNQIVGVSGA
>Mid_Vagina_WUGI_1_46 [826 - 635] (REVERSE SENSE) 
LIISASKMKYLRTQKMRETVTELSLAVNNTVNGFGALFFKGEPNCRSLWGLRKLKAFLTL
TDHC
>Mid_Vagina_WUGI_1_47 [645 - 604] (REVERSE SENSE) 
PTTADVIQVSACLD
>Mid_Vagina_WUGI_1_48 [689 - 531] (REVERSE SENSE) 
ESLGPEEAEGFFDPDRPLLMSSRFLLVLIRLRIAKHYSLRQILILVRKFCHWQ
>Mid_Vagina_WUGI_1_49 [559 - 512] (REVERSE SENSE) 
FWSGSSVIGSNYNCLF
>Mid_Vagina_WUGI_1_50 [600 - 490] (REVERSE SENSE) 
AENSQTLLIETDIDFGQEVLSLAVITIACFNLSDQEI
>Mid_Vagina_WUGI_1_51 [508 - 476] (REVERSE SENSE) 
FIRSGNLINFP
>Mid_Vagina_WUGI_1_52 [527 - 384] (REVERSE SENSE) 
LQLLVLIYQIRKSNKFSMSPFPTFGWYVENISKLFLRKCATKQDLVMK
>Mid_Vagina_WUGI_1_53 [483 - 331] (REVERSE SENSE) 
IFHESFPHVWLVCRKYLQVISEKMCHKTGLSYEMKTLAFLLTDTIDGKPPF
>Mid_Vagina_WUGI_1_54 [380 - 291] (REVERSE SENSE) 
KLLLSSSLIPLMGNHLFSGMAWEEQLNQAA
>Mid_Vagina_WUGI_1_55 [346 - 287] (REVERSE SENSE) 
WETTFLVVWLGRSNLIKLLK
>Mid_Vagina_WUGI_1_56 [287 - 258] (REVERSE SENSE) 
VSIIPPFLHL
>Mid_Vagina_WUGI_1_57 [300 - 253] (REVERSE SENSE) 
SSCLSKHHTSFSTLVK
>Mid_Vagina_WUGI_1_58 [283 - 215] (REVERSE SENSE) 
ASYLLFYTCKMRRIIVRSFFCKN
>Mid_Vagina_WUGI_1_59 [211 - 173] (REVERSE SENSE) 
PSLFIKNSEHKNT
>Mid_Vagina_WUGI_1_60 [194 - 159] (REVERSE SENSE) 
KLRTQEHIVGTQ
>Mid_Vagina_WUGI_1_61 [240 - 133] (REVERSE SENSE) 
LYVAFFVRINQVYSLKTQNTRTHSRHSIKVSCLCCY
>Mid_Vagina_WUGI_1_62 [166 - 104] (REVERSE SENSE) 
ALNKSQLSLLLLTLMKTQKGL
>Mid_Vagina_WUGI_1_63 [155 - 90] (REVERSE SENSE) 
KSAVFVAININENTKRPLRHML
>Mid_Vagina_WUGI_1_64 [49 - 17] (REVERSE SENSE) 
GSTEMYRGFGE
>Mid_Vagina_WUGI_1_65 [120 - 13] (REVERSE SENSE) 
KHKKASKTYVVDNRSGLFLRFGTEVVQKCTEGLGNN
>Mid_Vagina_WUGI_1_66 [44 - 3] (REVERSE SENSE) 
YRNVQRVWGITENS
>Mid_Vagina_WUGI_2_1 [3 - 47] 
CHSGYLGIGLSLYLV
>Mid_Vagina_WUGI_2_2 [28 - 57] 
VCLCILSESC
>Mid_Vagina_WUGI_2_3 [61 - 96] 
ALSMETFKICFG
>Mid_Vagina_WUGI_2_4 [2 - 100] 
MSLRVSWYRSVSVSCLNLVEPFQWKHSRFALGK
• 3.2k views
ADD COMMENT
1
Entering edit mode

So you basically just need to replace the right-most space with a comma or tab so you can load it as a csv, or do you need something in addition?

FYI, Excel has an ~1 million row limit, so you might be over that.

Edit: thanks to whomever reformatted this!

ADD REPLY
6
Entering edit mode
10.6 years ago

just linearize your fasta. And please, don't use excel.

awk 'BEGIN{first=0;} /^>/{if(first!=0) printf("\n"); printf("%s\t",$0); first=1;next;} {printf("%s",$0);} END {printf("\n");} ' < in.fa > out.tsv

PS: please, don't use exel

ADD COMMENT
2
Entering edit mode

and don't use excel

ADD REPLY
2
Entering edit mode

"don't use excel", words to live by!

ADD REPLY
0
Entering edit mode

A big chunk scientists will die up if we made an obligation of not using excel !! :) :)

ADD REPLY
0
Entering edit mode

"God will know His own" :-)

ADD REPLY
0
Entering edit mode

More like can't use Excel - this data looks like it will exceed the limit of max number of rows in an Excel sheet!

ADD REPLY
1
Entering edit mode

Just wanted to add, don't use Excel. I'm guessing that you are not using any clever code, specific to Excel, to perform any analysis. You just want 2 columns.

ADD REPLY
0
Entering edit mode
10.6 years ago
always_learning ★ 1.1k

Easiest way

1) Vi filename

2) Press Esc

3) :s/\s/\t/g

4) Press enter

5) Try to open same file with xl

It will work !! let me know if any issue arises...

ADD COMMENT
0
Entering edit mode

That actually won't work as desired. Some of the rows would be 3 columns, others 4, so there would be a mismatch.

ADD REPLY
0
Entering edit mode

Yes, I think so also but unless he will reply, I am not sure about format yet :) :)

ADD REPLY
0
Entering edit mode

Yeah, a lack of details always makes life interesting :)

ADD REPLY
0
Entering edit mode
10.6 years ago

Assuming that the current format of your input is correct (i.e., you just want the first part of the fasta header and the sequence can wrap across lines), then the following C program should work for you. There are more elegant ways to do this and this isn't exactly a bullet-proof program, but it should suffice:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#define MAXLINE 1024

void usage(char *prog) {
    printf("Usage: %s input_file.fasta output_file.csv\n", prog);
}

int main(int argc, char *argv[]) {
    FILE *ifile = NULL;
    FILE *ofile = NULL;
    char *line = malloc(sizeof(char) * MAXLINE);
    char *p;
    int i;

    for(i=1; i<argc; i++) {
        if(strcmp(argv[i], "-h") == 0) {
            usage(argv[0]);
            return 0;
        } else if(ifile == NULL) {
            ifile = fopen(argv[i], "r");
        } else if(ofile == NULL) {
            ofile = fopen(argv[i], "w");
        } else {
            usage(argv[0]);
            return -1;
        }
    }

    if(ifile == NULL || ofile == NULL) {
        usage(argv[0]);
        return -1; 
    }

    //read in the first line
    fgets(line, MAXLINE, ifile);
    p = strchr(line, ' ');
    *p = '\0';
    p = line + 1;
    fprintf(ofile, "%s,", p);

    while(1) {
        if(fgets(line, MAXLINE, ifile) == NULL) break;
        if(line[0] == '>') {
            p = strchr(line, ' ');
            *p = '\0';
            p = line + 1;
            fprintf(ofile, "\n%s,", p);
        } else {
            //Trim the line break if needed
            if(line[strlen(line)-1] == '\n') {
                line[strlen(line)-1] = '\0';
            }
            fprintf(ofile, "%s", line);
        }
    }
    fprintf(ofile, "\n");

    fclose(ifile);
    fclose(ofile);
    return 0;
}

Something like that should work for you.

ADD COMMENT
0
Entering edit mode

1) Vi filename

2) Press Esc

3) :s/\s/\t/

4) Press enter

5) Try to open same file with xl

Note: Remove the g from step 3 will also work here !! so its like :s/\s/\t/ in place of :s/\s/\t/g

ADD REPLY
0
Entering edit mode

so complicated ! :-)

ADD REPLY
0
Entering edit mode
10.6 years ago
Clafooty • 0

A convenient way to express differential treatments based on line position is to use the modulus of division on the cursor NR :

awk ' BEGIN {buf=""} ((NR % 2) == 1) { buf=$0 } ((NR % 2) == 0) {print buf "\t" $0}' < test.txt

And by the way every time you run excel, god kill a kitten ^^

ADD COMMENT
0
Entering edit mode

that won't work with the following line:

>Mid_Vagina_WUGI_1_32 [630 - 944] 
HQQWSVRVKKAFSFLRPQRLLQFGSPLKKRAPNPLTVLLTARESSVTVSLIFCVRRYFIF
DAEIINHQSCKIQWEIEVDFFFKQHHAGILRPPQFTGFAFQSKVA

or you can just use paste - - < test.txt

ADD REPLY
0
Entering edit mode

True! it's sad but real proteins may be composed by more than 60 AA ^^ anyway, next time i will think before writing ;-) Why not using :

cat test.txt | tr -d "\n" | tr ">" "\n" | awk '{bak = $NF ; $NF="" ; print $0 "\t" bak}'
ADD REPLY

Login before adding your answer.

Traffic: 2718 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6