Add a seq (the outgroup) to the top of its corresponding gene
2
0
Entering edit mode
10.5 years ago
hosseinv ▴ 20

I have a fasta file containing N sequences (outgroups). On the other hand, I have N fast files each having the orthologous gene sequences of several samples.

The idea is to put each of the outgroups at the top of their corresponding genes.

Example:

The outgroup looks like below, OUTGROUPS.fas

> GENE 1
TTAACTCCTGCTACTTTG
> GENE 2
TCTGTCGACGGCAACTGTGAAACTTATC
> GENE 3
GCACCCTGAGCCGAACTGAATTC
> GENE 4
GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA

Here is the fasta file of one of my genes, called GENE1.fas

>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG

I want to get the following:

> GENE 1
TTAACTCCTGCTACTTTG
>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG

This is done for GENE1, I need same job for the rest of the GENES. As you see, the name of entries in outgroup fasta file (GENE1, GENE2, ...) are the same as their orthologous genes (GENE1.fas, GENE2.fas ...).

I appreciate any help on this post.

Best, Hossein

python fasta concat perl • 4.7k views
ADD COMMENT
0
Entering edit mode

Lemme make sure I understand this correctly:

Each gene FASTA file will always have a name (GENE1.fas) which corresponds to its associated header in the OUTGROUPS.fas file (>GENE 1), correct?

Do you want all of the results in a single output file?

ADD REPLY
2
Entering edit mode
10.5 years ago
Dan D 7.4k

Assuming:

  • The fasta file for each gene will always have a name which corresponds to its associated header in the OUTGROUPS.fas file, minus the spaces, and with an fas extension
  • Your fas files are all in the same directory
  • You want all of the results in a single output file

Then:

  • Paste the following into a file and save it as any name you choose, followed by the extension .py
  • Place the file in the same directory with all the other .fas files and run it
seq_dict = {}

def populate_gene(gene):
    gene_file_name = gene + ".fas"
    gfn = open(gene_file_name, 'r')
    gene_file_data = gfn.readlines()
    gfn.close()
    return gene_file_data

with open('OUTGROUPS.fas', 'r') as og:
    for rawline in og:
        line = rawline.strip()
        if line.startswith('>'):
            gene = line[1:].replace(' ', '')
            seq_dict[gene] = {'raw_gene_name': line, 'head_seq': []}

            try:
                seq_dict[gene]['gene_parts'] = populate_gene(gene)

            except Exception:
                print "Wasn't able to open file for {0}.".format(gene)
                pass
        else:
            seq_dict[gene]['head_seq'].append(rawline)

for gene in seq_dict:
    if 'gene_parts' in seq_dict[gene]:
        with open('{0}_OUT.fas'.format(gene), 'w') as r:
            r.write(seq_dict[gene]['raw_gene_name'] + '\n')
            r.writelines(seq_dict[gene]['head_seq'])
            r.writelines(seq_dict[gene]['gene_parts'])
ADD COMMENT
0
Entering edit mode

details modified below.

ADD REPLY
0
Entering edit mode

Run the script like this:

python Deedee.py

Not like this:

./Deedee.py

I also updated the script to ignore missing gene files (since you implied that there wouldn't always be a corresponding file for a gene name.

ADD REPLY
0
Entering edit mode

Dear Deedee, I ran python Deedee.py this time with the new script you wrote and now I get the following error:

Traceback (most recent call last):
  File "Deedee2.py", line 22, in <module>
    print "Wasn't able to open file for {}.".format(gene)
ValueError: zero length field name in format
ADD REPLY
0
Entering edit mode

Oops, Sunday is not my best day for looking at code :)

Please try the updated script.

ADD REPLY
0
Entering edit mode

Looks we are moving, but still needs some corrections. This time the code was run and output the result.fas, plus some errors. for the sake of consistency I suggest we both use the following example files:

$ ls
GENE_1.fas GENE_3.fas OUTGROUPS.fas

$ cat GENE_1.fas
>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG

$ cat GENE_3.fas
>Ind_1
TTAAATCCGTCTGCTTCTTTG
>Ind_2
TTAAATCCTGGTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTGTCTTTG

$ cat OUTGROUP.fas
>GENE_1
TTAACTCCTGCTACTTTG
>GENE_2
TCTGTCGACGGCAACTGTGAAACTTATC
>GENE_3
GCACCCTGAGCCGAACTGAATTC
>GENE_4
GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA

As you see, OUTGROUPS.fas has four sequences (genes), whereas I have population data only for two genes (GENE_1.fas, and GENE_3.fas).

The important thing is I want the results INDIVIDUALLY for each of the two available genes. like following:

$ ls

GENE_1.fas GENE_3.fas  OUTGROUPS.fas GENE_1_OUT.fas GENE_3_OUT.fas

$ cat GENE_1_OUT.fas
>GENE_1
TTAACTCCTGCTACTTTG
>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG

$ cat GENE_3_OUT.fas
>GENE_3
GCACCCTGAGCCGAACTGAATTC
>Ind_1
TTAAATCCGTCTGCTTCTTTG
>Ind_2
TTAAATCCTGGTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTGTCTTTG

Thank you Deedee for your patience.

ADD REPLY
0
Entering edit mode

OK I edited the script. I didn't get a chance to test it but it wasn't a big change.

ADD REPLY
0
Entering edit mode

This time the code gave me two separate fasta files, GENE_1_OUT.fas and GENE_2_OUT.fas;

The first one is done perfectly, but the second one contains just the outgroup sequence (>GENE_2), plus I am expecting the second file to be GENE_3_OUT.fas not GENE_2_OUT.fas

ADD REPLY
1
Entering edit mode

Shifting requirements are fun! Code updated.

ADD REPLY
0
Entering edit mode

THANKS A LOT Deedee!!

The updated version of the code worked perfectly in my example data set. And I was wondering if I can shift one more time the requirement!!

Well the fact is that my gene name in my real data is not like GENE1 GENE2 ..., but instead they look like below:

EGLOB_v21_A_0010_147754_150523

So, the entry in the OUTGROUP.fas is like:

>EGLOB_v21_A_0010_147754_150523

and, the fasta file will be like:

EGLOB_v21_A_0010_147754_150523.fas

The code presumably does not accept these names, says:

Wasn't able to open file for EGLOB_v21_A_0010_147754_150523

They are thousands of genes so I don't want to change the names as GENE_1 and so forth.

Thanks again for all your help, Deedee.

ADD REPLY
0
Entering edit mode

Hmmm, that gene name should be okay.

Two quick questions:

  • Did it successfully process other files in the same run where you saw this error?
  • Can you verify that the file it claims is missing actually exists and has exactly the name EGLOB_v21_A_0010_147754_150523.fas?

Thanks!

ADD REPLY
0
Entering edit mode

Yes I checked it, the names exist and are correct.

I think its simply because of the names, as I changed the previous example files with this new name format and it did not work, whereas with names like GENE_1 ..., it works well.

ADD REPLY
0
Entering edit mode

The new name format on your genes and files shouldn't make any difference at all, as long as there are no spaces anywhere in the file name. The script removes any spaces it finds in the gene name. The script is trying to find EGLOB_v21_A_0010_147754_150523.fas and is not able to.

The script should be in the same directory as all of your FASTA files, and should be run from that directory. Similarly, all the FASTA files it's looking for, including OUTGROUPS.fas, should be in that directory.

ADD REPLY
0
Entering edit mode

Dear Dedee,

Nothing changed unless my gene names (other conditions remained same).

I tried a number of names like below:

EGLOB_v21_A_0010_147754_150523.fas
EGLOB_v21_A_0010.fas
EGLOB_v21_A.fas
EGLOB_v21.fas
EGLOB_v.fas
EGLOBv.fas
EGLOB_1.fas

and I finally found that ((only the last configuration** (EGLOB_1.fas) works, even EGLOBv.fas does not work.

Thank you, Hossein

ADD REPLY
0
Entering edit mode

Hmmm, I'd like to take a look at your actual files. Can you post the compressed files to Dropbox and PM a link to me?

ADD REPLY
0
Entering edit mode

Dear Deedee,

Here is a link to a subset of my data, compressed.

https://www.dropbox.com/s/sj7oqhje452k5z0/Deedee.7z

Thanks again for your time and patience.

Best,
Hossein

ADD REPLY
1
Entering edit mode

Thanks!

For some reason, I've no idea why, it seems that using the format() function in your filename string is causing the problem in the version of python you're using. I tried on three different test boxes and I was able to replicate your problem on the third, which was running an older version of Python 2.6. I updated the script accordingly. Let me know if it doesn't work.

ADD REPLY
0
Entering edit mode

Hi Deedee :)

DONE !

The version of mine is 2.6.6, but the updated script did the job perfectly.

I thank you a lot for all your help in the past days.

Best wishes,
Hossein

ADD REPLY
1
Entering edit mode
10.5 years ago

Alright, I quickly wrote the solution for you (assuming the FASTA files have the same name as the FASTA IDs and have an extension .fasta.

Say I begin with three files in the current folder:

$ ls
GENE_1.fasta  GENE_2.fasta  OUTGROUPS.fasta

The content of these files are:

$ cat OUTGROUPS.fasta
>GENE_1
TTAACTCCTGCTACTTTG
>GENE_2
TCTGTCGACGGCAACTGTGAAACTTATC
>GENE_3
GCACCCTGAGCCGAACTGAATTC
>GENE_4
GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA

$ cat GENE_1.fasta
>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG

$ cat GENE_2.fasta
>Ind_11
TTAAATCCGTCTGCTTCTTTG
>Ind_22
TTAAATCCTGGTGCTTCTTTG
>Ind_33
TTAAATCCTGCTTCTGTCTTTG

Then you can use the following one-liner to accomplish all this:

$ cat OUTGROUPS.fasta | perl -pe '/^>/?s/^>/\n>/:s/\s*$// if$.>1' | xargs -n 2 | while read -r -a Arr; do ( IFS=$'\n'; echo "${Arr[*]}" ); cat ${Arr[0]:1}".fasta" 2>/dev/null; done
>GENE_1
TTAACTCCTGCTACTTTG
>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG
>GENE_2
TCTGTCGACGGCAACTGTGAAACTTATC
>Ind_11
TTAAATCCGTCTGCTTCTTTG
>Ind_22
TTAAATCCTGGTGCTTCTTTG
>Ind_33
TTAAATCCTGCTTCTGTCTTTG
>GENE_3
GCACCCTGAGCCGAACTGAATTC
>GENE_4
GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA

Please note that perl -pe '/^>/?s/^>/\n>/:s/\s*$// if$.>1' is to linearize the OUTGROUPS.fasta file and if it is already linearized then you can do without this bit of code.

You may redirect the output by adding > GROUPS.fasta at the end of the one-liner to have one output file. Otherwise, use this oneliner to generate separate files (with extension _OUT.fasta):

$ cat OUTGROUPS.fasta | perl -pe '/^>/?s/^>/\n>/:s/\s*$// if$.>1' | xargs -n 2 | while read -r -a Arr; do ( IFS=$'\n'; echo "${Arr[*]}" ) > ${Arr[0]:1}"_OUT.fasta"; cat ${Arr[0]:1}".fasta" 2>/dev/null >> ${Arr[0]:1}"_OUT.fasta"; done

Best Wishes,
Umer

ADD COMMENT
0
Entering edit mode

Dear Umer,

Thank you for your quick reply. I was away from my computer for a few days, that's why I'm writing you late.

I ran the first code with > GROUP.fasta at the end, It was run with no error but the result is not what I wanted.

Here are some more details about my data.

  1. I have same individuals in each single GENE_?.fasta and I want to add the their corresponding outgroup individually too, don't want to combine files.
  2. The number of entries in OUTGROUP.fasta is more than the number of genes. Say, if I have 2000 entries in my OUTGROUPS.fasta as following:

    $ cat OUTGROUPS.fasta
    >GENE_1
    TTAACTCCTGCTACTTTG
    >GENE_2
    TCTGTCGACGGCAACTGTGAAACTTATC
    .
    .
    .
    >GENE_1999
    GCACCCTGAGCCGAACTGAATTC
    >GENE_2000
    GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA
    

    the number of genes is a sporadic subset of that, say 500 GENE_?.fasta, like following

    $ ls
    GENE_1.fasta  GENE_2.fasta  ... GENE_499.fasta GENE_500.fasta OUTGROUPS.fasta
    
  3. In terms of being linearized or not, both OUTGROUPS.fasta and GENE_?.fasta look like the way you illustrated for the content. Each entry has two lines: one as the header starting with > and the other one as the sequence.

And, when I ran the second code, it gave me the following errors:

./Umer2.sh: line 3: syntax error near unexpected token `newline'
./Umer2.sh: line 3: `cat OUTGROUPS.fasta | perl -pe '/^>/?s/^>/\n>/:s/\s*$// if$.>1' | xargs -n 2 | while read -r -a Arr; do ( IFS=$'\n'; echo "${Arr[*]}" ) > '

I thank you again for your help and hope to have you feedback on this

Kind regards
Hossein

ADD REPLY

Login before adding your answer.

Traffic: 861 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6