Add a seq (the outgroup) to the top of its corresponding gene
2
0
Entering edit mode
8.4 years ago
hosseinv ▴ 20

I have a fasta file containing N sequences (outgroups). On the other hand, I have N fast files each having the orthologous gene sequences of several samples.

The idea is to put each of the outgroups at the top of their corresponding genes.

Example:

The outgroup looks like below, OUTGROUPS.fas

> GENE 1
TTAACTCCTGCTACTTTG
> GENE 2
TCTGTCGACGGCAACTGTGAAACTTATC
> GENE 3
GCACCCTGAGCCGAACTGAATTC
> GENE 4
GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA


Here is the fasta file of one of my genes, called GENE1.fas

>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG


I want to get the following:

> GENE 1
TTAACTCCTGCTACTTTG
>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG


This is done for GENE1, I need same job for the rest of the GENES. As you see, the name of entries in outgroup fasta file (GENE1, GENE2, ...) are the same as their orthologous genes (GENE1.fas, GENE2.fas ...).

I appreciate any help on this post.

Best, Hossein

python fasta concat perl • 2.9k views
0
Entering edit mode

Lemme make sure I understand this correctly:

Each gene FASTA file will always have a name (GENE1.fas) which corresponds to its associated header in the OUTGROUPS.fas file (>GENE 1), correct?

Do you want all of the results in a single output file?

2
Entering edit mode
8.4 years ago
Dan D 7.3k

Assuming:

• The fasta file for each gene will always have a name which corresponds to its associated header in the OUTGROUPS.fas file, minus the spaces, and with an fas extension
• Your fas files are all in the same directory
• You want all of the results in a single output file

Then:

• Paste the following into a file and save it as any name you choose, followed by the extension .py
• Place the file in the same directory with all the other .fas files and run it
seq_dict = {}

def populate_gene(gene):
gene_file_name = gene + ".fas"
gfn = open(gene_file_name, 'r')
gfn.close()
return gene_file_data

with open('OUTGROUPS.fas', 'r') as og:
for rawline in og:
line = rawline.strip()
if line.startswith('>'):
gene = line[1:].replace(' ', '')
seq_dict[gene] = {'raw_gene_name': line, 'head_seq': []}

try:
seq_dict[gene]['gene_parts'] = populate_gene(gene)

except Exception:
print "Wasn't able to open file for {0}.".format(gene)
pass
else:

for gene in seq_dict:
if 'gene_parts' in seq_dict[gene]:
with open('{0}_OUT.fas'.format(gene), 'w') as r:
r.write(seq_dict[gene]['raw_gene_name'] + '\n')
r.writelines(seq_dict[gene]['gene_parts'])

0
Entering edit mode

details modified below.

0
Entering edit mode

Run the script like this:

python Deedee.py


Not like this:

./Deedee.py


I also updated the script to ignore missing gene files (since you implied that there wouldn't always be a corresponding file for a gene name.

0
Entering edit mode

Dear Deedee, I ran python Deedee.py this time with the new script you wrote and now I get the following error:

Traceback (most recent call last):
File "Deedee2.py", line 22, in <module>
print "Wasn't able to open file for {}.".format(gene)
ValueError: zero length field name in format

0
Entering edit mode

Oops, Sunday is not my best day for looking at code :)

0
Entering edit mode

Looks we are moving, but still needs some corrections. This time the code was run and output the result.fas, plus some errors. for the sake of consistency I suggest we both use the following example files:

$ls GENE_1.fas GENE_3.fas OUTGROUPS.fas$ cat GENE_1.fas
>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG

$cat GENE_3.fas >Ind_1 TTAAATCCGTCTGCTTCTTTG >Ind_2 TTAAATCCTGGTGCTTCTTTG >Ind_3 TTAAATCCTGCTTCTGTCTTTG$ cat OUTGROUP.fas
>GENE_1
TTAACTCCTGCTACTTTG
>GENE_2
TCTGTCGACGGCAACTGTGAAACTTATC
>GENE_3
GCACCCTGAGCCGAACTGAATTC
>GENE_4
GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA


As you see, OUTGROUPS.fas has four sequences (genes), whereas I have population data only for two genes (GENE_1.fas, and GENE_3.fas).

The important thing is I want the results INDIVIDUALLY for each of the two available genes. like following:

$ls GENE_1.fas GENE_3.fas OUTGROUPS.fas GENE_1_OUT.fas GENE_3_OUT.fas$ cat GENE_1_OUT.fas
>GENE_1
TTAACTCCTGCTACTTTG
>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG

$cat GENE_3_OUT.fas >GENE_3 GCACCCTGAGCCGAACTGAATTC >Ind_1 TTAAATCCGTCTGCTTCTTTG >Ind_2 TTAAATCCTGGTGCTTCTTTG >Ind_3 TTAAATCCTGCTTCTGTCTTTG  Thank you Deedee for your patience. ADD REPLY 0 Entering edit mode OK I edited the script. I didn't get a chance to test it but it wasn't a big change. ADD REPLY 0 Entering edit mode This time the code gave me two separate fasta files, GENE_1_OUT.fas and GENE_2_OUT.fas; The first one is done perfectly, but the second one contains just the outgroup sequence (>GENE_2), plus I am expecting the second file to be GENE_3_OUT.fas not GENE_2_OUT.fas ADD REPLY 1 Entering edit mode Shifting requirements are fun! Code updated. ADD REPLY 0 Entering edit mode THANKS A LOT Deedee!! The updated version of the code worked perfectly in my example data set. And I was wondering if I can shift one more time the requirement!! Well the fact is that my gene name in my real data is not like GENE1 GENE2 ..., but instead they look like below: EGLOB_v21_A_0010_147754_150523  So, the entry in the OUTGROUP.fas is like: >EGLOB_v21_A_0010_147754_150523  and, the fasta file will be like: EGLOB_v21_A_0010_147754_150523.fas  The code presumably does not accept these names, says: Wasn't able to open file for EGLOB_v21_A_0010_147754_150523  They are thousands of genes so I don't want to change the names as GENE_1 and so forth. Thanks again for all your help, Deedee. ADD REPLY 0 Entering edit mode Hmmm, that gene name should be okay. Two quick questions: • Did it successfully process other files in the same run where you saw this error? • Can you verify that the file it claims is missing actually exists and has exactly the name EGLOB_v21_A_0010_147754_150523.fas? Thanks! ADD REPLY 0 Entering edit mode Yes I checked it, the names exist and are correct. I think its simply because of the names, as I changed the previous example files with this new name format and it did not work, whereas with names like GENE_1 ..., it works well. ADD REPLY 0 Entering edit mode The new name format on your genes and files shouldn't make any difference at all, as long as there are no spaces anywhere in the file name. The script removes any spaces it finds in the gene name. The script is trying to find EGLOB_v21_A_0010_147754_150523.fas and is not able to. The script should be in the same directory as all of your FASTA files, and should be run from that directory. Similarly, all the FASTA files it's looking for, including OUTGROUPS.fas, should be in that directory. ADD REPLY 0 Entering edit mode Dear Dedee, Nothing changed unless my gene names (other conditions remained same). I tried a number of names like below: EGLOB_v21_A_0010_147754_150523.fas EGLOB_v21_A_0010.fas EGLOB_v21_A.fas EGLOB_v21.fas EGLOB_v.fas EGLOBv.fas EGLOB_1.fas  and I finally found that ((only the last configuration** (EGLOB_1.fas) works, even EGLOBv.fas does not work. Thank you, Hossein ADD REPLY 0 Entering edit mode Hmmm, I'd like to take a look at your actual files. Can you post the compressed files to Dropbox and PM a link to me? ADD REPLY 0 Entering edit mode Dear Deedee, Here is a link to a subset of my data, compressed. https://www.dropbox.com/s/sj7oqhje452k5z0/Deedee.7z Thanks again for your time and patience. Best, Hossein ADD REPLY 1 Entering edit mode Thanks! For some reason, I've no idea why, it seems that using the format() function in your filename string is causing the problem in the version of python you're using. I tried on three different test boxes and I was able to replicate your problem on the third, which was running an older version of Python 2.6. I updated the script accordingly. Let me know if it doesn't work. ADD REPLY 0 Entering edit mode Hi Deedee :) DONE ! The version of mine is 2.6.6, but the updated script did the job perfectly. I thank you a lot for all your help in the past days. Best wishes, Hossein ADD REPLY 1 Entering edit mode 8.4 years ago Alright, I quickly wrote the solution for you (assuming the FASTA files have the same name as the FASTA IDs and have an extension .fasta. Say I begin with three files in the current folder: $ ls
GENE_1.fasta  GENE_2.fasta  OUTGROUPS.fasta


The content of these files are:

$cat OUTGROUPS.fasta >GENE_1 TTAACTCCTGCTACTTTG >GENE_2 TCTGTCGACGGCAACTGTGAAACTTATC >GENE_3 GCACCCTGAGCCGAACTGAATTC >GENE_4 GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA$ cat GENE_1.fasta
>Ind_1
TTAAATCCTGCTTCTTTG
>Ind_2
TTAAATCCTGCTTCTTTG
>Ind_3
TTAAATCCTGCTTCTTTG

$cat GENE_2.fasta >Ind_11 TTAAATCCGTCTGCTTCTTTG >Ind_22 TTAAATCCTGGTGCTTCTTTG >Ind_33 TTAAATCCTGCTTCTGTCTTTG  Then you can use the following one-liner to accomplish all this: $ cat OUTGROUPS.fasta | perl -pe '/^>/?s/^>/\n>/:s/\s*$// if$.>1' | xargs -n 2 | while read -r -a Arr; do ( IFS=$'\n'; echo "${Arr[*]}" ); cat ${Arr[0]:1}".fasta" 2>/dev/null; done >GENE_1 TTAACTCCTGCTACTTTG >Ind_1 TTAAATCCTGCTTCTTTG >Ind_2 TTAAATCCTGCTTCTTTG >Ind_3 TTAAATCCTGCTTCTTTG >GENE_2 TCTGTCGACGGCAACTGTGAAACTTATC >Ind_11 TTAAATCCGTCTGCTTCTTTG >Ind_22 TTAAATCCTGGTGCTTCTTTG >Ind_33 TTAAATCCTGCTTCTGTCTTTG >GENE_3 GCACCCTGAGCCGAACTGAATTC >GENE_4 GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA  Please note that perl -pe '/^>/?s/^>/\n>/:s/\s*$// if$.>1' is to linearize the OUTGROUPS.fasta file and if it is already linearized then you can do without this bit of code. You may redirect the output by adding > GROUPS.fasta at the end of the one-liner to have one output file. Otherwise, use this oneliner to generate separate files (with extension _OUT.fasta): $ cat OUTGROUPS.fasta | perl -pe '/^>/?s/^>/\n>/:s/\s*$// if$.>1' | xargs -n 2 | while read -r -a Arr; do ( IFS=$'\n'; echo "${Arr[*]}" ) > ${Arr[0]:1}"_OUT.fasta"; cat${Arr[0]:1}".fasta" 2>/dev/null >> ${Arr[0]:1}"_OUT.fasta"; done  Best Wishes, Umer ADD COMMENT 0 Entering edit mode Dear Umer, Thank you for your quick reply. I was away from my computer for a few days, that's why I'm writing you late. I ran the first code with > GROUP.fasta at the end, It was run with no error but the result is not what I wanted. Here are some more details about my data. 1. I have same individuals in each single GENE_?.fasta and I want to add the their corresponding outgroup individually too, don't want to combine files. 2. The number of entries in OUTGROUP.fasta is more than the number of genes. Say, if I have 2000 entries in my OUTGROUPS.fasta as following: $ cat OUTGROUPS.fasta
>GENE_1
TTAACTCCTGCTACTTTG
>GENE_2
TCTGTCGACGGCAACTGTGAAACTTATC
.
.
.
>GENE_1999
GCACCCTGAGCCGAACTGAATTC
>GENE_2000
GGTTAACAGAACTTGTTCTTCACATGCAGAGTCTTGA


the number of genes is a sporadic subset of that, say 500 GENE_?.fasta, like following

$ls GENE_1.fasta GENE_2.fasta ... GENE_499.fasta GENE_500.fasta OUTGROUPS.fasta  3. In terms of being linearized or not, both OUTGROUPS.fasta and GENE_?.fasta look like the way you illustrated for the content. Each entry has two lines: one as the header starting with > and the other one as the sequence. And, when I ran the second code, it gave me the following errors: ./Umer2.sh: line 3: syntax error near unexpected token newline' ./Umer2.sh: line 3: cat OUTGROUPS.fasta | perl -pe '/^>/?s/^>/\n>/:s/\s*$// if$.>1' | xargs -n 2 | while read -r -a Arr; do ( IFS=$'\n'; echo "\${Arr[*]}" ) > '


I thank you again for your help and hope to have you feedback on this

Kind regards
Hossein