Question: Join multiple sequence lines into one.
13 months ago, MB wrote:

I have a file seq.txt which consists of multiple aligned sequences:

B_phora_cucurbitarum    -----------------------------------------------------MSHIKRD
E_aceosorus_bombacis    RAQAPPGSHNDQPPLLDPLSGILSPLGLGGLTPRSDSLPEHLEMQRRHILERLNERDEDV
A_nfragosa_RCEF_1005    ---------------------------------------------------MKYSILHLA
X_crodochium_bolleyi    -------------------------------------------MRLSNIAGQLAVGAACL
R_cillium_camemberti    --------------------------------------MRILTTGLLLWLLSLINLVSAF
[Bin                                                                               ]

B_phora_cucurbitarum    LSRISGGIGGFLSSIANNIYVFSWDFSLFLLNLVAFKRKVGKVTLEGNPGFGGKWPEYIP
E_aceosorus_bombacis    RAQAPPGSHNDQPPL--------QLAVGAACLEHLEM------------LERLNERDEDV
A_nfragosa_RCEF_1005    ---------MTENALSAEDLAKRG---LDKREVSYTGRITTTFDAAAQLVSNTGVHAFQA
X_crodochium_bolleyi    NDQPPLLDPLSGILSPLGLGGLTP------------------MRL-SNIAGQLAVGAACL
R_cillium_camemberti    -------------------------MS------DIPVHQHSDGRCPVTGISGSNPHPFCP
[Bin                                                                               ]

I want to combine them in a single line according to their names like this:

>B_phora_cucurbitarum   -----------------------------------------------------MSHIKRDLSRISGGIGGFLSSIANNIYVFSWDFSLFLLNLVAFKRKVGKVTLEGNPGFGGKWPEYIP
>E_aceosorus_bombacis   RAQAPPGSHNDQPPLLDPLSGILSPLGLGGLTPRSDSLPEHLEMQRRHILERLNERDEDVRAQAPPGSHNDQPPL--------QLAVGAACLEHLEM------------LERLNERDEDV
>A_nfragosa_RCEF_1005   ---------------------------------------------------MKYSILHLA---------MTENALSAEDLAKRG---LDKREVSYTGRITTTFDAAAQLVSNTGVHAFQA
>X_crodochium_bolleyi   -------------------------------------------MRLSNIAGQLAVGAACLNDQPPLLDPLSGILSPLGLGGLTP------------------MRL-SNIAGQLAVGAACL
>R_cillium_camemberti   --------------------------------------MRILTTGLLLWLLSLINLVSAF-------------------------MS------DIPVHQHSDGRCPVTGISGSNPHPFCP

Awk or sed commands are preferred, but any help would be appreciated. Thanks!

Tags: awk sed sequence alignment
13 months ago, Bastien Hervé (Limoges, CBRS, France) wrote:

As I'm a goat at awk, I'll leave you a Python solution:

# Create a dictionary that will hold the merged sequences
merged_dict = {}
# Open your sequence table
with open("seq_merge.txt", "r") as f:
    for line in f:
        # Split each line into a key (name) and a value (sequence)
        id_seq = line.rstrip().split("\t")[0]
        seq = line.rstrip().split("\t")[1]
        # Check whether the key already exists in the dictionary
        if id_seq not in merged_dict:
            merged_dict[id_seq] = seq
        else:
            merged_dict[id_seq] += seq
# Write to a new file ("w" so that re-running does not append duplicates)
with open("new_seq_merge.txt", "w") as new_seq_merge:
    for key, value in merged_dict.items():
        new_seq_merge.write(">" + key + "\t" + value + "\n")

It is giving an error:

Traceback (most recent call last):
  File "seq.py", line 8, in <module>
    seq = line.rstrip().split("\t")[1]
IndexError: list index out of range

written 13 months ago by MB

What is the delimiter in this line: B_phora_cucurbitarum -----------------------------------------------------MSHIKRD ? A tab? Four spaces?

written 13 months ago by Bastien Hervé

It is a tab.

written 13 months ago by MB

It's hard to investigate without your file.

Try printing line.split("\t") just before the line id_seq = line.rstrip().split("\t")[0]

written 13 months ago by Bastien Hervé
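
A minimal sketch of that kind of check, assuming Python 3: repr() makes tabs (\t) and carriage returns (\r) visible, which plain printing hides. The helper name describe_line and the sample lines are hypothetical, not taken from the asker's file.

```python
def describe_line(line):
    """Report which separator and line ending a raw line uses."""
    body = line.rstrip("\r\n")
    return {
        "has_tab": "\t" in body,
        "has_cr": line.endswith("\r\n") or line.endswith("\r"),
        "fields": body.split("\t"),
    }


# Hypothetical sample lines covering the cases discussed in this thread
samples = [
    "B_phora\t---MSHIKRD\r\n",  # tab-delimited, Windows (DOS) line ending
    "B_phora    ---MSHIKRD\n",  # space-delimited, Unix line ending
    "\r\n",                     # "blank" line that still carries a carriage return
]

for raw in samples:
    print(repr(raw), describe_line(raw))
```

With a real file, opening it as open("seq.txt", newline="") keeps the original line endings, so a stray \r shows up in the repr() output.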

Do these lines really exist in your file?

[Bin                                                                               ]
written 13 months ago by Bastien Hervé

Yes, they do. I found it was a format problem: after converting the file to Unix format, it worked fine. Thanks a lot!

written 13 months ago by MB
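
For readers who hit the same IndexError, here is a hedged sketch (not the answerer's original script) of a merge that tolerates DOS line endings, blank lines, and the [Bin ...] separator lines without converting the file first. It assumes Python 3.7+ (so that a plain dict keeps insertion order) and sequences that contain no internal whitespace. The function name merge_alignment_blocks and the sample data are hypothetical.

```python
def merge_alignment_blocks(lines):
    """Concatenate sequence chunks per name, in order of first appearance."""
    merged = {}
    for raw in lines:
        line = raw.strip()  # also removes a trailing "\r" from DOS files
        if not line or line.startswith("["):
            continue  # skip blank lines and "[Bin ...]" separator lines
        parts = line.split(None, 1)  # split on the first run of tabs or spaces
        if len(parts) != 2:
            continue  # a name without a sequence: nothing to append
        name, seq = parts
        merged[name] = merged.get(name, "") + seq
    return merged


# Hypothetical two-block input with DOS line endings
dos_lines = [
    "sp_A\t--MK\r\n",
    "sp_B\tRAQA\r\n",
    "[Bin      ]\r\n",
    "\r\n",
    "sp_A\tLSRI\r\n",
    "sp_B\t--QL\r\n",
]

for name, seq in merge_alignment_blocks(dos_lines).items():
    print(">" + name + "\t" + seq)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example merge_alignment_blocks(open("seq.txt")).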

GOAT = greatest of all time...

written 13 months ago by cpad0112

I was thinking about the animal

written 13 months ago by Bastien Hervé

Probably typing before lunch?

written 13 months ago by cpad0112

Nah, it's an expression from my country meaning that I'm very bad at writing awk.

written 13 months ago by Bastien Hervé
13 months ago, 5heikki (Finland) wrote:
mkdir whatever
cp inputfile whatever
cd whatever
awk 'BEGIN{FS="\t";ORS=""}{print $2 >> $1}' inputfile
rm inputfile
for f in *; do awk -v N="$f" 'BEGIN{OFS="\t"}{print N,$0}' "$f"; done > ../newFile

I ignored the [Bin lines here. They end up in a file of their own, which you can delete before the for loop.

13 months ago, kloetzl (European Union) wrote:
awk 'BEGIN{c=1}!/\[/{if(NF){n[c]=$1;s[c]=s[c] $2;c++}}/\[/{c=1}END{for(i=1;(i in n);i++){print n[i], s[i]}}' seq.txt

Thanks, but it's not working: it just prints ---------------------------------------- on each line, with some characters in between.

written 13 months ago by MB
13 months ago, cpad0112 (India) wrote:
$ sed -n '/^$/d;/Bin/!p' test.txt| sed -e 's/\s\+/\t/g'  | sort -s -k 1,1 | datamash -g1 collapse 2 | sed 's/,//g'|  awk '{print ">"$1"\n"$2}'

>A_nfragosa_RCEF_1005
---------------------------------------------------MKYSILHLA---------MTENALSAEDLAKRG---LDKREVSYTGRITTTFDAAAQLVSNTGVHAFQA
>B_phora_cucurbitarum
-----------------------------------------------------MSHIKRDLSRISGGIGGFLSSIANNIYVFSWDFSLFLLNLVAFKRKVGKVTLEGNPGFGGKWPEYIP
>E_aceosorus_bombacis
RAQAPPGSHNDQPPLLDPLSGILSPLGLGGLTPRSDSLPEHLEMQRRHILERLNERDEDVRAQAPPGSHNDQPPL--------QLAVGAACLEHLEM------------LERLNERDEDV
>R_cillium_camemberti
--------------------------------------MRILTTGLLLWLLSLINLVSAF-------------------------MS------DIPVHQHSDGRCPVTGISGSNPHPFCP
>X_crodochium_bolleyi
-------------------------------------------MRLSNIAGQLAVGAACLNDQPPLLDPLSGILSPLGLGGLTP------------------MRL-SNIAGQLAVGAACL

Install datamash either from here or from your distro's repositories (on Debian-based systems: sudo apt install datamash -y; with conda: conda install datamash -y).
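
If datamash is not available, the same sort-then-collapse idea can be sketched in Python. This is an illustration under assumptions, not a translation endorsed by the answerer: the function name collapse_to_fasta is hypothetical, and it relies on list.sort being stable (as sort -s is) so that chunks of the same name keep their file order.

```python
from itertools import groupby


def collapse_to_fasta(lines):
    """Group name/sequence rows by name and join the chunks, FASTA-style."""
    rows = []
    for raw in lines:
        fields = raw.split()  # split on any run of whitespace
        # Keep only "name sequence" rows; drop blank and "[Bin ...]" lines
        if len(fields) == 2 and not raw.lstrip().startswith("["):
            rows.append(fields)
    rows.sort(key=lambda row: row[0])  # stable sort on the name column
    records = []
    for name, group in groupby(rows, key=lambda row: row[0]):
        records.append(">" + name + "\n" + "".join(seq for _, seq in group))
    return records


# Hypothetical two-block input
example = ["b\tAA\n", "a\tCC\n", "[Bin   ]\n", "\n", "b\tGG\n", "a\tTT\n"]
print("\n".join(collapse_to_fasta(example)))
```

As with the pipeline above, records come out sorted by name rather than in their original order.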
