Question: Sort a fasta file with ~5k sequences and match to a cluster txt file with sequence identifiers
1
gravatar for hridindu.0
2.2 years ago by
hridindu.00
hridindu.00 wrote:

Hey guys. I have a fasta file with about 5k sequences that looks something like this:

>gi|944203132|gb|ALL55193.1| TetR [synthetic construct]
MSRLDKSKVINSALELLNEVGIEGLTTRKLAQKLGVEQPTLYWHVKNKRALLDALAIEMLDRHHTHFCPLEGESWQDFLR
NNAKSFRCALLSHRDGAKVHLGTRPTEKQYETLENQLAFLCQQGFSLENALYALSAVGHFTLGCVLEDQEHQVAKEERET
PTTDSMPPLLRQAIELFDHQGAEPAFLFGLELIICGLEKQLKCESGSSTLFGGANFNQSGNIADSSLSFTFTNSSNGPNL
ITTQTNSQALSQPIASSNVHDNFMNNEITASKIDDGNNSKPLSPGWTDQTAYNAFGITTGMFNTTTMDDVYNYLFDDEDT
PPNPKKE

>gi|307752607|gb|ADN93291.1| rTetR [synthetic construct] >gi|749446379|gb|AJF22769.1| rTetR [synthetic construct]
MSRLDKSKVINGALELLNGVGIEGLTTRKLAQKLGVEQPTLYWHVKNKRALLDALPIEMLDRHHTHFCPLEGESWQDFLR
NNAKSFRCALLSHRDGAKVHLGTRPTEKQYETLENQLAFLCQQGFSLENALYALSAVGHFTLGCVLEEQEHQVAKEERET
PTTDSMPPLLRQAIELFDRQGAEPAFLFGLELIICGLEKQLKCESGSSTLFGGANFNQSGNIADSSLSFTFTNSSNGPNL
ITTQTNSQALSQPIASSNVHDNFMNNEITASKIDDGNNSKPLSPGWTDQTAYNAFGITTGMFNTTTMDDVYNYLFDDEDT
PPNPKKE

>gi|409127751|gb|AFV15310.1| TG [Cloning vector pRS416 pRS-TG]
MSRLDKSKVINSALELLNEVGIEGLTTRKLAQKLGVEQPTLYWHVKNKRALLDALAIEMLDRHHTHFCPLEGESWQDFLR
NNAKSFRCALLSHRDGAKVHLGTRPTEKQYETLENQLAFLCQQGFSLENALYALSAVGHFTLGCVLEDQEHQVAKEERET
PTTDSMPPLLRQAIELFDHQGAEPAFLFGLELIICGLEKQLKCESGSGGGGGGGGGVSKGEELFTGVVPILVELDGDVNG
HKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDG
NYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQ
NTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYKLEGGRA

>gi|953781338|gb|ALO92858.1| TetR-family transcriptional regulator [Streptomyces hygroscopicus subsp. limoneus]
MCAMARPRKPLLSTDRIVETARDLVDREGLAAVSTRRLAAELGVSGPSLYNHFRTKDEILEAVADSVSAQVDLSMFEDGR
DWRTALHDWAVSYRAALRDHPNIVPVLARGPGRRPAALRLADAVYGAMVDAGWPPAQATSIGALMRYFIMGSALGSFAGG
FVDDASAYDPADYPHLGQAHLLAEQQEKIDERAFEAGLAALLDGLAQQYRQVRV

and a file made through CD HIT that clusters them like this:

>Cluster 0
>0  214aa, >gi|953781338|gb|ALO92858.1|... at 93.93%
>1  211aa, >gi|504484658|ref|WP_014671760.1|... at 95.26%
>2  211aa, >gi|664069898|ref|WP_030608495.1|... at 94.79%
>3  213aa, >gi|927897990|ref|WP_053850251.1|... at 92.49%
>4  213aa, >gi|973382757|ref|WP_059125750.1|... at 90.61%
>5  211aa, >gi|664112575|ref|WP_030650029.1|... at 94.79%
>6  210aa, >gi|518972364|ref|WP_020128239.1|... at 91.43%
>7  217aa, >gi|822881081|emb|CQR64871.1|... at 91.71%
>Cluster 1
>0  210aa, >gi|982918533|ref|WP_060397058.1|... at 98.10%
>1  210aa, >gi|981264640|ref|WP_059483404.1|... at 98.10%
>2  210aa, >gi|1061966479|ref|WP_069272452.1|... at 91.90%
>3  210aa, >gi|981738349|ref|WP_059932066.1|... at 98.57%
>4  210aa, >gi|981907955|ref|WP_060092158.1|... at 99.05%
>Cluster 2
>0  226aa, >gi|490297694|ref|WP_004193135.1|... at 98.67%
>1  225aa, >gi|446392873|ref|WP_000470728.1|... at 99.11%
>2  225aa, >gi|661250759|dbj|BAP10656.1|... at 98.67%
>3  225aa, >gi|254967139|gb|ACT97616.1|... at 98.67%
>4  225aa, >gi|545237245|ref|WP_021536379.1|... at 98.67%
>5  221aa, >gi|14794587|gb|AAK73397.1|AF327714_2... at 95.93%

How do I sort the fasta file using the cluster file which only has gi identifiers? Or in other words, how would I concatenate the respective peptide sequences from the fasta to their identifier in the cluster file? I'm using biopython, but I'm very very green to bioinformatics so I apologize if this is a naive question.

blast sequence alignment • 1.2k views
ADD COMMENTlink modified 2.2 years ago by Sej Modha3.9k • written 2.2 years ago by hridindu.00

Have you looked at a bunch of cd-hit output manipulation scripts provided with the package?

ADD REPLYlink written 2.2 years ago by Sej Modha3.9k

Yeah I've been sort of looking through the package for an appropriate script. It's unfortunately not very well documented so there's a little bit of guess work as to what each perl script provided actually does.

ADD REPLYlink written 2.2 years ago by hridindu.00

I am sorry but it's not clear to me what you are trying to do. Could you please explain what is your goal?

ADD REPLYlink written 2.2 years ago by Sej Modha3.9k

I am trying to create a phylogenetically unbiased PSSM for the TetR family of transcription repressors. To do so, I input ~10000 aa sequences from the TetR family (from BLAST) into the CD HIT webserver which uses some algorithm to cluster phylogentically similar sequences. The output file(s) contains only the sequence identifiers >gi|1239392| like shown above assigned to individual clusters, but not the actual sequences. I'm trying to sort either the original input fasta file by the clusters assigned by CD HIT, or sort each individual sequence to it's respective id in the cluster file so that I can create a multiple sequence alignment/PSSM for each cluster separately.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by hridindu.00

Since no one liner enthusiasts commented. I thought i try with awk :). Definitely the python solution by steve is much much better and recommended but I just couldn't help :) . This will pipe each sequence into a separate file which has a filename of cluster id. For example sequences in fasta file which fall in cluster 1 will be piped to file named 1. Surely not a robust solution and fill fail even the slightest formatting difference in the input files but it is just for fun! The input filenames being sequences.clusters for the cluster file and sequences.fasta for the fastafile.

awk 'BEGIN{lock=0}{if(FILENAME == ARGV[1]){if($1 ~ /^>/){if(substr($1,1,4)==">Clu"){x=$2}else{a[$3]=x} }}else{if(substr($1,1,1)==">"){if($1 in a){lock=1;file=a[$1]}else{lock=0}};if(lock == 1){print $0 > file  }  }}'  <(sed -e   's/\s\+/\t/g;s/\.\.\./\t/'  sequences.clusters) sequences.fasta
ADD REPLYlink modified 2.2 years ago by genomax59k • written 2.2 years ago by microfuge950
2

you are very brave. One of the reasons I started using Python was so I didn't have to use awk anymore :)

ADD REPLYlink written 2.2 years ago by steve1.8k

You are hurting my feelings :) :) as awk is just a diminutive of the word awesome :)

ADD REPLYlink written 2.2 years ago by microfuge950
1
gravatar for steve
2.2 years ago by
steve1.8k
United States
steve1.8k wrote:

Going off of only the items you have posted, I would try the following:

First, adjust that Cluster file so that it can be read in by Biopython. In the terminal (bash), I would run something like this:

# the cluster file that you posted, saved to a txt:
$ head cluster.txt
>Cluster 0
>0  214aa, >gi|953781338|gb|ALO92858.1|... at 93.93%
>1  211aa, >gi|504484658|ref|WP_014671760.1|... at 95.26%
>2  211aa, >gi|664069898|ref|WP_030608495.1|... at 94.79%
>3  213aa, >gi|927897990|ref|WP_053850251.1|... at 92.49%
>4  213aa, >gi|973382757|ref|WP_059125750.1|... at 90.61%
>5  211aa, >gi|664112575|ref|WP_030650029.1|... at 94.79%
>6  210aa, >gi|518972364|ref|WP_020128239.1|... at 91.43%
>7  217aa, >gi|822881081|emb|CQR64871.1|... at 91.71%

# get rid of everything except the gi ID's
$ cat cluster.txt | sed -e 's/^>.*\(>.*\)$/\1/g' -e 's/^\(.*\)\(\.\.\..*\)$/\1/'| grep -E -v '^>Cluster.*' > cluster_new.txt

explanation:
sed
-e 's/^>.*\(>.*\)$/\1/g' : match the second '>' character in the string and keep it and all characters that follow it
-e 's/^\(.*\)\(\.\.\..*\)$/\1/' : (after previous operation) match the '...' string and keep only the preceding characters
grep -E -v '^>Cluster.*': return all lines that do not start with '>Cluster'

# preview the output
$ cat cluster_new.txt
>gi|953781338|gb|ALO92858.1|
>gi|504484658|ref|WP_014671760.1|
>gi|664069898|ref|WP_030608495.1|
>gi|927897990|ref|WP_053850251.1|
>gi|973382757|ref|WP_059125750.1|
>gi|664112575|ref|WP_030650029.1|
>gi|518972364|ref|WP_020128239.1|
>gi|822881081|emb|CQR64871.1|

Some details about the sed, bash, and grep used here:

$ bash --version
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)

$ sed --version
GNU sed version 4.2.1

$ grep --version
GNU grep 2.6.3

Now, in Python, I would try something like this:

#!/usr/bin/env python
# python 2.7

# Sort a fasta file with ~5k sequences and match to a cluster txt file with sequence identifiers
# http://biopython.org/wiki/SeqIO
# http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc16

from Bio import SeqIO

# the paths to the input and output files
fasta_file = "/ifs/home/steve/biopython_tests/test.fa" # your unsorted fasta file
cluster_file = "/ifs/home/steve/biopython_tests/cluster_new.txt" # your file with the modified cluster IDs
output_file = "/ifs/home/steve/biopython_tests/sorted.fa" # the output sorted sequences


# start an empty list for the ID
ID_list = []

# make a list of the ID's
for record in SeqIO.parse(open(cluster_file, "rU"), "fasta"):
    # add each record ID to the list
    ID_list.appendrecord.id)

print ID_list


# iterate over the ID's, save the matching fasta entries
for my_ID in ID_list:
    # print my_ID
    for record in SeqIO.parse(open(fasta_file, "rU"), "fasta"):
        if my_ID in record.id:
            # for appending to the output file
            output_handle = open(output_file, "a")
            SeqIO.write(record, output_handle, "fasta")
            output_handle.close()

If you had some more thorough examples it would be easier to test this to make sure it works, and I am sure there are other methods for this that might be better. Note that using only the examples you posted there, this does not actually output anything because none of the Cluster ID's you listed match your example Fasta records.

EDIT: small change to the sed command based on more information included in OP, here is sample output of the script (only one of the given sequences matched the given ID's):

$ cat sorted.fa
>gi|953781338|gb|ALO92858.1| TetR-family transcriptional regulator [Streptomyces hygroscopicus subsp. limoneus]
MCAMARPRKPLLSTDRIVETARDLVDREGLAAVSTRRLAAELGVSGPSLYNHFRTKDEIL
EAVADSVSAQVDLSMFEDGRDWRTALHDWAVSYRAALRDHPNIVPVLARGPGRRPAALRL
ADAVYGAMVDAGWPPAQATSIGALMRYFIMGSALGSFAGGFVDDASAYDPADYPHLGQAH
LLAEQQEKIDERAFEAGLAALLDGLAQQYRQVRV

Python version used here:

$ python --version
Python 2.7.3
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by steve1.8k

Hey Steve, thank's a lot for your reply. I'll edit the post to include some more information. Unfortunately, the grep command simply deleted the "cluster n" headings and left everything else. I'm pretty unfamiliar with regular expressions in UNIX, so I'm trying to understand that line still, maybe so I can modify it to leave the cluster headings and excise everything else but the gi IDs.

ADD REPLYlink written 2.2 years ago by hridindu.00

yes that was what the grep command was supposed to do. The sed commands are operating on the rest of the entries. I can update the post with a little more explanation of them. Keep in mind that the arguments might be slightly different for sed depending on which version you are using; sed on OS X bash has some small differences from the version that usually ships with Linux which I used here.

ADD REPLYlink written 2.2 years ago by steve1.8k

also keep in mind that you could also probably skip the bash steps if you find a way to do something similar in Python, which may or may not be easier. Since you mentioned Biopython I wrote this in the context of making it work with Biopython.

ADD REPLYlink written 2.2 years ago by steve1.8k

updated the post with more details

ADD REPLYlink written 2.2 years ago by steve1.8k

Thank you so much for your help. I think I can take it from here and modify it to my needs as they arise. Really appreciate the level of detail.

ADD REPLYlink written 2.2 years ago by hridindu.00
1
gravatar for Sej Modha
2.2 years ago by
Sej Modha3.9k
Glasgow, UK
Sej Modha3.9k wrote:

cd-hit's make_multi_seq.pl would be helpful for generating individual cluster multi-fasta file.

cd-hit -i db.fa -o dbout -d 0 -g 1
make_multi_seq.pl db.fa dbout.clstr multi-seq 1
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Sej Modha3.9k

My god you made my life so much better with this. Beers are on me if we ever run into eachother.

ADD REPLYlink written 2.2 years ago by hridindu.00
1

Beers can wait but for now you can "accept" this answer (if it does what you needed) by using the check mark against @Sej's post.

ADD REPLYlink written 2.2 years ago by genomax59k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1127 users visited in the last hour