Trying to use CD-HIT with a fasta file with 5 mers but not getting any output.
1
0
Entering edit mode
21 months ago
schlogl ▴ 90

Hi there fellas

I am here again. I am trying to cluster a fasta file with a bunch of 5mers (peptides). I download and installed (conda) the cd-hit software.

The file is something like this (that is the initial 10 seqs from the real file):

>0
AAAAA
>1
AAAAC
>2
>3
AAAAE
>4
AAAAF
>5
AAAAG
>6
AAAAH
>7
AAAAI
>8
AAAAK
>9
AAAAL
>10
AAAAM


I using the basics commands like suggested in: https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHIT Command:

\$ cd-hit -i all_poss_1000.txt -c 0.9 -o all_poss_1000_clustered -n 5


However I got this output:

================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Oct 26 2019, 14:51:47
Command: cd-hit -i all_poss_1000.txt -c 0.9 -o
all_poss_1000_clustered -n 5

Started: Tue Jan 28 11:09:21 2020
================================================================
Output
----------------------------------------------------------------
total seq: 0
longest and shortest : 0 and 18446744073709551615
Total letters: 0
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 65M = 65M
Miscellaneous   : 0M
Total           : 75M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 90520634

0  finished          0  clusters

Approximated maximum memory consumption: 75M
writing new database
writing clustering information
program completed !

Total CPU time 0.05


Question: what I am doing wrong???
My input? I divide the full file in files with 1000 entries.

Thank you for your time and help.

Paulo

PS- I will need to read a little bit about the subject! But I really appreciate you help and tips guys! Thank you very much...see ya soon probably 8)

gene clustering CD-HIT • 1.2k views
1
Entering edit mode

Could you post an extract of the real file?

Did you process the file in any other environment than linux? If so there might be hidden characters in it. Moreover, i don't think that a word size (kmer) of 5 makes sense here as all sequences have a different 5mer so none of them will "merge"

1
Entering edit mode

Hi lieven, thats the first 10 sequences form my input file! 8)

1
Entering edit mode

alright, I thought it was a dummy example. my bad

1
Entering edit mode

I think the other answers here are worth experimenting with, and ordinarily I'd advise the exact opposite of what I'm about to suggest, but I think, for this, since the data looks reasonably simple, you don't need something as heavyweight as CD-HIT.

If all of your 5mers are similar to the data you posted, they're essentially already 'aligned' so there's no need to do any alignment step, and instead we can just treat it as a string similarity problem.

Very simply, you can do all-vs-all comparisons and compute a Levenshtein/Hamming/Cosine distance between the 'words' and assign them to clusters based on similarity thresholds (something using a scoring function and itertools.combinations in python would work I'd expect.

Altenatively, you could go down a slightly more heuristic approach with something like: https://stats.stackexchange.com/questions/123060/clustering-a-long-list-of-strings-words-into-similarity-groups

The only issue I can think with this approach is whether you're considering 'similarity' to mean the biological definition of amino acid similarity, in which case it is a bit more complicated than just word matching.

0
Entering edit mode

Hi Joe

I have done something like this:

alphabet = 'ACDEFGHIKLMNPQRSTVWY'

def get_all_possible_kmers(alphabet, k):
"""Returns a list of all possible combinations of kmers of length k from a input alphabet."""
return (''.join(x) for x in itertools.product(alphabet, repeat=k))

def clustering_kmers(kmers_lst):
"""Returns a set of aggruped kmers of length k."""
clusters = defaultdict(set)
for item in kmers_lst:
return clusters


(You get 20**5 kmers.)

To agroup the mer according with the aac composition. I think maybe it can become easier to find out if a mer or group of mers belongs to a class with some important functional motif in a protein. It agroupd mers like that abcd, a2bcd as keys for the mers AAAAA, AAAAB, etc as values.

I am not still thinking in any evolutionary classification.

But I was thinking in one approach like using some python module to try a clustering as you mention.

Again I REALLY appreciate you help and attention, you and others here are always very helping.

Thank you guys some much.

Paulo

0
Entering edit mode

What I mean is, in your example data, all of those sequences are 1 residue different from one another. In my understanding, this makes them all equivalent as a cluster with a 'distance' of 1. Is this what you are looking for, or do you need some more meaningful clustering whereby amino acid properties are conserved (.e.g., should AAAAR cluster with only AAAAH and and AAAAK by virtue of all having positively charged side chains)?

Otherwise all of AAAAx should be a cluster where n is anything in your alphabet ACDEFGHIKLMNPQRSTVWY.

I'm just not clear yet on what you want the clusters to end up like, and what the demarcation between clusters is supposed to be (identity? similarity? does position specificity matter? i.e. is RAAAA similar to AAAAR.)

0
Entering edit mode

Hi Joe

my intentions are to cluster all motifs (20**5) in groups of shared/similar aac compostions, i.e. those that are permutations of a single sequence. there would have different groups of kmers like: a1b1c1d1e1 , a1b1c1d2 , a1b2c2 .... a5 . Each cluster would have to include compositions as a1b4, b1a4, a1c4 , etc. So, each cluster would include all kmers that are permutations of a given aac composition, e.g. abbbb, babbb, etc.

Sorry english is not my first language and some times it is hard to explain doubts in a meaningful way!

Hope now it gets a little more clear!

PS- I try many of the suggestions but didn't get any results with cd-hit! Maybe if I try change my code adding as you suggest a distance measure I could be done in python (which I would like very much).

1
Entering edit mode

Not sure now if you are still looking for cd-hit to work. But if I use your example I can cluster them. What command are you using...? This works  cd-hit -i yourinput -c 0.8 -o cdhitout -n 2 -l 2 

0
Entering edit mode

It is in my first question. I think i tried -c 0.8 but not -n2.

I will check gb.

Thank you

cd-hit -i pentamers -c 0.8 -o cdhitout -n 2 -l 2
================================================================
Program: CD-HIT, V4.8.1, Mar 01 2019, 14:14:47
Command: cd-hit -i pentamers -c 0.8 -o cdhitout -n 2 -l 2

Started: Thu Jan 30 11:29:48 2020
================================================================
Output
----------------------------------------------------------------
Your word length is 2, using 5 may be faster!

Fatal Error:
Failed to open the database file
Program halted !!

cd-hit -i pentamers -c 0.8 -o cdhitout -n 5 -l 2
================================================================
Program: CD-HIT, V4.8.1, Mar 01 2019, 14:14:47
Command: cd-hit -i pentamers -c 0.8 -o cdhitout -n 5 -l 2

Started: Thu Jan 30 11:30:15 2020
================================================================
Output
----------------------------------------------------------------

Fatal Error:
Too short -l, redefine it
Program halted !!

1
Entering edit mode

I will try to use a different file.

Now I got it with cd-hit.

cd-hit -i kmers_cdHit_test.faa -c 0.8 -o cdhitout -n 5 -l 4
================================================================
Program: CD-HIT, V4.8.1, Mar 01 2019, 14:14:47
Command: cd-hit -i kmers_cdHit_test.faa -c 0.8 -o cdhitout -n
5 -l 4

Started: Thu Jan 30 11:42:09 2020
================================================================
Output
----------------------------------------------------------------
total seq: 11
longest and shortest : 5 and 5
Total letters: 55
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 65M = 65M
Miscellaneous   : 0M
Total           : 75M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 90520277

comparing sequences from          0  to         11

11  finished         11  clusters

Approximated maximum memory consumption: 75M
writing new database
writing clustering information
program completed !

Total CPU time 0.08

1
Entering edit mode

I can only recreate your error by giving an input file that does not exist

0
Entering edit mode

I think i give it a much bigger file ;)

1
Entering edit mode

that -n need to be smaller then 5 in your case. (and maybe even smaller then 4 depending on the lvl of similairity you want)This option mainly exist to improve speed but your case is a little more unique. It helps to make an "index" for faster clustering. I have no time now to explain it in detail, blast has a similar mechanism maybe you can learn from that if you really want to know. (you could search megablast vs blastn).

Also maybe increasing the memory limit helps. Try -M 0

0
Entering edit mode

Thank you gb.

I will check out your suggestions to understand the theory with blast.

Thanks for your time and patience. I am very grateful for all you guys have doing for my learning!

1
Entering edit mode

Yeah I don't think CD-HIT is going to work for that (not least because it does care about where in the sequence they differ (e.g. its unlikely to cluster AAAAB with BAAAA).

I think a brute force string searching approach is your only real option. Its easy enough to get string compositions with collections.Counter.

I don't imagine this is going to be quick on a file that large though.

3
Entering edit mode
21 months ago
gb ★ 2.0k

maybe ?  -l length of throw_away_sequences, default 10 

0
Entering edit mode

I will check it out.

0
Entering edit mode

As gb said, at the very least, you need to alter -l, because it seems your sequences are shorter than the default:

-l  length of throw_away_sequences, default 10


You may also need to tweak some other parameters to get meaningful results from such short sequences.

However, what are you doing doesn't make a lot of sense to me. Maybe if you tell what is your goal, we can come up with alternative approaches.

1
Entering edit mode

To add to this as a tip, try to do a global alignment with for example:

>8
AAAAK
>9
AAAAL


And check the identity score and check your own parameters, just a tip. https://www.ebi.ac.uk/Tools/psa/emboss_needle/

0
Entering edit mode

h.mon I see you are a brazilian, can I hit you in a email?

Thanks

1
Entering edit mode

I am not sure contacting me directly will be of any help, I am already over-worked and stretched thin. Besides, keeping communication on the forum increases the chance you will get good suggestions - is there a reason to keep it private?

0
Entering edit mode

No but change some ideas in Portuguese would be nice and maybe easy. But I would love to get some papers, books or suggestions for a data bank analysis. I have some data in counting aac, kmers and I would love to have some tutorial or as a said some tips on doing some stats on it. Because I am not expert in stats.

Paulo