Dear all, I am new to bioinformatics and I am stuck on my project. I have Prokka results from 7 different metagenomic assemblies of different varieties of Azolla fern. I have collected a list of UniProt IDs from every bin of these 7 metagenomic assemblies. Now I want to find out which UniProt IDs are shared across the different assemblies. The number of UniProt IDs in each bin runs to several thousand, so I cannot check them manually. My second question is how I can use these UniProt IDs in R or for PCA. Any help or guidance will be highly appreciated. Thanks, manpy
What format is the result data in? Columnar, or just a text collection of UniProt IDs in file(s)?
It is a single column listing all the UniProt IDs. I used grep to pull only the UniProt IDs out of the GFF files in my Prokka results.
Are these in separate files? You can sort and uniq the files to get non-redundant lists. Then the next step would be to use join (or comm) to identify shared IDs. Give that a try and post if you run into problems.
They are in separate files for each bin of each metagenomic assembly. OK, I will try. Thank you so much, sir.
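For reference, the sort/uniq/comm suggestion above could look roughly like this. It is only a sketch: it assumes one UniProt ID per line in each file, and the file names (binA.txt and so on) and the IDs are made-up demo data, not taken from the actual Prokka output.

```shell
# Hypothetical demo data: one UniProt ID per line, one file per bin.
workdir=$(mktemp -d)
printf 'P12345\nQ8ZZZ1\nP12345\nO11111\n' > "$workdir/binA.txt"
printf 'Q8ZZZ1\nP12345\nA0A000\n'         > "$workdir/binB.txt"
printf 'P12345\nQ8ZZZ1\nQ8ZZZ1\nB9B999\n' > "$workdir/binC.txt"

# Step 1: sort each list and drop duplicates (sort -u is sort | uniq).
# LC_ALL=C keeps the sort order consistent with what comm expects.
for f in "$workdir"/bin*.txt; do
    LC_ALL=C sort -u "$f" > "${f%.txt}.sorted"
done

# Step 2: intersect the lists one file at a time.
# comm -12 prints only the lines present in both sorted inputs.
shared="$workdir/shared.txt"
first=1
for f in "$workdir"/bin*.sorted; do
    if [ "$first" -eq 1 ]; then
        cp "$f" "$shared"
        first=0
    else
        LC_ALL=C comm -12 "$shared" "$f" > "$shared.tmp"
        mv "$shared.tmp" "$shared"
    fi
done

# IDs found in every bin.
cat "$shared"
```

Because comm only compares two files at a time, the loop keeps a running intersection, so it works the same for 3 files or for all bins of 7 assemblies.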
I am using
cat --l 'UniProt.*' | sort -d | uniq --count /path/to/list of uniprot/
I am still getting several thousand IDs; the command only counts how many times each ID is repeated in the genome of each bin. I run this command on every bin of the different metagenomic assemblies because each bin corresponds to a specific organism (a bacterium), which is why I am not combining all the UniProt IDs within each metagenomic assembly. My question is how to make these tables presentable for my report, and how I can use them for PCA.
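One possible way to get both a report-ready table and something PCA can work with is a presence/absence matrix: one row per UniProt ID, one column per bin, with 1 if the ID occurs in that bin and 0 otherwise. The sketch below builds such a table with awk. It assumes one ID per line per file; the file names, the output name presence_absence.tsv, and the IDs are hypothetical demo values.

```shell
# Hypothetical demo data: one UniProt ID per line, one file per bin.
workdir=$(mktemp -d)
printf 'P12345\nQ8ZZZ1\nP12345\n' > "$workdir/binA.txt"
printf 'Q8ZZZ1\nA0A000\n'         > "$workdir/binB.txt"
printf 'P12345\nB9B999\n'         > "$workdir/binC.txt"

matrix="$workdir/presence_absence.tsv"
(
    cd "$workdir" || exit 1

    # Header row: one column per bin, named after the file.
    printf 'uniprot_id'
    for f in bin*.txt; do printf '\t%s' "${f%.txt}"; done
    printf '\n'

    # Body: record every (ID, file) pair seen, then print 1/0 per bin
    # in the same order as the files were given on the command line.
    awk '
        NF { present[$1, FILENAME] = 1; ids[$1] = 1 }
        END {
            for (id in ids) {
                row = id
                for (i = 1; i < ARGC; i++)
                    row = row "\t" (((id, ARGV[i]) in present) ? 1 : 0)
                print row
            }
        }
    ' bin*.txt | LC_ALL=C sort
) > "$matrix"

cat "$matrix"
```

The resulting tab-separated file opens directly in a spreadsheet for the report. In R it could be loaded with read.delim (with row.names = 1) and, after transposing so that bins are the rows, passed to prcomp; whether PCA on 0/1 data is the best choice for this dataset is a separate question.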