Finding common genes
8.9 years ago
div ▴ 60

Hi all,

I have two lists of genes in txt format from the S. aureus genome, generated by the RAST Server and the PGAAP Pipeline.

Now I need to process these lists so that I can find as many common genes as possible between the two files.

Is there any way to check for each gene in one file whether it is present in the other file?

Linux or Windows suggestions are welcome.

Thank you in advance

gene • 2.6k views

Do you need to get the common lines (genes) between two files? If the answer is yes, read about the sort and comm Linux commands.
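As a minimal sketch of the sort and comm approach: the file names and the three-line sample lists below are made up for illustration, so substitute your own RAST and PGAAP files. comm only works on sorted input, and both files must be sorted with the same collation, hence the explicit LC_ALL=C.

```shell
# Hypothetical sample data standing in for the two annotation lists.
workdir=$(mktemp -d)
printf '%s\n' 'Sodium/glutamate symporter' 'putative esterase' \
    'Aldose 1-epimerase (EC 5.1.3.3)' > "$workdir/rast.txt"
printf '%s\n' 'putative esterase' 'Sodium/glutamate symporter' \
    'Formiminoglutamase (EC 3.5.3.8)' > "$workdir/pgaap.txt"

# Sort both lists the same way and drop duplicate lines (-u).
LC_ALL=C sort -u "$workdir/rast.txt"  > "$workdir/rast.sorted.txt"
LC_ALL=C sort -u "$workdir/pgaap.txt" > "$workdir/pgaap.sorted.txt"

# -1 and -2 suppress the lines unique to the first and second file,
# so only the lines present in both files are printed.
LC_ALL=C comm -12 "$workdir/rast.sorted.txt" "$workdir/pgaap.sorted.txt" \
    > "$workdir/common.txt"
cat "$workdir/common.txt"
```

Using comm -23 or comm -13 instead of comm -12 would list the lines found only in the first or only in the second file. Note that comm compares whole lines, so the entries must be spelled identically in both files.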


What is the format of your lists? There are many ways of doing this, but it helps to know how your data is formatted.


The gene list is alphanumeric and also contains special characters. I need to remove all the numbers and special characters from the list so that it contains only alphabetic characters, except for the EC and FIG identifiers.

Example of the gene list:

Formiminoglutamase (EC 3.5.3.8)
FIG01108370: hypothetical protein
Ribose 5-phosphate isomerase A (EC 5.3.1.6)
Uncharacterized protein conserved in bacteria
Aldose 1-epimerase (EC 5.1.3.3)
Protein of unknown function UPF0060
ABC-type Na+ efflux pump, permease component
ABC-type transport system, ATPase component
FIG01108220: hypothetical protein
DNA-3-methyladenine glycosylase II (EC 3.2.2.21)
Sodium/glutamate symporter
Isopentenyl-diphosphate delta-isomerase, FMN-dependent (EC 5.3.3.2)
Magnesium and cobalt transport protein CorA
3-hydroxyacyl-CoA dehydrogenase (EC 1.1.1.35)
FIG01107838: hypothetical protein
putative esterase
FIG01108339: hypothetical protein
Membrane component of multidrug resistance system
Multidrug resistance protein [function not yet clear]
TetR family regulatory protein of MDR cluster
Teicoplanin resistance associated membrane protein TcaB
Teicoplanin resistance associated membrane protein TcaA
Teicoplanin-resistance associated HTH-type transcriptional regulator TcaR
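One possible way to clean up a list like the one above before comparing it, sketched with sed. This rests on an assumed reading of the requirement: the "(EC ...)" suffixes and "FIG...:" prefixes are dropped entirely, and every remaining digit or special character is replaced by a space. The function name normalize is hypothetical, and the rules would need adjusting if the EC and FIG identifiers should be kept instead.

```shell
# normalize: read gene descriptions on stdin, write cleaned names to stdout.
normalize() {
    sed -E \
        -e 's/\(EC [0-9.-]+\)//g' \
        -e 's/^FIG[0-9]+:[[:space:]]*//' \
        -e 's/[^[:alpha:][:space:]]+/ /g' \
        -e 's/[[:space:]]+/ /g' \
        -e 's/^ //; s/ $//'
    # 1: drop "(EC 3.5.3.8)"-style annotations (assumption: not wanted)
    # 2: drop a leading "FIG01108370:"-style identifier (assumption: not wanted)
    # 3: replace runs of digits and special characters with a space
    # 4: collapse repeated whitespace
    # 5: trim a leading or trailing space
}

cleaned=$(printf '%s\n' \
    'Formiminoglutamase (EC 3.5.3.8)' \
    'FIG01108370: hypothetical protein' \
    'Ribose 5-phosphate isomerase A (EC 5.3.1.6)' | normalize)
printf '%s\n' "$cleaned"
```

Stripping this aggressively makes many distinct entries collapse to the same text (every FIG entry becomes "hypothetical protein"), so the number of common lines found afterwards should be interpreted with care.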


Did you try rbagnall's solution? It should work, or get you pretty close to what you want.

8.9 years ago
rbagnall ★ 1.8k
grep -wFf file1 file2

Grep A Pattern From File
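A short illustration of what the flags in that command do, using made-up sample lists (the contents of file1 and file2 here are placeholders): -f file1 reads one pattern per line from file1, -F treats each pattern as a fixed string so that characters such as "(", ".", "+" and "[" in gene descriptions are not interpreted as regular expression syntax, and -w requires the match to fall on word boundaries.

```shell
# Hypothetical sample lists standing in for file1 and file2.
workdir=$(mktemp -d)
printf '%s\n' 'putative esterase' 'Sodium/glutamate symporter' \
    > "$workdir/file1"
printf '%s\n' 'Sodium/glutamate symporter' \
    'Magnesium and cobalt transport protein CorA' \
    'putative esterase' > "$workdir/file2"

# Print the lines of file2 that contain an entry from file1.
matches=$(grep -wFf "$workdir/file1" "$workdir/file2")
printf '%s\n' "$matches"

# -c prints the number of matching lines instead of the lines themselves.
count=$(grep -c -wFf "$workdir/file1" "$workdir/file2")
printf 'common entries: %s\n' "$count"
```

The output follows the order of file2, and no sorting is needed beforehand.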


I think for a list of gene names like his, grep -xFf file1 file2 is better.
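A small sketch of why -x can matter here: -w accepts a match anywhere in the line as long as it sits on word boundaries, while -x requires the whole line to equal the pattern. The entry "putative esterase/lipase" below is invented purely to show the difference.

```shell
# Hypothetical lists: file2 holds one exact match and one longer description.
workdir=$(mktemp -d)
printf '%s\n' 'putative esterase' > "$workdir/file1"
printf '%s\n' 'putative esterase' 'putative esterase/lipase' > "$workdir/file2"

# -w: both lines of file2 count, because "putative esterase" appears in each
# as a run of whole words ("/" is not a word character).
word_hits=$(grep -c -wFf "$workdir/file1" "$workdir/file2")

# -x: only the line that is identical to the pattern counts.
line_hits=$(grep -c -xFf "$workdir/file1" "$workdir/file2")

printf 'with -w: %s, with -x: %s\n' "$word_hits" "$line_hits"
```

With -x, trailing spaces or Windows line endings in either file will prevent a match, so it may be worth checking the files for those first.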

