Question: Add taxonomy information of fileA to species list in fileB
0
gravatar for mariep92
2.2 years ago by
mariep920
mariep920 wrote:

Hello everyone, I am a young scientist who just started to work in the field of metagenomics. I would be very thankful if you might be able to help me with the following problem: I have the following two .csv files (comma-separated):

A working_file:

genus,species,column3,column4,column5,column6,column7
Staphylococcus,aureus,40000,3.0,7.0,6.0,3.0
Neisseria,gonorrhoea,2300,40.0,1.0,3.0,4.4
Vibrio,cholerae,2961,0,47.7,0.0,3.1,0.8
Pseudomonas,aeruginosa,64404,0,66.6,0.0,2.8,8.0

...

A taxonomy_file

domain,phylum,class,order,family,genus,species
Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,aureus
Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,capitis
Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,saprophyticus
Bacteria,Proteobacteria,Gammaproteobacteria,Pseudomonadales,Pseudomonadaceae,Pseudomonas,aeruginosa
Bacteria,Proteobacteria,Gammaproteobacteria,Pseudomonadales,Pseudomonadaceae,Pseudomonas,brassicacearum

...

I would like to have a script (Python, R, Perl or Bash), which loops through the working_file line-by-line. Whenever the entries in column 1 and 2 of the working_file match the content of column 6 and 7 of the taxonomy_file, I want to add the taxonomy information (domain,phylum,class,order,family) as extra columns to the working_file.

Output file

domain,phylum,class,order,family,genus,species,column3,column4,column5,column6,column7
Bacteria,Firmictues,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,aureus,40000,3.0,7.0,6.0,3.0

...

Do you have any idea how to do that? Thank you very much in advance! Marie

taxonomy metagenomics • 611 views
ADD COMMENTlink modified 2.2 years ago by shenwei3565.8k • written 2.2 years ago by mariep920

Hello and welcome to biostars mariemadlen,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLYlink written 2.2 years ago by finswimmer14k

Ah, I was wondering how to do that! This looks much better. Thank you very much indeed!

ADD REPLYlink written 2.2 years ago by mariep920

Do you just want to print out lines where there is a match? Or what should happen if there is no match?

ADD REPLYlink written 2.2 years ago by finswimmer14k

It would be helpful in the case of non-matching lines, if "NA" is printed into the otherwise empty columns. In this case, I know which species are still missing in my taxonomy file. My overall goal is to make the taxonomy file as complete as possible, so that I get an outcome for each species in the end.

ADD REPLYlink written 2.2 years ago by mariep920

Hello !

I am a young scientist who just started to work

No offense at all, but if you just start dealing with bioinformatics I suggest you to learn a text manipulation language as Python or Perl. Everyday you will have these kind of problematics and learning Python or Perl will save you a lot of time. For example see how to read and write in file in Python

Plus, you can take a look at the Unix commands (awk, sed...). I think your question can be solve in one line command in awk

ADD REPLYlink written 2.2 years ago by Bastien Hervé4.9k

Exactly, you are right. I started to learn and focus on Python and Linux/Bash a month ago, but I am still a beginner and it is difficult for me to understand how to tackle such problems. Therefore, I am very thankful to receive your suggestions until I learned enough to solve these issues on my own.

ADD REPLYlink written 2.2 years ago by mariep920

You got 2 lines in your working_file

Vibrio,cholerae,2961,0,47.7,0.0,3.1,0.8
Pseudomonas,aeruginosa,64404,0,66.6,0.0,2.8,8.0

Which have 8 attributes instead of 7

ADD REPLYlink written 2.2 years ago by Bastien Hervé4.9k
4
gravatar for finswimmer
2.2 years ago by
finswimmer14k
Germany
finswimmer14k wrote:

An awk solution:

$ awk -v FS="," -v OFS="," 'FNR==NR { working[$1 $2] = $3","$4","$5","$6","$7; next; } {if(working[$6 $7]) {print $0,working[$6 $7]} else {print $0,"NA","NA","NA","NA","NA"}}' working_file.txt taxonomy_file.txt
  • -v FS="," -v OFS="," defines the field seperator used in input and output file.

  • FNR==NR { working[$1 $2] = $3","$4","$5","$6","$7; next; }: as long as we read lines from the first file (working_file.txt) we create an associative array with the values of column 1 and 2 as key and the other columns as values.

  • {if(working[$6 $7]) {print $0,working[$6 $7]} else {print $0,"NA","NA","NA","NA","NA"}}': if we iterate over the second file (taxonomy_file.txt) we take the 6th and 7th column to create the key name and check if this exists in the array we created before. If so, we print out the whole line and append the values from working_file.txt. If not we fill the missed vallues by NA.

fin swimmer

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by finswimmer14k

A good answer and well explained.

I've tried a join-approach but it'll be way more complicated, due to sorting and key-merging.

ADD REPLYlink written 2.2 years ago by michael.ante3.6k

This was a very, very helpful answer! Due to your explanation, I was able to re-arrange the code in a way that it does exactly what I wanted initially. Thank you very, very much. This is the final code now:

$ awk -v FS="," -v OFS="," 'FNR==NR { taxo[$6 $7] = $1","$2","$3","$4","$5; next; } {if(taxo[$1 $2]) {print taxo[$1 $2],$0} else {print "NA","NA","NA","NA","NA",$0}}' taxonomy_file.txt working_file.txt > output_file.txt
ADD REPLYlink written 2.2 years ago by mariep920
2
gravatar for shenwei356
2.2 years ago by
shenwei3565.8k
China
shenwei3565.8k wrote:

Try csvtk join (left join).

$ csvtk join -k -f genus,species working_file taxonomy_file \
    | csvtk cut -f domain,phylum,class,order,family,genus,species,column3,column4,column5,column6,column7
ADD COMMENTlink written 2.2 years ago by shenwei3565.8k
0
gravatar for Bastien Hervé
2.2 years ago by
Bastien Hervé4.9k
Karolinska Institutet, Sweden
Bastien Hervé4.9k wrote:

A Python solution with comments, using pandas dataframes

#Import pandas
import pandas as pd

#Read each file as dataframe
#sep=',' => attributes separator is ,
#decimal='.' => number separator is .
#header=0 => The first line of the file will be used as column values
df_working = pd.read_csv("working_file.txt", sep=',', decimal=".", header=0)
df_taxonomy = pd.read_csv("taxonomy_file.txt", sep=',', decimal=".", header=0)

#Merge the two dataframes with a left joining, on genus and species and apply the result to df variable
df = df_taxonomy.merge(df_working, left_on=['genus','species'], right_on=['genus','species'], how='left')

#Write dataframe into output file
#sep=',' => attributes separator is ,
#header=True => Column names will be output
#na_rep='NA' => Cells without value will be output as NA
#index=False => Remove the index of the dataframe from the output
df.to_csv("output_file.txt", sep=',', header=True, na_rep='NA', index=False)

I also removed the last attribute of those two lines

Vibrio,cholerae,2961,0,47.7,0.0,3.1,0.8
Pseudomonas,aeruginosa,64404,0,66.6,0.0,2.8,8.0
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Bastien Hervé4.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1636 users visited in the last hour
_