Question

How do you add gene annotations from a Diamond Blastx tsv file to a csv file with RNAseq differential expression data?

0

Entering edit mode

2.7 years ago

kristina.mahan ▴ 170

I have a Diamond BLASTx tsv file that has the Gene_Descriptions:

Diamond BLASTx gene annotations

I have a csv file of the RNAseq expression data:

enter image description here

I want to match the gene names (column E) and descriptions from the tsv file using the corresponding IDs from column B in the tsv file to the IDs (column A) in the csv file and insert them into column H of the csv file so that I have everything in one file and I don't have to cross reference. Can anybody point me to some command line scripts that can do this? Thanks in advance!!

BLASTx RNA-seq diamond • 1.8k views

ADD COMMENT • link updated 6 months ago by Ram 43k • written 2.7 years ago by kristina.mahan ▴ 170

0

Entering edit mode

Column E from the .tsv file and column A from the .csv file are nothing alike. Did you mean the respective columns B and A?

I have a python script that can merge the two files according to columns B from the first and A from the second file. That won't be exactly what you wanted, but they will all be in one file and you can later delete or move the columns as needed. Let me know if you want to try that.

ADD REPLY • link 2.7 years ago by Mensur Dlakic ★ 27k

0

Entering edit mode

ok yes! So it will just merge the files but rows will be matched based on the IDs? I'll try it out!

ADD REPLY • link 2.7 years ago by kristina.mahan ▴ 170

score 1 · Answer 1 · 2021-07-30

Assuming your .csv file is called RNAseq.csv and your .tsv file DIAMOND.tsv, and the column names are as you outlined above, this is a python script that will merge them into merged.csv.

import pandas as pd

f1 = pd.read_csv('RNAseq.csv')
f2 = pd.read_csv('DIAMOND.tsv', sep='\t')
f3 = f1.merge(f2, left_on='ID', right_on='Query_Contigs_ID', how='outer')
# The next line will drop the rows that have NAN values, which will be created when one file has an ID that the other doesn't
# Comment out or delete the next two lines if you want to keep all the rows regardless of NAN values
f3.dropna(axis=0, inplace=True)
f3 = f3.reset_index(drop=True)
f3.to_csv('merged.csv', index=False)

score 0 · Answer 2 · 2021-07-30

Both of these tables can be easily imported into any of Python, R or ...

The commands could be as simple as this in python:

df1 = pandas.read_table('/path/to/BLASTx.tsv') df2 = pandas.read_csv('/path/to/RNAseq.csv') pandas.merge(df2 , df1, on='ID') also you should have 'ID' columns have the same name after this operation you will have ALL the columns of both data files for the shared rows of column 'ID', non-shared rows will be dropped. if you want to keep all the rows of the left one or right one you try left join and right join. more details: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html