Question: How can I select my ChIP-seq target genes in my RNA-seq data?
3.6 years ago by
University of Campinas
LuisNagano40 wrote:

Hi, I need some help! I want to find the target genes from my ChIP-seq analysis (HOMER output, gene symbols) in my list of differentially expressed genes from RNA-seq (Cuffdiff output). The goal is to get the "XLOC_" gene IDs that Cufflinks assigned to my target genes, i.e. to pull out the complete line of the Cuffdiff file for each target gene, so that I can plot a heatmap of these genes using the RNA-seq data. How can I do this in a simple way?

->Annotated genes (HOMER output)

  • SULT1B1
  • ZMYND8
  • RAD23B ...

->Cuffdiff table file (RNA-seq)

  gene_id      gene symbol  locus
  XLOC_000001  DDX11L1      chr1:11868-31109
  XLOC_000002  MIR1302-2    chr1:11868-31109
  XLOC_000003  OR4G4P       chr1:52472-53312
  ...
modified 20 months ago by Biostar ♦♦ 20 • written 3.6 years ago by LuisNagano40

Hello LuisNagano!

It appears that your post has been cross-posted to another site:

This is typically not recommended as it runs the risk of annoying people in both communities.

written 3.6 years ago by WouterDeCoster44k

What do you mean by "in a simple way"? Do you know how to use R, for instance?

With R this is simple: your HOMER output seems to be gene symbols, which I recognize in the second column of your Cuffdiff table.

So import both files into R and use, for example, %in%.

written 3.6 years ago by Benn8.0k
3.6 years ago by
United States / Los Angeles /
Petr Ponomarenko2.6k wrote:

It looks like the target genes from your ChIP-seq (HOMER output) and the gene symbols in your Cuffdiff table are both in HGNC notation, so they should match exactly.

To select every line of the Cuffdiff output whose second column contains a gene name from the HOMER list, you can run:

awk 'NR==FNR{HOMER_LIST[$1]=$1}(NR!=FNR&&HOMER_LIST[$2]){print $0}' homer_output_file cuffdiff_output_file

NR is the total number of records read so far and FNR is the record number within the current file, so NR==FNR is true only while awk reads the first file. During that pass, every gene name from the first file is stored in memory in the array HOMER_LIST. NR!=FNR is true while awk reads the second file. Combined with &&, the condition also tests whether the second-column element $2 of each row of the second file can be found in HOMER_LIST. As a result, awk prints to stdout the complete line $0 of every Cuffdiff row whose gene name appears in the target gene file.
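As a rough sketch of the same two-file idea on toy data, the variant below uses awk's `in` operator for the lookup and also keeps the header row. It assumes the Cuffdiff table is tab-delimited and has a header line. The file names, the ZMYND8 row, its XLOC ID and its locus are all made up for illustration and are not from the original post.

```shell
# Hypothetical HOMER gene list, one symbol per line.
printf 'SULT1B1\nZMYND8\nRAD23B\n' > homer_genes.txt

# Hypothetical tab-delimited Cuffdiff-style table with a header row.
# The last row (ID and locus) is invented so that one gene matches.
printf 'gene_id\tgene\tlocus\n' > cuffdiff_table.txt
printf 'XLOC_000001\tDDX11L1\tchr1:11868-31109\n' >> cuffdiff_table.txt
printf 'XLOC_000002\tMIR1302-2\tchr1:11868-31109\n' >> cuffdiff_table.txt
printf 'XLOC_900001\tZMYND8\tchrN:1-100\n' >> cuffdiff_table.txt

# First file: remember each symbol as an array key, then skip to the next line.
# Second file: print the header (FNR==1) and any row whose 2nd column is a stored key.
awk -F'\t' 'NR==FNR { targets[$1]; next } FNR==1 || ($2 in targets)' \
    homer_genes.txt cuffdiff_table.txt > target_rows.txt

cat target_rows.txt
```

Using `$2 in targets` only checks whether the key exists, so the lookup itself does not add empty entries to the array, and `next` keeps the second rule from running on the first file. If the real files use spaces rather than tabs, dropping `-F'\t'` falls back to awk's default whitespace splitting.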

written 3.6 years ago by Petr Ponomarenko2.6k

Thanks Petr, works very well!

written 3.6 years ago by LuisNagano40

If this answer was helpful, it is appropriate to upvote it. If it resolved your question completely, you can 'accept' the answer, which marks your question as solved.

written 3.6 years ago by WouterDeCoster44k

How do I mark it as solved?

written 3.6 years ago by LuisNagano40

It seems you (or someone else) already did that. Marking as solved is done by accepting the answer.

written 3.6 years ago by WouterDeCoster44k