Question: How can I select my ChIP-seq target genes in my RNA-seq data?
6 weeks ago, LuisNagano (University of Campinas) wrote:

Hi, I need some help! I have a list of target genes from my ChIP-seq analysis (HOMER output, gene symbols) and a list of differentially expressed genes from RNA-seq (Cuffdiff output). I want to find my target genes in the Cuffdiff file, that is, select each complete line where a target gene appears, so that I can get the "XLOC_" gene ids that Cufflinks assigned to them. That way I can plot a heatmap of these genes using the RNA-seq data. How can I do this in a simple way?

->Annotated genes (HOMER output)

  • SULT1B1
  • ZMYND8
  • RAD23B ...

->Cuffdiff table file (RNA-seq)

  gene_id      gene symbol  locus
  XLOC_000001  DDX11L1      chr1:11868-31109
  XLOC_000002  MIR1302-2    chr1:11868-31109
  XLOC_000003  OR4G4P       chr1:52472-53312
  ...

Hello LuisNagano!

It appears that your post has been cross-posted to another site:

This is typically not recommended as it runs the risk of annoying people in both communities.

written 6 weeks ago by WouterDeCoster

What do you mean by "in a simple way"? Do you know how to use R, for instance?

In R this is simple: your HOMER output appears to be gene symbols, and I recognize the same symbols in the second column of your cuffdiff table.

So import both files into R and use, for example, %in% to keep the cuffdiff rows whose gene symbol is in the HOMER list.

written 6 weeks ago by b.nota
6 weeks ago, Petr Ponomarenko (United States / Los Angeles) wrote:

It looks like the target genes in your ChIP-seq (HOMER) output and the gene symbols in your Cuffdiff table both use HGNC notation, so they should match exactly.

To select all lines of the cuffdiff output whose second column contains a gene name from the HOMER list, you can run:

awk 'NR==FNR{HOMER_LIST[$1]=$1}(NR!=FNR&&HOMER_LIST[$2]){print $0}' homer_output_file cuffdiff_output_file

NR==FNR is true only while awk reads the first file, so every gene name in that file is stored in memory in the array HOMER_LIST. NR!=FNR is true while awk reads the second file. Combined with &&, the rule also tests whether the second-column value $2 of each row in the second file is present in HOMER_LIST. As a result, awk prints to stdout every complete line $0 of the cuffdiff file whose gene name also appears in the target gene file.
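
As a rough sketch of the same idea on toy data: the file names below (homer_genes.txt, cuffdiff_genes.txt, matched_genes.txt) are hypothetical, the rows are copied from the example in the question, and OR4G4P is added to the target list only so that something matches. This variant assumes tab-separated input, tests membership with awk's "in" operator (so a lookup does not create empty array entries), and keeps the header row.

```shell
# Hypothetical target list: the three symbols from the question plus
# OR4G4P, added here only so that the toy example produces a match.
printf '%s\n' SULT1B1 ZMYND8 RAD23B OR4G4P > homer_genes.txt

# Hypothetical tab-separated table built from the rows in the question.
printf '%s\t%s\t%s\n' \
  gene_id gene locus \
  XLOC_000001 DDX11L1 chr1:11868-31109 \
  XLOC_000002 MIR1302-2 chr1:11868-31109 \
  XLOC_000003 OR4G4P chr1:52472-53312 > cuffdiff_genes.txt

# First file: remember each symbol as an array key, then skip to the next line.
# Second file: print the header (FNR==1) and every row whose column 2 is a key.
awk -F'\t' 'NR==FNR { targets[$1]; next } FNR==1 || ($2 in targets)' \
  homer_genes.txt cuffdiff_genes.txt > matched_genes.txt

cat matched_genes.txt
```

With these toy files, matched_genes.txt should contain the header line plus the XLOC_000003 row, which can then be used to pull the matching rows for the heatmap.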


Thanks Petr, works very well!

written 6 weeks ago by LuisNagano

If this answer was helpful, it is appropriate to upvote it. If it resolved your question completely, you can 'accept' the answer, which marks your question as solved.

written 6 weeks ago by WouterDeCoster

How do I mark it as solved?

written 6 weeks ago by LuisNagano

It seems you (or someone else) already did that. Marking as solved is done by accepting the answer.

written 6 weeks ago by WouterDeCoster