How can I select my ChIP-seq genes (targets) in my RNA-seq data?
4.7 years ago
LuisNagano ▴ 90

Hi, I need some help! I want to find the target genes from my ChIP-seq analysis (HOMER output, gene symbols) in my list of differentially expressed genes from RNA-seq (Cuffdiff output). The goal is to get the “XLOC_” gene ids that Cufflinks assigned to my target genes, i.e. to pull out the complete line of the Cuffdiff file for each target gene, so that I can plot a heatmap of these genes from the RNA-seq data. How can I do this in a simple way?

->Annotated genes (HOMER output)

• SULT1B1
• LHFPL
• ZMYND8

->Cuffdiff table file (RNA-seq)

• gene_id gene symbol locus
• XLOC_000001 DDX11L1 chr1:11868-31109
• XLOC_000002 MIR1302-2 chr1:11868-31109
• XLOC_000003 OR4G4P chr1:52472-53312 ...
RNA-Seq ChIP-Seq Cufflinks Cuffdiff • 1.7k views
Hello LuisNagano!

What do you mean by "in a simple way"? Do you know how to use R, for instance?

With R this is simple: your HOMER output seems to be gene symbols, which I recognize in the second column of your Cuffdiff table.

So import both files into R and use, for example, the %in% operator to keep the rows whose gene symbol is in your HOMER list.

4.7 years ago

It looks like both your ChIP-seq target genes (HOMER output) and the gene symbols in your Cuffdiff table use HGNC notation (http://www.genenames.org/), so they should match exactly.
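Since the filtering relies on exact string matches, it can be worth checking first which target symbols are absent from the table. A minimal sketch, assuming one symbol per line in the HOMER list and a tab-separated table with the symbol in column 2; the file names and the ZMYND8 row (id and locus) are invented for illustration:

```shell
# Toy inputs; the XLOC_999999 row is made up so that one symbol matches.
printf '%s\n' SULT1B1 LHFPL ZMYND8 > homer_check.txt
printf '%s\t%s\t%s\n' \
  XLOC_000001 DDX11L1 chr1:11868-31109 \
  XLOC_999999 ZMYND8 chrN:1-100 > cuffdiff_check.txt

# Read the table first and remember every symbol in column 2,
# then print each HOMER symbol that was never seen there.
awk -F'\t' 'NR==FNR { seen[$2]; next } !($1 in seen)' \
  cuffdiff_check.txt homer_check.txt > unmatched.txt

cat unmatched.txt
```

Any symbol listed in unmatched.txt may be written differently in the two files (alias, case, trailing whitespace) or may simply be missing from the RNA-seq table.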

To select every line of the cuffdiff output whose second column contains a gene name from the HOMER list, you can run:

awk 'NR==FNR{HOMER_LIST[$1]=$1}(NR!=FNR&&HOMER_LIST[$2]){print$0}' homer_output_file cuffdiff_output_file


NR==FNR is true only while awk reads the first file, so every gene name from that file is stored in memory in the array HOMER_LIST. NR!=FNR is true while awk reads the second file; combined with &&, the condition also tests whether the second-column element $2 of each row is present in HOMER_LIST. As a result, awk prints to stdout the complete line $0 of every row in the cuffdiff file whose gene name appears in the target gene file.
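As a worked example of the command above, here is a sketch on toy files built from the rows in the question. The file names are placeholders, and the ZMYND8 row (its XLOC id and locus) is invented so that at least one gene matches. The sketch uses an equivalent form of the test ($2 in targets), splits on tabs, and also keeps the header line; those are my choices, not part of the original command.

```shell
# Toy inputs; the XLOC_999999 row is made up for illustration only.
printf '%s\n' SULT1B1 LHFPL ZMYND8 > homer_genes.txt
printf '%s\t%s\t%s\n' \
  gene_id gene_symbol locus \
  XLOC_000001 DDX11L1 chr1:11868-31109 \
  XLOC_000002 MIR1302-2 chr1:11868-31109 \
  XLOC_000003 OR4G4P chr1:52472-53312 \
  XLOC_999999 ZMYND8 chrN:1-100 > cuffdiff_table.txt

# First file: remember every target symbol, then skip to the next line.
# Second file: keep the header (FNR==1) and every row whose second
# column is one of the remembered symbols.
awk -F'\t' 'NR==FNR { targets[$1]; next } FNR==1 || ($2 in targets)' \
  homer_genes.txt cuffdiff_table.txt > target_rows.txt

cat target_rows.txt
```

The resulting target_rows.txt holds the full lines, including the "XLOC_" ids, and can be read into whatever tool draws the heatmap.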

