genePredToGtf wrong output
1
0
Entering edit mode
3.8 years ago

Hi all,

I am trying to analyze the deposition of a histone modification in ChIP-seq. So I got the union peak bed file, it looks like this:

chr1    5162780 5163619
chr1    6205640 6214319
chr1    6215300 6215579
chr1    6217540 6218379

to count the fragments on the peaks, I want to use featureCounts. FeatureCounts only accept the annotation file as gtf. I added the 4th column to bed file so that it can pass bedToGenePred

  $ awk '{$4="." print $0}' union_peaks.bed > union_peaks_1.bed

I get

chr1    5162780 5163619    .
chr1    6205640 6214319    .
chr1    6215300 6215579    .
chr1    6217540 6218379    .

Then I convert the bed file to gtf by

$ bedToGenePred union_peaks_1.bed stdout | genePredToGtf file stdin union_peaks.gtf

But when I check the column number of this gtf file

$ awk '{print NF}' union_peaks.gtf

it shows 16.

Then I found out the features are created as separate columns instead of one column 9, for instance, column 9 is gene_id, column 10 is "._5", column 11 is transcript_id, and so on and so forth.

chr1    stdin   exon    4823701 4824259 .       +       .       gene_id "._5"; transcript_id "._5"; exon_number "1"; exon_id "._5.1";
chr1    stdin   CDS     4823701 4824259 .       +       0       gene_id "._5"; transcript_id "._5"; exon_number "1"; exon_id "._5.1";
chr1    stdin   start_codon     4823701 4823703 .       +       0       gene_id "._5"; transcript_id "._5"; exon_number "1"; exon_id "._5.1";
chr1    stdin   transcript      4829161 4831959 .       +       .       gene_id "._6"; transcript_id "._6"; 
chr1    stdin   exon    4829161 4831959 .       +       .       gene_id "._6"; transcript_id "._6"; exon_number "1"; exon_id "._6.1";
chr1    stdin   CDS     4829161 4831956 .       +       0       gene_id "._6"; transcript_id "._6"; exon_number "1"; exon_id "._6.1";
chr1    stdin   start_codon     4829161 4829163 .       +       0       gene_id "._6"; transcript_id "._6"; exon_number "1"; exon_id "._6.1";
chr1    stdin   stop_codon      4831957 4831959 .       +       0       gene_id "._6"; transcript_id "._6"; exon_number "1"; exon_id "._6.1";
chr1    stdin   transcript      4857301 4891039 .       +       .       gene_id "._7"; transcript_id "._7";

Can you help me find out what is wrong here, and how to solve the problem?

ChIP-Seq • 1.0k views
ADD COMMENT
1
Entering edit mode
3.8 years ago
h.mon 35k

featureCounts accepts annotation in a format called simple annoation format, or SAF. From the union_peaks_1.bed, you can get a SAF file simply adding a column with "peak names" before the other columns:

awk '{print "peak_"NR"\t"$0}' union_peaks_1.bed > union_peaks_1.saf

edit: check the featureCounts documentation for more information about the SAF format.

Also, check Converting from BED to SAF/GFF

ADD COMMENT
1
Entering edit mode

Thank you for your answer, h.mon. It saved a lot of time for me! It is smart to use SAF instead of GTF, and it works now!

I just want to add for other people that featureCounts has very stringent requirement to SAF feature (first column), first column with pure number, "peak_NR", "peak.NR", were all not accepted, so

awk 'OFS="\t" {print $1"."$2"."$3, $1, $2, $3, "."}'  union_peaks.merge.bed > union_peaks.merge.saf

would do the job in the end.

Many thanks again!

ADD REPLY

Login before adding your answer.

Traffic: 2530 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6