Question

genePredToGtf wrong output

0

Entering edit mode

3.8 years ago

Hangthelaundry ▴ 10

Hi all,

I am trying to analyze the deposition of a histone modification in ChIP-seq. So I got the union peak bed file, it looks like this:

chr1    5162780 5163619
chr1    6205640 6214319
chr1    6215300 6215579
chr1    6217540 6218379

to count the fragments on the peaks, I want to use featureCounts. FeatureCounts only accept the annotation file as gtf. I added the 4th column to bed file so that it can pass bedToGenePred

  $ awk '{$4="." print $0}' union_peaks.bed > union_peaks_1.bed

I get

chr1    5162780 5163619    .
chr1    6205640 6214319    .
chr1    6215300 6215579    .
chr1    6217540 6218379    .

Then I convert the bed file to gtf by

$ bedToGenePred union_peaks_1.bed stdout | genePredToGtf file stdin union_peaks.gtf

But when I check the column number of this gtf file

$ awk '{print NF}' union_peaks.gtf

it shows 16.

Then I found out the features are created as separate columns instead of one column 9, for instance, column 9 is gene_id, column 10 is "._5", column 11 is transcript_id, and so on and so forth.

chr1    stdin   exon    4823701 4824259 .       +       .       gene_id "._5"; transcript_id "._5"; exon_number "1"; exon_id "._5.1";
chr1    stdin   CDS     4823701 4824259 .       +       0       gene_id "._5"; transcript_id "._5"; exon_number "1"; exon_id "._5.1";
chr1    stdin   start_codon     4823701 4823703 .       +       0       gene_id "._5"; transcript_id "._5"; exon_number "1"; exon_id "._5.1";
chr1    stdin   transcript      4829161 4831959 .       +       .       gene_id "._6"; transcript_id "._6"; 
chr1    stdin   exon    4829161 4831959 .       +       .       gene_id "._6"; transcript_id "._6"; exon_number "1"; exon_id "._6.1";
chr1    stdin   CDS     4829161 4831956 .       +       0       gene_id "._6"; transcript_id "._6"; exon_number "1"; exon_id "._6.1";
chr1    stdin   start_codon     4829161 4829163 .       +       0       gene_id "._6"; transcript_id "._6"; exon_number "1"; exon_id "._6.1";
chr1    stdin   stop_codon      4831957 4831959 .       +       0       gene_id "._6"; transcript_id "._6"; exon_number "1"; exon_id "._6.1";
chr1    stdin   transcript      4857301 4891039 .       +       .       gene_id "._7"; transcript_id "._7";

Can you help me find out what is wrong here, and how to solve the problem?

ChIP-Seq • 1.0k views

ADD COMMENT • link updated 3.8 years ago by h.mon 35k • written 3.8 years ago by Hangthelaundry ▴ 10

score 1 · Answer 1 · 2020-06-30

1

Entering edit mode

3.8 years ago

h.mon 35k

featureCounts accepts annotation in a format called simple annoation format, or SAF. From the union_peaks_1.bed, you can get a SAF file simply adding a column with "peak names" before the other columns:

awk '{print "peak_"NR"\t"$0}' union_peaks_1.bed > union_peaks_1.saf

edit: check the featureCounts documentation for more information about the SAF format.

Also, check Converting from BED to SAF/GFF

ADD COMMENT • link 3.8 years ago by h.mon 35k

1

Entering edit mode

Thank you for your answer, h.mon. It saved a lot of time for me! It is smart to use SAF instead of GTF, and it works now!

I just want to add for other people that featureCounts has very stringent requirement to SAF feature (first column), first column with pure number, "peak_NR", "peak.NR", were all not accepted, so

awk 'OFS="\t" {print $1"."$2"."$3, $1, $2, $3, "."}'  union_peaks.merge.bed > union_peaks.merge.saf

would do the job in the end.

Many thanks again!

ADD REPLY • link 3.8 years ago by Hangthelaundry ▴ 10