Question: genePredToGtf wrong output
0
gravatar for Hangthelaundry
7 days ago by
Hangthelaundry10 wrote:

Hi all,

I am trying to analyze the deposition of a histone modification in ChIP-seq. So I got the union peak bed file, it looks like this:

chr1    5162780 5163619
chr1    6205640 6214319
chr1    6215300 6215579
chr1    6217540 6218379

to count the fragments on the peaks, I want to use featureCounts. FeatureCounts only accept the annotation file as gtf. I added the 4th column to bed file so that it can pass bedToGenePred

  $ awk '{$4="." print $0}' union_peaks.bed > union_peaks_1.bed

I get

chr1    5162780 5163619    .
chr1    6205640 6214319    .
chr1    6215300 6215579    .
chr1    6217540 6218379    .

Then I convert the bed file to gtf by

$ bedToGenePred union_peaks_1.bed stdout | genePredToGtf file stdin union_peaks.gtf

But when I check the column number of this gtf file

$ awk '{print NF}' union_peaks.gtf

it shows 16.

Then I found out the features are created as separate columns instead of one column 9, for instance, column 9 is gene_id, column 10 is "._5", column 11 is transcript_id, and so on and so forth.

chr1    stdin   exon    4823701 4824259 .       +       .       gene_id "._5"; transcript_id "._5"; exon_number "1"; exon_id "._5.1";
chr1    stdin   CDS     4823701 4824259 .       +       0       gene_id "._5"; transcript_id "._5"; exon_number "1"; exon_id "._5.1";
chr1    stdin   start_codon     4823701 4823703 .       +       0       gene_id "._5"; transcript_id "._5"; exon_number "1"; exon_id "._5.1";
chr1    stdin   transcript      4829161 4831959 .       +       .       gene_id "._6"; transcript_id "._6"; 
chr1    stdin   exon    4829161 4831959 .       +       .       gene_id "._6"; transcript_id "._6"; exon_number "1"; exon_id "._6.1";
chr1    stdin   CDS     4829161 4831956 .       +       0       gene_id "._6"; transcript_id "._6"; exon_number "1"; exon_id "._6.1";
chr1    stdin   start_codon     4829161 4829163 .       +       0       gene_id "._6"; transcript_id "._6"; exon_number "1"; exon_id "._6.1";
chr1    stdin   stop_codon      4831957 4831959 .       +       0       gene_id "._6"; transcript_id "._6"; exon_number "1"; exon_id "._6.1";
chr1    stdin   transcript      4857301 4891039 .       +       .       gene_id "._7"; transcript_id "._7";

Can you help me find out what is wrong here, and how to solve the problem?

chip-seq • 60 views
ADD COMMENTlink modified 7 days ago by h.mon30k • written 7 days ago by Hangthelaundry10
1
gravatar for h.mon
7 days ago by
h.mon30k
Brazil
h.mon30k wrote:

featureCounts accepts annotation in a format called simple annoation format, or SAF. From the union_peaks_1.bed, you can get a SAF file simply adding a column with "peak names" before the other columns:

awk '{print "peak_"NR"\t"$0}' union_peaks_1.bed > union_peaks_1.saf

edit: check the featureCounts documentation for more information about the SAF format.

Also, check Converting from BED to SAF/GFF

ADD COMMENTlink modified 7 days ago • written 7 days ago by h.mon30k
1

Thank you for your answer, h.mon. It saved a lot of time for me! It is smart to use SAF instead of GTF, and it works now!

I just want to add for other people that featureCounts has very stringent requirement to SAF feature (first column), first column with pure number, "peak_NR", "peak.NR", were all not accepted, so

awk 'OFS="\t" {print $1"."$2"."$3, $1, $2, $3, "."}'  union_peaks.merge.bed > union_peaks.merge.saf

would do the job in the end.

Many thanks again!

ADD REPLYlink modified 6 days ago • written 6 days ago by Hangthelaundry10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1701 users visited in the last hour