Converting from BED to SAF/GFF
3
3
Entering edit mode
5.9 years ago
ccag ▴ 60

I called peaks using MACS2 and now would like to use my narrowPeak file to do featureCounts. As far as I can tell, featureCOunts only uses SAF or GFF formats. Does anyone know an easy way to convert between a bed and these file types for Dm3?

bed gff saf featureCount • 10k views
ADD COMMENT
9
Entering edit mode
4.5 years ago
ATpoint 66k
awk 'OFS="\t" {print $1"."$2"."$3, $1, $2, $3, "."}' in.narrowPeak > out.saf

The first column is the identifier. This is then either any gene name or in case of genomic regions simply the concatenated genomic coordinates, e.g. separated by dot, such as chr1.100.1000000. For genomic regions I always set strand to ., for genes one can set the strand where the gene is located.

ADD COMMENT
0
Entering edit mode

I am wondering whether $2 should +1 because the narrowPeak is 0-based. But Actually I rarely find the description about SAF coordinate system :).

ADD REPLY
2
Entering edit mode

In 6.2.2. of https://bioconductor.org/packages/release/bioc/vignettes/Rsubread/inst/doc/SubreadUsersGuide.pdf it says that SAF format is inclusive for both start and end coordinate. I interpret this as "leave narrowPeak as it is". I noted that if you use e.g. bedtools makeWindows to make windows across the whole genome, and if you then transform to SAF and add 1 to start you will not have 100% reads assigned to that SAF but only 99.x% because that one start nucleotide is missing. WIthout modification to that BED you assign 100%. Not sure if this is a correct statement, I am always a bit confused with these different coordinate systems. I also never understood why the featureCounts developers could not simply use BED format instead of this SAF format.

ADD REPLY
0
Entering edit mode

Thanks for your reply :). I will see it.

ADD REPLY
1
Entering edit mode

This SAF is 1-based coordinate system, yes, it should add 1. I have checked the example gene listed in the featureCounts manual.

ADD REPLY
0
Entering edit mode
5.9 years ago
Jeffin Rockey ★ 1.2k

Option 1 :GenomeTools

Option 2: Go to galaxy oqtans page here and on the left pane there is "GFF Toolkit" which has BED_to_GFF3 converter .

Option 3: A long route. From here make use of bedToGenePred followed by genePredToGtf to get a gtf file at first. Then you can make use of the lot of tools that does a gtf to gff3 conversion.

I am mentioning multiple options because depending on whether you bed file is 6 column or 12 column or so, some may not be applicable.

ADD COMMENT
0
Entering edit mode
24 days ago
mt1022 ▴ 270

I believe that SAF format use 1-based coordinates that are closed on both ends. Here is how I got this conclusion.

First make some toy data.

$ cat genome.fa
>chr1
AATTCCGGAAAATTTTCCCCGGGGAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCC

$ cat reads.fa
>q1
AAAATTTTCCCCGGGGAAAAAAAAAAAAAAAAAACC

Map reads to the genome:

$ STAR --runMode genomeGenerate --genomeDir test_star --genomeFastaFiles genome.fa --genomeSAindexNbases 2
$ STAR --genomeDir test_star --readFilesIn reads.fa --outFileNamePrefix reads_

$ cat reads_Aligned.out.sam
@HD VN:1.4
@SQ SN:chr1 LN:70
@PG ID:STAR PN:STAR VN:2.7.10a  CL:STAR   --genomeDir test_star   --readFilesIn reads.fa      --outFileNamePrefix reads_
@CO user command line: STAR --genomeDir test_star --readFilesIn reads.fa --outFileNamePrefix reads_
q1  0   chr1    9   255 36M *   0   0   AAAATTTTCCCCGGGGAAAAAAAAAAAAAAAAAACC    *   NH:i:1  HI:i:1  AS:i:35 nM:i:0

The read was mapped to 9-44 nucleotides of chr1:

$ bedtools bamtobed -i reads_Aligned.out.sam
chr1    8   44  q1  255 +

Now we test whether several SAF intervals have overlapping reads.

$ featureCounts -s1 -a f1.saf -F SAF -o cnts1.txt reads_Aligned.out.sam
$ featureCounts -s1 -a f2.saf -F SAF -o cnts2.txt reads_Aligned.out.sam
$ featureCounts -s1 -a f3.saf -F SAF -o cnts3.txt reads_Aligned.out.sam
$ featureCounts -s1 -a f4.saf -F SAF -o cnts4.txt reads_Aligned.out.sam

$ cat f1.saf cnts1.txt
r1  chr1    5   8   +
# Program:featureCounts v2.0.1; Command:"featureCounts" "-s1" "-a" "f1.saf" "-F" "SAF" "-o" "cnts1.txt" "reads_Aligned.out.sam"
Geneid  Chr Start   End Strand  Length  reads_Aligned.out.sam
r1  chr1    5   8   +   4   0

$ cat f2.saf cnts2.txt
r1  chr1    5   9   +
# Program:featureCounts v2.0.1; Command:"featureCounts" "-s1" "-a" "f2.saf" "-F" "SAF" "-o" "cnts2.txt" "reads_Aligned.out.sam"
Geneid  Chr Start   End Strand  Length  reads_Aligned.out.sam
r1  chr1    5   9   +   5   1

$ cat f3.saf cnts3.txt
r1  chr1    44  50  +
# Program:featureCounts v2.0.1; Command:"featureCounts" "-s1" "-a" "f3.saf" "-F" "SAF" "-o" "cnts3.txt" "reads_Aligned.out.sam"
Geneid  Chr Start   End Strand  Length  reads_Aligned.out.sam
r1  chr1    44  50  +   7   1

$ cat f4.saf cnts4.txt
r1  chr1    45  50  +
# Program:featureCounts v2.0.1; Command:"featureCounts" "-s1" "-a" "f4.saf" "-F" "SAF" "-o" "cnts4.txt" "reads_Aligned.out.sam"
Geneid  Chr Start   End Strand  Length  reads_Aligned.out.sam
r1  chr1    45  50  +   6   0

We can see that the SAF 5-9 and SAF 44-50 have overlapping reads while SAF 5-8 and SAF 45-50 not. Therefore, the start and end in SAF are both 1-based and inclusive as in GTF format. Therefore, when converting bed to SAF, the bed start coordinates should plus 1 and bed end coordinates need no change.

ADD COMMENT

Login before adding your answer.

Traffic: 1569 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6