Question: counting number of genes located on positive or negative strands
arsilan324 wrote (4 months ago):

Hi all,

This may be a very basic question, but I am not a bioinformatics person. I just need to count the total number of genes located on the positive and negative strands of a chromosome in a .genes or .gff3 file.

Can you please tell me how I can do that? Thanks in advance.

Alex Reynolds wrote (4 months ago):

Via BEDOPS, extract gene features from your GFF file and convert to BED:

$ awk '$3=="gene"' annotations.gff | gff2bed - > genes.bed

Then filter on strand and pass to wc -l to count the number of lines:

$ awk '$6=="+"' genes.bed | wc -l
*number of forward-strand genes*


$ awk '$6=="-"' genes.bed | wc -l
*number of reverse-strand genes*

Converting to BED is not strictly required, but it makes sense if you're doing counting operations, which are essentially set operations. BED is a better format for set operations than GFF.
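As an illustration of counting on the BED file, both strands can be tallied in one pipeline instead of two separate awk calls. This is only a sketch: the temporary file below is a hypothetical stand-in for genes.bed, and strand_tally is a helper name made up for this example.

```shell
# Hypothetical stand-in for genes.bed: tab-separated, strand in column 6
bed=$(mktemp)
{
  printf 'chr1\t0\t100\tg1\t.\t+\n'
  printf 'chr1\t199\t300\tg2\t.\t-\n'
  printf 'chr1\t399\t500\tg3\t.\t+\n'
} > "$bed"

# Print one line per strand symbol, prefixed by how often it occurs
strand_tally() {
  cut -f6 "$1" | sort | uniq -c
}

strand_tally "$bed"
```

With a real file, strand_tally genes.bed should report the same two numbers as the awk commands above.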


Why bother converting the gff3 file to bed when the strand information is already present in the gff3 file?

awk '$3=="gene" && $7=="+"' annotations.gff | wc -l
awk '$3=="gene" && $7=="-"' annotations.gff | wc -l

Also note that if you are using GFF3 files from NCBI, the feature type in column 3 is not always gene. For example, pseudogenes have pseudogene in column 3. Changing the awk condition to ($3=="gene"||$3=="pseudogene") will fix that.

Reply written 4 months ago by vkkodali

Thank you to both of you! It worked. Thanks!

Reply written 4 months ago by arsilan324
mike-zx wrote (4 months ago):

Here is a bash solution as well:

#First, let's define a couple of variables to act as counters for each strand ( + & - )

forward=0
reverse=0

#Now we loop over the lines of interest from the gff3 file, reading one whole line at a time
#(a "for line in $(cat ...)" loop would split on every space and tab and count words instead of lines)

while IFS= read -r line; do
  forward=$((forward + 1)) #this adds 1 to the counter for each line found by grep
done < <(grep '.*gene.*[[:space:]]+[[:space:]]' file.gff3)

#With grep's default basic regular expressions "+" is a literal character; escape it as "\+" only if you switch to grep -E

#Same process but for the reverse strand

while IFS= read -r line; do
  reverse=$((reverse + 1))
done < <(grep '.*gene.*[[:space:]]-[[:space:]]' file.gff3)

#Finally we print the results

echo -e "\nGenes present in the forward (positive) strand: $forward"
echo -e "Genes present in the reverse (negative) strand: $reverse\n"

Hope this helps.
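The same patterns can be counted without a loop, since grep -c prints the number of matching lines. The sketch below runs on a hypothetical sample file in place of file.gff3. Note that the pattern matches any line with "gene" somewhere before the strand column, so pseudogene lines are counted as well.

```shell
# Hypothetical sample standing in for file.gff3 (tab-separated, 9 columns)
sample=$(mktemp)
{
  printf 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n'
  printf 'chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1\n'
  printf 'chr1\tsrc\tgene\t200\t300\t.\t-\t.\tID=g2\n'
  printf 'chr1\tsrc\tpseudogene\t400\t500\t.\t+\t.\tID=g3\n'
} > "$sample"

# grep -c reports how many lines match, so no counter variable is needed
forward=$(grep -c '.*gene.*[[:space:]]+[[:space:]]' "$sample")
reverse=$(grep -c '.*gene.*[[:space:]]-[[:space:]]' "$sample")

echo "Genes present in the forward (positive) strand: $forward"
echo "Genes present in the reverse (negative) strand: $reverse"
```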
