Question: Extract flanking genes around sRNA from gff files
gravatar for CikLa
5.7 years ago by
CikLa90 wrote:


I have a batch of GFF files for bacteria genomes, which have been edited from the original GFF file from NCBI ftp site. In the edited GFF, I have inserted sRNA features. This is an example (dummy example) to illustrate the edited GFF:


##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_016077.1 1 2487765
NC_016077.1 RefSeq  region  1   2487765 .   +   .   ID=id0;Dbxref=taxon:568816;Is_circular=true;gbkey=Src;genome=chromosome;mol_type=genomic DNA;old-name=Acidaminococcus intestinalis RYC-MR95;strain=RyC-MR95
NC_016077.1 RefSeq  CDS 175 1551    .   +   0   ID=cds0;Name=YP_004895395.1;Parent=gene0;Note=COG0593;Dbxref=Genbank:YP_004895395.1,GeneID:11266123;gbkey=CDS;product=chromosomal replication initiator protein;protein_id=YP_004895395.1;transl_table=11
NC_016077.1 RefSeq  CDS 1626    1778    .   +   0   ID=cds1;Name=YP_004895396.1;Parent=gene1;Dbxref=Genbank:YP_004895396.1,GeneID:11264393;gbkey=CDS;product=hypothetical protein;protein_id=YP_004895396.1;transl_table=11
NC_016077.1 RefSeq  sRNA    1799    1825    .   +   0   ID=sRNA1;Name=sRNA1;Parent=gene2;Dbxref=Genbank:YP_004895397.1,GeneID:11264394;gbkey=sRNA;product=sRNA 1;protein_id=YP_004895397.1;transl_table=11
NC_016077.1 RefSeq  CDS 1829    2947    .   +   0   ID=cds2;Name=YP_004895397.1;Parent=gene2;Dbxref=Genbank:YP_004895397.1,GeneID:11264394;gbkey=CDS;product=DNA polymerase III;protein_id=YP_004895397.1;transl_table=11
NC_016077.1 RefSeq  CDS 2953    4101    .   +   0   ID=cds3;Name=YP_004895398.1;Parent=gene3;Dbxref=Genbank:YP_004895398.1,GeneID:11264395;gbkey=CDS;product=recombination protein F;protein_id=YP_004895398.1;transl_table=11


So the bold one is sRNA. I want to extract a gene before and a gene after the sRNA sequence, in this case they are cds1 and cds2. The output should be in tab delimited format so that I could extract the fasta sequence then. In a GFF, there are multiple sRNAs to extract and I have many GFF files to perform this task.

Anyone mind to to help?



srna flanking genes • 2.1k views
ADD COMMENTlink modified 2.9 years ago by Biostar ♦♦ 20 • written 5.7 years ago by CikLa90
gravatar for Alex Reynolds
5.7 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

Use BEDOPS gff2bed to convert GFF to BED, awk to create CDS and sRNA subsets, and BEDOPS closest-features to extract the nearest CDS features to sRNA elements:

$ gff2bed < features.gff > features.bed
$ awk '$8=="CDS"' features.bed > CDS.bed
$ awk '$8=="sRNA"' features.bed > sRNA.bed
$ closest-features --no-ref --delim '\n' sRNA.bed CDS.bed | sort-bed - > nearest_CDS_per_sRNA.bed

Then you can take the BED file nearest_CDS_per_sRNA.bed and convert it to FASTA with bed2fasta or similar approaches.

If you need to filter by distance, add the --dist option, remove the --delim option, and use awk to place a distance threshold on results:

$ closest-features --no-ref --dist sRNA.bed CDS.bed \
    | awk -v threshold=200 '{ \
          n = split($0, a, "|"); \
          for (i = 0; i < n; i++) { \
              if ((i % 2) && (a[i+1] <= threshold)) { \
                  print a[i]"\t"a[i+1]; \
              } \
          } \
      }' - \
    | sort-bed - \
    > thresholded_nearest_CDS_per_sRNA.bed

This can be run directly on the command line (e.g., within a bash shell) or if you plan on doing this frequently, you could write this into a shell script or makefile.

ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by Alex Reynolds31k

Thanks for the help. I will try this one, looks like it might work well.Having said that I have a lot of GFF to be done, I believe it's just to run the script in bash file, am I right?

ADD REPLYlink written 5.7 years ago by CikLa90

Just curious here, if I am not mistaken, it will take the closest feature to the sRNA that we want right? In my case, I want the one upstream and one downstream of the sRNA, meaning that, for an sRNA, it should have 2 flanking genes.

Another thing is, what if I want to restrict the extraction to certain distance from my sRNA. Let say, 'extract the gene sits next & after the sRNA, provided their distance from sRNA (gap between gene & sRNA) is less than 200bp'? Any idea?

ADD REPLYlink written 5.7 years ago by CikLa90

Add --dist to report the distance in bases for each queried CDS from its sRNA. Then use awk to filter the CDS results by the distance threshold of your choice. Edit: See edited answer for code.

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by Alex Reynolds31k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2195 users visited in the last hour