Question

Extract flanking genes around sRNA from gff files

0

Entering edit mode

8.9 years ago

CikLa ▴ 90

Hi,

I have a batch of GFF files for bacteria genomes, which have been edited from the original GFF file from NCBI ftp site. In the edited GFF, I have inserted sRNA features. This is an example (dummy example) to illustrate the edited GFF:

##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_016077.1 1 2487765
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=568816
NC_016077.1 RefSeq  region  1   2487765 .   +   .   ID=id0;Dbxref=taxon:568816;Is_circular=true;gbkey=Src;genome=chromosome;mol_type=genomic DNA;old-name=Acidaminococcus intestinalis RYC-MR95;strain=RyC-MR95
NC_016077.1 RefSeq  CDS 175 1551    .   +   0   ID=cds0;Name=YP_004895395.1;Parent=gene0;Note=COG0593;Dbxref=Genbank:YP_004895395.1,GeneID:11266123;gbkey=CDS;product=chromosomal replication initiator protein;protein_id=YP_004895395.1;transl_table=11
NC_016077.1 RefSeq  CDS 1626    1778    .   +   0   ID=cds1;Name=YP_004895396.1;Parent=gene1;Dbxref=Genbank:YP_004895396.1,GeneID:11264393;gbkey=CDS;product=hypothetical protein;protein_id=YP_004895396.1;transl_table=11
NC_016077.1 RefSeq  sRNA    1799    1825    .   +   0   ID=sRNA1;Name=sRNA1;Parent=gene2;Dbxref=Genbank:YP_004895397.1,GeneID:11264394;gbkey=sRNA;product=sRNA 1;protein_id=YP_004895397.1;transl_table=11
NC_016077.1 RefSeq  CDS 1829    2947    .   +   0   ID=cds2;Name=YP_004895397.1;Parent=gene2;Dbxref=Genbank:YP_004895397.1,GeneID:11264394;gbkey=CDS;product=DNA polymerase III;protein_id=YP_004895397.1;transl_table=11
NC_016077.1 RefSeq  CDS 2953    4101    .   +   0   ID=cds3;Name=YP_004895398.1;Parent=gene3;Dbxref=Genbank:YP_004895398.1,GeneID:11264395;gbkey=CDS;product=recombination protein F;protein_id=YP_004895398.1;transl_table=11

So the bold one is sRNA. I want to extract a gene before and a gene after the sRNA sequence, in this case they are cds1 and cds2. The output should be in tab delimited format so that I could extract the fasta sequence then. In a GFF, there are multiple sRNAs to extract and I have many GFF files to perform this task.

Anyone mind to to help?

Regards

sRNA • 3.0k views

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by CikLa ▴ 90

Ram · Answer 1 · 2015-05-29

3

Entering edit mode

8.9 years ago

Alex Reynolds 35k

Use BEDOPS gff2bed to convert GFF to BED, awk to create CDS and sRNA subsets, and BEDOPS closest-features to extract the nearest CDS features to sRNA elements:

$ gff2bed < features.gff > features.bed
$ awk '$8=="CDS"' features.bed > CDS.bed
$ awk '$8=="sRNA"' features.bed > sRNA.bed
$ closest-features --no-ref --delim '\n' sRNA.bed CDS.bed | sort-bed - > nearest_CDS_per_sRNA.bed

Then you can take the BED file nearest_CDS_per_sRNA.bed and convert it to FASTA with bed2fasta or similar approaches.

If you need to filter by distance, add the --dist option, remove the --delim option, and use awk to place a distance threshold on results:

$ closest-features --no-ref --dist sRNA.bed CDS.bed |
      awk -v threshold=200 '{
          n = split($0, a, "|");
          for (i = 0; I < n; i++) {
              if ((i % 2) && (a[i+1] <= threshold)) {
                  print a[i]"\t"a[i+1];
              }
          }
      }' - |
      sort-bed - > thresholded_nearest_CDS_per_sRNA.bed

This can be run directly on the command line (e.g., within a bash shell) or if you plan on doing this frequently, you could write this into a shell script or makefile.

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by Alex Reynolds 35k

0

Entering edit mode

Thanks for the help. I will try this one, looks like it might work well.Having said that I have a lot of GFF to be done, I believe it's just to run the script in bash file, am I right?

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by CikLa ▴ 90

0

Entering edit mode

Just curious here, if I am not mistaken, it will take the closest feature to the sRNA that we want right? In my case, I want the one upstream and one downstream of the sRNA, meaning that, for an sRNA, it should have 2 flanking genes.

Another thing is, what if I want to restrict the extraction to certain distance from my sRNA. Let say, 'extract the gene sits next & after the sRNA, provided their distance from sRNA (gap between gene & sRNA) is less than 200bp'? Any idea?

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by CikLa ▴ 90

0

Entering edit mode

Add --dist to report the distance in bases for each queried CDS from its sRNA. Then use awk to filter the CDS results by the distance threshold of your choice. Edit: See edited answer for code.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by Alex Reynolds 35k