Question: Extracting string from every other line
3.0 years ago, zoegward wrote:

I have an amended fasta file like so:

>ENST00000517147.1 ncrna chromosome:GRCh38:1:9437669:9437778:-1 gene:ENSG00000252956.1 gene_biotype:rRNA transcript_biotype:rRNA gene_symbol:RNA5SP40 description:RNA, 5S ribosomal pseudogene 40 [Source:HGNC Symbol;Acc:HGNC:42816]
>ENST00000576449.1 ncrna chromosome:GRCh38:CHR_HSCHR18_1_CTG1_1:50319002:50319120:1 gene:ENSG00000262132.1 gene_biotype:rRNA transcript_biotype:rRNA gene_symbol:RNA5SP458 description:RNA, 5S ribosomal pseudogene 458 [Source:HGNC Symbol;Acc:HGNC:43358]

and an amended gtf file like so:

1       ENSEMBL gene    9437669 9437778 .       -       .       gene_id "ENSG00000252956.1"; gene_type "rRNA"; gene_status "KNOWN"; gene_name "RNA5SP40"; level 3;
1       ENSEMBL transcript      9437669 9437778 .       -       .       gene_id "ENSG00000252956.1"; transcript_id "ENST00000517147.1"; gene_type "rRNA"; gene_status "KNOWN"; gene_name "RNA5SP40"; transcript_type "rRNA"; transcript_status "KNOWN"; transcript_name "RNA5SP40-201"; level 3; transcript_support_level "NA"; tag "basic";
1       ENSEMBL exon    9437669 9437778 .       -       .       gene_id "ENSG00000252956.1"; transcript_id "ENST00000517147.1"; gene_type "rRNA"; gene_status "KNOWN"; gene_name "RNA5SP40"; transcript_type "rRNA"; transcript_status "KNOWN"; transcript_name "RNA5SP40-201"; exon_number 1; exon_id "ENSE00002089424.1"; level 3; transcript_support_level "NA"; tag "basic";

I want to extract the gene_name from the gtf file, i.e. RNA5SP40, and the corresponding ENSG* ID from either the gtf or fasta file, and then print the matching fasta sequence on the following line, i.e.:


I am a complete beginner at programming and don't really know where to start. I could probably use awk to extract the gene name and ENSG* ID from the same file, but I wouldn't know how to match these to print out the fasta sequence from the other file. Please help!

3.0 years ago, Macspider (Vienna - BOKU) wrote:
sed -e 's/.*gene:/>/' -e 's/gene_biotype.*gene_symbol://' -e 's/description.*//' FILE.fasta | awk '{if (substr($0, 1, 1)==">") {print $1"|"$2} else {print $0}}'

This will:

  1. replace everything up to and including "gene:" with just ">"
  2. remove the part from "gene_biotype" through "gene_symbol:"
  3. remove everything from "description" on
  4. join the two remaining strings (gene ID and gene symbol) with a pipe ("|"), only on the fasta header lines (sequence lines stay as they are)


This is not guaranteed to work 100% of the time: some of your sequences might have the header fields in a different order (even though usually they don't). That is why it is useful to know a language with dictionary support (Python, Perl, etc.), so that you can parse the header into key:value pairs and retrieve only the values you want by key.
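
To illustrate the dictionary idea, here is a minimal Python sketch. It assumes every header looks like the ones in the question: a leading transcript ID, then space-separated key:value fields, with a free-text "description:" field last. The function names (parse_header, relabel), the "transcript_id" key and the "NA" placeholder are made up for this example and are not part of any library.

```python
def parse_header(header):
    """Turn one Ensembl-style FASTA header into a dict of key:value fields."""
    # The description is free text with spaces and colons, so split it off first.
    head, _, description = header.strip().lstrip(">").partition(" description:")
    tokens = head.split()
    fields = {}
    if tokens:
        # "transcript_id" is a hypothetical key name for the leading ID.
        fields["transcript_id"] = tokens[0]
    for token in tokens[1:]:
        # Split on the first colon only, so "chromosome:GRCh38:1:..." keeps its value whole.
        key, sep, value = token.partition(":")
        if sep:  # skip bare words such as "ncrna"
            fields[key] = value
    if description:
        fields["description"] = description
    return fields


def relabel(lines):
    """Yield FASTA lines with each header rewritten as >gene|gene_symbol."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = parse_header(line)
            # "NA" is an assumed placeholder for headers missing a field.
            yield ">{}|{}".format(fields.get("gene", "NA"),
                                  fields.get("gene_symbol", "NA"))
        else:
            yield line  # sequence lines pass through unchanged


# Example with the first header from the question.
example = (">ENST00000517147.1 ncrna chromosome:GRCh38:1:9437669:9437778:-1 "
           "gene:ENSG00000252956.1 gene_biotype:rRNA transcript_biotype:rRNA "
           "gene_symbol:RNA5SP40 description:RNA, 5S ribosomal pseudogene 40 "
           "[Source:HGNC Symbol;Acc:HGNC:42816]")
print(next(relabel([example])))  # prints >ENSG00000252956.1|RNA5SP40
```

Because the values are looked up by key, this keeps working if "gene:" and "gene_symbol:" appear in a different order. To run it on a whole file you could pass an open file handle to relabel() and print each line it yields.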
