Question: Extracting string from every other line
zoegward wrote (2.2 years ago):

I have an amended fasta file like so:

>ENST00000517147.1 ncrna chromosome:GRCh38:1:9437669:9437778:-1 gene:ENSG00000252956.1 gene_biotype:rRNA transcript_biotype:rRNA gene_symbol:RNA5SP40 description:RNA, 5S ribosomal pseudogene 40 [Source:HGNC Symbol;Acc:HGNC:42816]
GTCTATGGCCATTGCACCCTGAACGTGCCAGATCTTGTCTCATCTTGGAAGCTAAGCAGGGTTGGGCTTGGAGGGGAGGAGGGTGAACCTCAGTTCAGGTTACTTAGCCT
>ENST00000576449.1 ncrna chromosome:GRCh38:CHR_HSCHR18_1_CTG1_1:50319002:50319120:1 gene:ENSG00000262132.1 gene_biotype:rRNA transcript_biotype:rRNA gene_symbol:RNA5SP458 description:RNA, 5S ribosomal pseudogene 458 [Source:HGNC Symbol;Acc:HGNC:43358]
TTTCTATGGCATACCAACCTGAGTGTGCCCAGTCTCATCCAATCTCAGAACGTAAGCAGGATTGGGCCTGGTTAGAACTTGGATGGGAAAATGCCAGTTAAAATCTGTACTAAAAAATT

and an amended gtf file like so:

1       ENSEMBL gene    9437669 9437778 .       -       .       gene_id "ENSG00000252956.1"; gene_type "rRNA"; gene_status "KNOWN"; gene_name "RNA5SP40"; level 3;
1       ENSEMBL transcript      9437669 9437778 .       -       .       gene_id "ENSG00000252956.1"; transcript_id "ENST00000517147.1"; gene_type "rRNA"; gene_status "KNOWN"; gene_name "RNA5SP40"; transcript_type "rRNA"; transcript_status "KNOWN"; transcript_name "RNA5SP40-201"; level 3; transcript_support_level "NA"; tag "basic";
1       ENSEMBL exon    9437669 9437778 .       -       .       gene_id "ENSG00000252956.1"; transcript_id "ENST00000517147.1"; gene_type "rRNA"; gene_status "KNOWN"; gene_name "RNA5SP40"; transcript_type "rRNA"; transcript_status "KNOWN"; transcript_name "RNA5SP40-201"; exon_number 1; exon_id "ENSE00002089424.1"; level 3; transcript_support_level "NA"; tag "basic";

I want to extract the gene_name from the gtf file, i.e. RNA5SP40, and the corresponding ENSG* ID from either the gtf or the fasta file, and then print the matching fasta sequence on the following line, i.e.:

RNA5SP40|ENSG00000252956.1
GTCTATGGCCATTGCACCCTGAACGTGCCAGATCTTGTCTCATCTTGGAAGCTAAGCAGGGTTGGGCTTGGAGGGGAGGAGGGTGAACCTCAGTTCAGGTTACTTAGCCT

I am a complete beginner at programming and don't really know where to start. I could probably use awk to extract the gene name and ENSG* ID from the same file, but I wouldn't know how to match these up to print the fasta sequence from the other file. Please help!

Macspider (Vienna - BOKU) wrote (2.2 years ago):
    sed -e 's/.*gene:/>/' -e 's/gene_biotype.*gene_symbol://' -e 's/description.*//' FILE.fasta | awk '/^>/ {print $1 "|" $2; next} {print}'

This will:

  1. replace everything up to and including "gene:" with just ">"
  2. remove everything from "gene_biotype" through "gene_symbol:"
  3. remove everything from "description" to the end of the line
  4. join the two remaining strings with a pipe ("|"), only on the fasta header lines (sequence lines stay as they are)

    >ENSG00000262132.1|RNA5SP458
    TTTCTATGGCATACCAACCTGAGTGTGCCCAGTCTCATCCAATCTCAGAACGTAAGCAGGATTGGGCCTGGTTAGAACTTGGATGGGAAAATGCCAGTTAAAATCTGTACTAAAAAATT
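To make the four steps easier to follow, here is a minimal Python sketch that applies the same substitutions one at a time to the first header from the question. The function name `relabel` is made up for illustration, and the sketch assumes every header follows the same field order as the example.

```python
import re

# First header from the question, used as sample input.
header = (">ENST00000517147.1 ncrna chromosome:GRCh38:1:9437669:9437778:-1 "
          "gene:ENSG00000252956.1 gene_biotype:rRNA transcript_biotype:rRNA "
          "gene_symbol:RNA5SP40 description:RNA, 5S ribosomal pseudogene 40 "
          "[Source:HGNC Symbol;Acc:HGNC:42816]")


def relabel(line):
    """Apply the four steps to a header line; return sequence lines unchanged."""
    if not line.startswith(">"):
        return line
    # Step 1: replace everything up to and including "gene:" with ">".
    line = re.sub(r".*gene:", ">", line)
    # Step 2: drop everything from "gene_biotype" through "gene_symbol:".
    line = re.sub(r"gene_biotype.*gene_symbol:", "", line)
    # Step 3: drop the description and whatever follows it.
    line = re.sub(r"description.*", "", line)
    # Step 4: join the two remaining whitespace-separated fields with "|".
    fields = line.split()
    return fields[0] + "|" + fields[1]


print(relabel(header))  # >ENSG00000252956.1|RNA5SP40
```

As with the shell version, the result depends on the fields appearing in this exact order.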

The sed/awk command is not guaranteed to work for every record: some of your sequences might have the name fields in a different order (even though usually they don't). That is why it is useful to know a language with dictionary support (Python, Perl, etc.), so that you can parse each header into key:value pairs and retrieve only the fields you want by key.
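A minimal sketch of that dictionary approach in Python might look like the following. The function names `parse_header` and `relabel_fasta` are invented for this example, and it assumes headers in the format shown in the question, where the fields before "description:" are space-separated `key:value` tokens.

```python
def parse_header(header):
    """Return the key:value fields of a FASTA header as a dict."""
    # Cut off the free-text description first: it contains spaces and colons.
    main = header.split(" description:", 1)[0]
    fields = {}
    for token in main.lstrip(">").split():
        key, sep, value = token.partition(":")
        if sep:  # tokens without a colon (transcript ID, "ncrna") are skipped
            fields[key] = value
    return fields


def relabel_fasta(lines):
    """Yield 'gene_symbol|gene' for header lines; pass sequence lines through."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = parse_header(line)
            # Lookup is by key, so the order of fields in the header is irrelevant.
            yield fields["gene_symbol"] + "|" + fields["gene"]
        else:
            yield line


# Sample record from the question (sequence shortened for readability).
records = [
    ">ENST00000517147.1 ncrna chromosome:GRCh38:1:9437669:9437778:-1 "
    "gene:ENSG00000252956.1 gene_biotype:rRNA transcript_biotype:rRNA "
    "gene_symbol:RNA5SP40 description:RNA, 5S ribosomal pseudogene 40 "
    "[Source:HGNC Symbol;Acc:HGNC:42816]",
    "GTCTATGGCCATTGCACCCTGAACG",
]
for out in relabel_fasta(records):
    print(out)
# RNA5SP40|ENSG00000252956.1
# GTCTATGGCCATTGCACCCTGAACG
```

To run it on a real file, pass an open file handle instead of the list, e.g. `relabel_fasta(open("FILE.fasta"))`. A header that lacks `gene_symbol` or `gene` raises a `KeyError`, which at least makes malformed records visible rather than silently mislabelling them.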
