Question

extracting a specific field from .bed file and sort into new file

0

Entering edit mode

6.5 years ago

g.sathish.k • 0

I have a test.bed file that contains the following:

chr12   85430091    85657002    ENSG00000133640.14  .   +   HAVANA  gene    .   ID=ENSG00000133640.14;gene_id=ENSG00000133640.14;transcript_id=ENSG00000133640.14;gene_type=protein_coding;gene_status=KNOWN;gene_name=LRRIQ1;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=LRRIQ1;level=1;havana_gene=OTTHUMG00000166185.4
chr12   85657321    85658235    ENSG00000269916.1   .   +   HAVANA  gene    .   ID=ENSG00000269916.1;gene_id=ENSG00000269916.1;transcript_id=ENSG00000269916.1;gene_type=lincRNA;gene_status=NOVEL;gene_name=RP11-193M21.1;transcript_type=lincRNA;transcript_status=NOVEL;transcript_name=RP11-193M21.1;level=2;havana_gene=OTTHUMG00000184141.1
chr12   85673884    85695562    ENSG00000180318.3   .   +   HAVANA  gene    .   ID=ENSG00000180318.3;gene_id=ENSG00000180318.3;transcript_id=ENSG00000180318.3;gene_type=protein_coding;gene_status=KNOWN;gene_name=ALX1;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=ALX1;level=2;havana_gene=OTTHUMG00000169820.1
chr12   85711837    85736690    ENSG00000258815.1   .   +   HAVANA  gene    .   ID=ENSG00000258815.1;gene_id=ENSG00000258815.1;transcript_id=ENSG00000258815.1;gene_type=lincRNA;gene_status=NOVEL;gene_name=RP11-408B11.2;transcript_type=lincRNA;transcript_status=NOVEL;transcript_name=RP11-408B11.2;level=2;havana_gene=OTTHUMG00000170606.1

I would like to extract gene names from the field "gene_name =" and sort into a new bed file. How would I go about doing that in shell/bash using bedops or not. I have multiple .bed files that I need to extract gene_name and sort into a new file. Thank you very much

bed bash shell script python • 3.7k views

ADD COMMENT • link updated 6.5 years ago by Pierre Lindenbaum 161k • written 6.5 years ago by g.sathish.k • 0

0

Entering edit mode

What do you want the output to look like? Can you provide a line or two of where you want the name to go etc.?

ADD REPLY • link 6.5 years ago by Alex Reynolds 35k

0

Entering edit mode

the new file should contain list of all gene names (from gene_name field), nothing else. Sorry, I just realized while copy pasting it cut off some fields. Here is the complete details for one gene from which i need just the gene name

chr12 85065376 85065805 ENSG00000257296.1 . + HAVANA gene . ID=ENSG00000257296.1;gene_id=ENSG00000257296.1;transcript_id=ENSG00000257296.1;gene_type=pseudogene;gene_status=KNOWN;gene_name=RP11-701B6.1;transcript_type=pseudogene;transcript_status=KNOWN;transcript_name=RP11-701B6.1;level=1;tag=pseudo_consens;havana_gene=OTTHUMG00000169741.1

ADD REPLY • link 6.5 years ago by g.sathish.k • 0

2

Entering edit mode

6.5 years ago

cpad0112 21k

OP:

the new file should contain list of all gene names (from gene_name field), nothing else.

Extracting gene names only and sort:

$ awk -F"[=;]" '{print $12}' test.txt | sort

or

$ cut -f6 -d";" test.txt | cut -f2 -d"=" | sort

Output:

ALX1
LRRIQ1
RP11-193M21.1
RP11-408B11.2

Input:

$ cat test.txt 
chr12   85430091    85657002    ENSG00000133640.14  .   +   HAVANA  gene    .   ID=ENSG00000133640.14;gene_id=ENSG00000133640.14;transcript_id=ENSG00000133640.14;gene_type=protein_coding;gene_status=KNOWN;gene_name=LRRIQ1;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=LRRIQ1;level=1;havana_gene=OTTHUMG00000166185.4
chr12   85657321    85658235    ENSG00000269916.1   .   +   HAVANA  gene    .   ID=ENSG00000269916.1;gene_id=ENSG00000269916.1;transcript_id=ENSG00000269916.1;gene_type=lincRNA;gene_status=NOVEL;gene_name=RP11-193M21.1;transcript_type=lincRNA;transcript_status=NOVEL;transcript_name=RP11-193M21.1;level=2;havana_gene=OTTHUMG00000184141.1
chr12   85673884    85695562    ENSG00000180318.3   .   +   HAVANA  gene    .   ID=ENSG00000180318.3;gene_id=ENSG00000180318.3;transcript_id=ENSG00000180318.3;gene_type=protein_coding;gene_status=KNOWN;gene_name=ALX1;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=ALX1;level=2;havana_gene=OTTHUMG00000169820.1
chr12   85711837    85736690    ENSG00000258815.1   .   +   HAVANA  gene    .   ID=ENSG00000258815.1;gene_id=ENSG00000258815.1;transcript_id=ENSG00000258815.1;gene_type=lincRNA;gene_status=NOVEL;gene_name=RP11-408B11.2;transcript_type=lincRNA;transcript_status=NOVEL;transcript_name=RP11-408B11.2;level=2;havana_gene=OTTHUMG00000170606.1

I have multiple .bed files that I need to extract gene_name and sort into a new file.

One file per bed or all genes in a single file?

If it is one file each for each bed and genes are sorted within each file ( assuming that bed files have .bed extension and they all are in the same folder and user has gnu-parallel installed):

$ ls *.bed |  parallel "awk -F'[;=]' '{print \$12}' {} |sort  > {.}.genes.txt "

There will be multiple output files with "genes.txt" appended to original bed name (without .bed extension).

If it is a single file with all the genes and sorted:

$ awk -F "[;=]" '{print $12}' *.bed | sort > genes_output.txt

ADD COMMENT • link 6.5 years ago by cpad0112 21k

0

Entering edit mode

Thank you very much

ADD REPLY • link 6.5 years ago by g.sathish.k • 0

1

Entering edit mode

6.5 years ago

Pierre Lindenbaum 161k

use awk to extract the gene_name, create a new column, sort on this column, remove the column:

awk -F '\t' '{n=split($10,a,/[;]/);G="";for(i=1;i<=n;++i) if(substr(a[i],1,10)=="gene_name=") G=substr(a[i],11); printf("%s\t%s\n",G,$0);}' input.tsv | sort -t $'\t' -k1,1 | cut -f 2-

ADD COMMENT • link 6.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you very much!

ADD REPLY • link 6.5 years ago by g.sathish.k • 0

score 2 · Accepted Answer · 2017-10-25

You could use awk:

$ awk 'match($0, /gene_name=(.*);transcript_type=.*/, a) {print a[1]}' in.bed | sort > out.txt

On your test input, this gives:

$ cat out.txt
ALX1
LRRIQ1
RP11-193M21.1
RP11-408B11.2

For multiple sorted BED files, you could take their multiset union:

$ bedops -u in1.bed in2.bed in3.bed ... inN.bed > inUnion.bed

Then, run awk on the unioned set:

$ awk 'match($0, /gene_name=(.*);transcript_type=.*/, a) {print a[1]}' inUnion.bed | sort > out.txt

Or pipe directly from the output of bedops:

$ bedops -u in1.bed in2.bed in3.bed ... inN.bed | awk 'match($0, /gene_name=(.*);transcript_type=.*/, a) {print a[1]}' | sort > out.txt

If you don't care about sort order, you could cat the BED files and pipe directly:

$ cat in1.bed in2.bed in3.bed ... inN.bed | awk 'match($0, /gene_name=(.*);transcript_type=.*/, a) {print a[1]}' | sort > out.txt

Etc.

If you're using OS X, this would assume Homebrew-installed gawk, i.e., brew install gawk.