extracting a specific field from .bed file and sort into new file
3
0
Entering edit mode
6.5 years ago

I have a test.bed file that contains the following:

chr12   85430091    85657002    ENSG00000133640.14  .   +   HAVANA  gene    .   ID=ENSG00000133640.14;gene_id=ENSG00000133640.14;transcript_id=ENSG00000133640.14;gene_type=protein_coding;gene_status=KNOWN;gene_name=LRRIQ1;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=LRRIQ1;level=1;havana_gene=OTTHUMG00000166185.4
chr12   85657321    85658235    ENSG00000269916.1   .   +   HAVANA  gene    .   ID=ENSG00000269916.1;gene_id=ENSG00000269916.1;transcript_id=ENSG00000269916.1;gene_type=lincRNA;gene_status=NOVEL;gene_name=RP11-193M21.1;transcript_type=lincRNA;transcript_status=NOVEL;transcript_name=RP11-193M21.1;level=2;havana_gene=OTTHUMG00000184141.1
chr12   85673884    85695562    ENSG00000180318.3   .   +   HAVANA  gene    .   ID=ENSG00000180318.3;gene_id=ENSG00000180318.3;transcript_id=ENSG00000180318.3;gene_type=protein_coding;gene_status=KNOWN;gene_name=ALX1;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=ALX1;level=2;havana_gene=OTTHUMG00000169820.1
chr12   85711837    85736690    ENSG00000258815.1   .   +   HAVANA  gene    .   ID=ENSG00000258815.1;gene_id=ENSG00000258815.1;transcript_id=ENSG00000258815.1;gene_type=lincRNA;gene_status=NOVEL;gene_name=RP11-408B11.2;transcript_type=lincRNA;transcript_status=NOVEL;transcript_name=RP11-408B11.2;level=2;havana_gene=OTTHUMG00000170606.1

I would like to extract gene names from the field "gene_name =" and sort into a new bed file. How would I go about doing that in shell/bash using bedops or not. I have multiple .bed files that I need to extract gene_name and sort into a new file. Thank you very much

bed bash shell script python • 3.7k views
ADD COMMENT
0
Entering edit mode

What do you want the output to look like? Can you provide a line or two of where you want the name to go etc.?

ADD REPLY
0
Entering edit mode

the new file should contain list of all gene names (from gene_name field), nothing else. Sorry, I just realized while copy pasting it cut off some fields. Here is the complete details for one gene from which i need just the gene name

chr12 85065376 85065805 ENSG00000257296.1 . + HAVANA gene . ID=ENSG00000257296.1;gene_id=ENSG00000257296.1;transcript_id=ENSG00000257296.1;gene_type=pseudogene;gene_status=KNOWN;gene_name=RP11-701B6.1;transcript_type=pseudogene;transcript_status=KNOWN;transcript_name=RP11-701B6.1;level=1;tag=pseudo_consens;havana_gene=OTTHUMG00000169741.1

ADD REPLY
2
Entering edit mode
6.5 years ago

You could use awk:

$ awk 'match($0, /gene_name=(.*);transcript_type=.*/, a) {print a[1]}' in.bed | sort > out.txt

On your test input, this gives:

$ cat out.txt
ALX1
LRRIQ1
RP11-193M21.1
RP11-408B11.2

For multiple sorted BED files, you could take their multiset union:

$ bedops -u in1.bed in2.bed in3.bed ... inN.bed > inUnion.bed

Then, run awk on the unioned set:

$ awk 'match($0, /gene_name=(.*);transcript_type=.*/, a) {print a[1]}' inUnion.bed | sort > out.txt

Or pipe directly from the output of bedops:

$ bedops -u in1.bed in2.bed in3.bed ... inN.bed | awk 'match($0, /gene_name=(.*);transcript_type=.*/, a) {print a[1]}' | sort > out.txt

If you don't care about sort order, you could cat the BED files and pipe directly:

$ cat in1.bed in2.bed in3.bed ... inN.bed | awk 'match($0, /gene_name=(.*);transcript_type=.*/, a) {print a[1]}' | sort > out.txt

Etc.

If you're using OS X, this would assume Homebrew-installed gawk, i.e., brew install gawk.

ADD COMMENT
0
Entering edit mode

Thank you so much, it worked! :)

ADD REPLY
2
Entering edit mode
6.5 years ago

OP:

the new file should contain list of all gene names (from gene_name field), nothing else.

Extracting gene names only and sort:

$ awk -F"[=;]" '{print $12}' test.txt | sort

or

$ cut -f6 -d";" test.txt | cut -f2 -d"=" | sort

Output:

ALX1
LRRIQ1
RP11-193M21.1
RP11-408B11.2

Input:

$ cat test.txt 
chr12   85430091    85657002    ENSG00000133640.14  .   +   HAVANA  gene    .   ID=ENSG00000133640.14;gene_id=ENSG00000133640.14;transcript_id=ENSG00000133640.14;gene_type=protein_coding;gene_status=KNOWN;gene_name=LRRIQ1;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=LRRIQ1;level=1;havana_gene=OTTHUMG00000166185.4
chr12   85657321    85658235    ENSG00000269916.1   .   +   HAVANA  gene    .   ID=ENSG00000269916.1;gene_id=ENSG00000269916.1;transcript_id=ENSG00000269916.1;gene_type=lincRNA;gene_status=NOVEL;gene_name=RP11-193M21.1;transcript_type=lincRNA;transcript_status=NOVEL;transcript_name=RP11-193M21.1;level=2;havana_gene=OTTHUMG00000184141.1
chr12   85673884    85695562    ENSG00000180318.3   .   +   HAVANA  gene    .   ID=ENSG00000180318.3;gene_id=ENSG00000180318.3;transcript_id=ENSG00000180318.3;gene_type=protein_coding;gene_status=KNOWN;gene_name=ALX1;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=ALX1;level=2;havana_gene=OTTHUMG00000169820.1
chr12   85711837    85736690    ENSG00000258815.1   .   +   HAVANA  gene    .   ID=ENSG00000258815.1;gene_id=ENSG00000258815.1;transcript_id=ENSG00000258815.1;gene_type=lincRNA;gene_status=NOVEL;gene_name=RP11-408B11.2;transcript_type=lincRNA;transcript_status=NOVEL;transcript_name=RP11-408B11.2;level=2;havana_gene=OTTHUMG00000170606.1

I have multiple .bed files that I need to extract gene_name and sort into a new file.

One file per bed or all genes in a single file?

If it is one file each for each bed and genes are sorted within each file ( assuming that bed files have .bed extension and they all are in the same folder and user has gnu-parallel installed):

$ ls *.bed |  parallel "awk -F'[;=]' '{print \$12}' {} |sort  > {.}.genes.txt "

There will be multiple output files with "genes.txt" appended to original bed name (without .bed extension).

If it is a single file with all the genes and sorted:

$ awk -F "[;=]" '{print $12}' *.bed | sort > genes_output.txt
ADD COMMENT
0
Entering edit mode

Thank you very much

ADD REPLY
1
Entering edit mode
6.5 years ago

use awk to extract the gene_name, create a new column, sort on this column, remove the column:

awk -F '\t' '{n=split($10,a,/[;]/);G="";for(i=1;i<=n;++i) if(substr(a[i],1,10)=="gene_name=") G=substr(a[i],11); printf("%s\t%s\n",G,$0);}' input.tsv | sort -t $'\t' -k1,1 | cut -f 2-
ADD COMMENT
0
Entering edit mode

Thank you very much!

ADD REPLY

Login before adding your answer.

Traffic: 2744 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6