Question: extracting a specific field from .bed file and sort into new file
0
gravatar for g.sathish.k
2.1 years ago by
g.sathish.k0 wrote:

I have a test.bed file that contains the following:

chr12   85430091    85657002    ENSG00000133640.14  .   +   HAVANA  gene    .   ID=ENSG00000133640.14;gene_id=ENSG00000133640.14;transcript_id=ENSG00000133640.14;gene_type=protein_coding;gene_status=KNOWN;gene_name=LRRIQ1;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=LRRIQ1;level=1;havana_gene=OTTHUMG00000166185.4
chr12   85657321    85658235    ENSG00000269916.1   .   +   HAVANA  gene    .   ID=ENSG00000269916.1;gene_id=ENSG00000269916.1;transcript_id=ENSG00000269916.1;gene_type=lincRNA;gene_status=NOVEL;gene_name=RP11-193M21.1;transcript_type=lincRNA;transcript_status=NOVEL;transcript_name=RP11-193M21.1;level=2;havana_gene=OTTHUMG00000184141.1
chr12   85673884    85695562    ENSG00000180318.3   .   +   HAVANA  gene    .   ID=ENSG00000180318.3;gene_id=ENSG00000180318.3;transcript_id=ENSG00000180318.3;gene_type=protein_coding;gene_status=KNOWN;gene_name=ALX1;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=ALX1;level=2;havana_gene=OTTHUMG00000169820.1
chr12   85711837    85736690    ENSG00000258815.1   .   +   HAVANA  gene    .   ID=ENSG00000258815.1;gene_id=ENSG00000258815.1;transcript_id=ENSG00000258815.1;gene_type=lincRNA;gene_status=NOVEL;gene_name=RP11-408B11.2;transcript_type=lincRNA;transcript_status=NOVEL;transcript_name=RP11-408B11.2;level=2;havana_gene=OTTHUMG00000170606.1

I would like to extract gene names from the field "gene_name =" and sort into a new bed file. How would I go about doing that in shell/bash using bedops or not. I have multiple .bed files that I need to extract gene_name and sort into a new file. Thank you very much

python shell script bash bed • 1.3k views
ADD COMMENTlink modified 2.1 years ago by Pierre Lindenbaum124k • written 2.1 years ago by g.sathish.k0

What do you want the output to look like? Can you provide a line or two of where you want the name to go etc.?

ADD REPLYlink written 2.1 years ago by Alex Reynolds29k

the new file should contain list of all gene names (from gene_name field), nothing else. Sorry, I just realized while copy pasting it cut off some fields. Here is the complete details for one gene from which i need just the gene name

chr12 85065376 85065805 ENSG00000257296.1 . + HAVANA gene . ID=ENSG00000257296.1;gene_id=ENSG00000257296.1;transcript_id=ENSG00000257296.1;gene_type=pseudogene;gene_status=KNOWN;gene_name=RP11-701B6.1;transcript_type=pseudogene;transcript_status=KNOWN;transcript_name=RP11-701B6.1;level=1;tag=pseudo_consens;havana_gene=OTTHUMG00000169741.1

ADD REPLYlink written 2.1 years ago by g.sathish.k0
2
gravatar for Alex Reynolds
2.1 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

You could use awk:

$ awk 'match($0, /gene_name=(.*);transcript_type=.*/, a) {print a[1]}' in.bed | sort > out.txt

On your test input, this gives:

$ cat out.txt
ALX1
LRRIQ1
RP11-193M21.1
RP11-408B11.2

For multiple sorted BED files, you could take their multiset union:

$ bedops -u in1.bed in2.bed in3.bed ... inN.bed > inUnion.bed

Then, run awk on the unioned set:

$ awk 'match($0, /gene_name=(.*);transcript_type=.*/, a) {print a[1]}' inUnion.bed | sort > out.txt

Or pipe directly from the output of bedops:

$ bedops -u in1.bed in2.bed in3.bed ... inN.bed | awk 'match($0, /gene_name=(.*);transcript_type=.*/, a) {print a[1]}' | sort > out.txt

If you don't care about sort order, you could cat the BED files and pipe directly:

$ cat in1.bed in2.bed in3.bed ... inN.bed | awk 'match($0, /gene_name=(.*);transcript_type=.*/, a) {print a[1]}' | sort > out.txt

Etc.

If you're using OS X, this would assume Homebrew-installed gawk, i.e., brew install gawk.

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Alex Reynolds29k

Thank you so much, it worked! :)

ADD REPLYlink written 2.1 years ago by g.sathish.k0
2
gravatar for cpad0112
2.1 years ago by
cpad011212k
India
cpad011212k wrote:

OP:

the new file should contain list of all gene names (from gene_name field), nothing else.

Extracting gene names only and sort:

$ awk -F"[=;]" '{print $12}' test.txt | sort

or

$ cut -f6 -d";" test.txt | cut -f2 -d"=" | sort

Output:

ALX1
LRRIQ1
RP11-193M21.1
RP11-408B11.2

Input:

$ cat test.txt 
chr12   85430091    85657002    ENSG00000133640.14  .   +   HAVANA  gene    .   ID=ENSG00000133640.14;gene_id=ENSG00000133640.14;transcript_id=ENSG00000133640.14;gene_type=protein_coding;gene_status=KNOWN;gene_name=LRRIQ1;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=LRRIQ1;level=1;havana_gene=OTTHUMG00000166185.4
chr12   85657321    85658235    ENSG00000269916.1   .   +   HAVANA  gene    .   ID=ENSG00000269916.1;gene_id=ENSG00000269916.1;transcript_id=ENSG00000269916.1;gene_type=lincRNA;gene_status=NOVEL;gene_name=RP11-193M21.1;transcript_type=lincRNA;transcript_status=NOVEL;transcript_name=RP11-193M21.1;level=2;havana_gene=OTTHUMG00000184141.1
chr12   85673884    85695562    ENSG00000180318.3   .   +   HAVANA  gene    .   ID=ENSG00000180318.3;gene_id=ENSG00000180318.3;transcript_id=ENSG00000180318.3;gene_type=protein_coding;gene_status=KNOWN;gene_name=ALX1;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=ALX1;level=2;havana_gene=OTTHUMG00000169820.1
chr12   85711837    85736690    ENSG00000258815.1   .   +   HAVANA  gene    .   ID=ENSG00000258815.1;gene_id=ENSG00000258815.1;transcript_id=ENSG00000258815.1;gene_type=lincRNA;gene_status=NOVEL;gene_name=RP11-408B11.2;transcript_type=lincRNA;transcript_status=NOVEL;transcript_name=RP11-408B11.2;level=2;havana_gene=OTTHUMG00000170606.1

I have multiple .bed files that I need to extract gene_name and sort into a new file.

One file per bed or all genes in a single file?

If it is one file each for each bed and genes are sorted within each file ( assuming that bed files have .bed extension and they all are in the same folder and user has gnu-parallel installed):

$ ls *.bed |  parallel "awk -F'[;=]' '{print \$12}' {} |sort  > {.}.genes.txt "

There will be multiple output files with "genes.txt" appended to original bed name (without .bed extension).

If it is a single file with all the genes and sorted:

$ awk -F "[;=]" '{print $12}' *.bed | sort > genes_output.txt
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by cpad011212k

Thank you very much

ADD REPLYlink written 2.1 years ago by g.sathish.k0
1
gravatar for Pierre Lindenbaum
2.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

use awk to extract the gene_name, create a new column, sort on this column, remove the column:

awk -F '\t' '{n=split($10,a,/[;]/);G="";for(i=1;i<=n;++i) if(substr(a[i],1,10)=="gene_name=") G=substr(a[i],11); printf("%s\t%s\n",G,$0);}' input.tsv | sort -t $'\t' -k1,1 | cut -f 2-
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Pierre Lindenbaum124k

Thank you very much!

ADD REPLYlink written 2.1 years ago by g.sathish.k0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1957 users visited in the last hour