2.2 years ago
asalimih ▴ 60

Hi, I have a BED file containing the exons of genes. The name field holds the gene ID (e.g. ENSG***). When I run bedtools getfasta I get the sequence of each exon separately. Is there a standard way to concatenate the sequences that share the same gene name, or should I write a script to do this manually on the FASTA files?
The bedtools documentation mentions a -split switch, but it is only applicable to the BED12 format, and my BED files are not BED12.
Thanks in advance

You might try something like this with AGAT:
--bed file.bed -o file.gff
--gff file.gff --fasta file.fasta -t exon --merge -o merged_exon.fa
This produced an empty file. I assume file.fasta is the genome. Here is a sample of my BED file:

GL000009.2      56139   58376   ENSG00000278704.1       1       -
GL000194.1      53589   55676   ENSG00000277400.1       1       -
GL000194.1      53593   54832   ENSG00000274847.1       1       -
GL000194.1      55445   55676   ENSG00000274847.1       1       -
GL000194.1      112791  112850  ENSG00000274847.1       1       -
GL000194.1      112791  112850  ENSG00000277400.1       1       -
GL000194.1      114985  115018  ENSG00000277400.1       1       -
GL000194.1      114985  115055  ENSG00000274847.1       1       -
GL000195.1      37433   37534   ENSG00000277428.1       1       -
GL000195.1      42938   44923   ENSG00000276256.1       1       -
OK, that is because the first command creates gene features only, and the second removes gene features that have no sub-feature (mRNA, transcript, exon, etc.). This should work instead:

# Convert the bed6 to gff (exon features only)
--bed file.bed --primary_tag exon -o file.gff
# Replace the Name attribute with the Parent attribute
sed 's/Name=/Parent=/' file.gff > file2.gff
# Create a clean gff (optional step)
--gff file2.gff -o file_clean.gff
# Extract the exon sequences
--gff file_clean.gff --fasta file.fasta -t exon --merge -o merged_exon.fa

Yes, file.fasta is the genome from which the sequences will be extracted.
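
For the script-based route the question mentions, here is a minimal sketch that stays with bedtools. It assumes the tab-delimited output of bedtools getfasta (-tab, with -name so the first column carries the gene ID) and simply joins sequences that share a name, in input order. The function name concat_by_gene and the file name genome.fa are placeholders, not part of bedtools. The header clean-up is also an assumption: depending on the bedtools version, -name may append "::chrom:start-end" and -s may append "(+)" or "(-)", so both suffixes are stripped.

```shell
# concat_by_gene (hypothetical helper): reads "name<TAB>sequence" lines on stdin
# and prints one FASTA record per name, joining sequences in input order.
concat_by_gene() {
  awk -F'\t' '
    {
      name = $1
      sub(/::.*$/, "", name)       # drop "::chrom:start-end..." if this bedtools version adds it
      sub(/\([+-]\)$/, "", name)   # drop a trailing "(+)" or "(-)" added by -s
      if (!(name in seq)) order[++n] = name   # remember first-seen order of genes
      seq[name] = seq[name] $2                # append this exon to the gene
    }
    END {
      for (i = 1; i <= n; i++) printf ">%s\n%s\n", order[i], seq[order[i]]
    }
  '
}

# Intended use (needs bedtools and the genome, so it is left commented out):
# bedtools getfasta -fi genome.fa -bed file.bed -name -s -tab | concat_by_gene > merged_exon.fa
```

Two caveats. The exons are joined in the order they appear in the BED file, so for minus-strand genes (as in the sample above) the rows of each gene would need to be in descending start order to get the transcript 5' to 3' sequence when -s is used. Overlapping exons of the same gene are not collapsed, so they would be duplicated in the output unless merged beforehand.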


