Question: Extract gene location and gene name from bed file for FASTA file
2.0 years ago, saamar.rajput (Germany) wrote:

I have two files: a FASTA file and a GFF file. The FASTA file looks like this:

 head Fasta

    >NC_002929.2 Bordetella pertussis Tohama I chromosome, complete genome
    ATGGATTTTCCCCGCGAATTTGATGTGATCGTCGTTGGTGGCGGTCACGCCGGTACGGAGGCAGCCCTGGCTGCAGCCCG
    CGCCGGCGCACAGACATTGCTGCTTACCCACAATATCGAGACCCTGGGCCAAATGTCCTGCAATCCCTCCATCGGGGGGA
    TAGGCAAGGGTCATTTGGTCAAGGAAGTCGATGCGTTGGGCGGCGCGATGGCTATCGCCACCGACGAGGCAGGTATCCAA
    TTCCGTATTCTCAACAGCTCCAAGGGGCCAGCGGTACGTGCCACGCGTGCCCAAGCCGACCGGGTGCTGTACCGAAACGC
    CATACGTGCACAGCTCGAGAACCAGCCCAACCTCTGGCTGTTCCAGCAGGCGGTGGACGATCTGATGGTGCAGGGCGACC
    AGGTGGTGGGCGCCGTTACGCAGATCGGGTTGCGCTTTCGTGCCCGTACCGTGGTGCTGACGGCTGGGACCTTCCTCAAC
    GGTTTGATTCACGTGGGGCTGCAGAACTATTCCGGAGGGCGGGCAGGGGATCCTCCCGCCAATTCCCTGGGCCAGCGGCT
    CAAGGAGCTGCAACTTCCGCAAGGCCGCCTGAAAACTGGCACGCCGCCGCGCATCGACGGACGCAGCATCAACTACAGTG
    TGTTGGAAGAGCAGCCCGGCGATCTTGATCCCGTGCCGGTGTTCTCGTTCCTGGGCAAGGCCTCCATGCACCCGCGCCAG
    CTGCCTTGCTGGATCACGCATACCAATGCCCGCACGCACGAAATCATCCGTGGCGGTCTGGACCGTTCGCCCATGTACAG
    TGGGGTCATCGAAGGAGTGGGGCCTCGTTACTGCCCATCCATCGAGGACAAGATCCATCGTTTTGCGGACAAGGCATCGC
    ACCAGGTATTCCTGGAACCGGAAGGCCTGAATACCCATGAGATCTATCCGAACGGTGTTTCCACCAGCCTGCCTTTCGAT
    GTGCAGTACGAGTTGATCCATTCCCTGCCCGGACTGG

Then I have the GFF file:

head gff

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM19571v1
#!genome-build-accession NCBI_Assembly:GCF_000195715.1
##sequence-region NC_002929.2 1 4086189
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=257313
NC_002929.2 RefSeq  region  1   4086189 .   +   .   ID=id0;Dbxref=taxon:257313;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;old-name=Bordetella pertussis;strain=Tohama I
NC_002929.2 RefSeq  gene    1   1920    .   +   .   ID=gene0;Dbxref=GeneID:2664547;Name=gidA;gbkey=Gene;gene=gidA;gene_biotype=protein_coding;locus_tag=BP0001
NC_002929.2 RefSeq  CDS 1   1920    .   +   0   ID=cds0;Parent=gene0;Dbxref=Genbank:NP_878920.1,GeneID:2664547;Name=NP_878920.1;Note=GidA%3B glucose-inhibited cell division protein A%3B involved in the 5-carboxymethylaminomethyl modification (mnm(5)s(2)U) of the wobble uridine base in some tRNAs;gbkey=CDS;gene=gidA;product=tRNA uridine 5-carboxymethylaminomethyl modification protein;protein_id=NP_878920.1;transl_table=11
NC_002929.2 RefSeq  sequence_feature    19  1893    .   +   .   ID=id1;Dbxref=GeneID:2664547;Note=HMMPfam hit to PF01134%2C Glucose inhibited division protein A;gbkey=misc_feature;gene=gidA
NC_002929.2 RefSeq  sequence_feature    820 864 .   +   .   ID=id2;Dbxref=GeneID:2664547;Note=ScanRegExp hit to PS01280%2C Glucose inhibited division protein A family signature 1. Confirmed by InterPro eMOTIF pattern match.;gbkey=misc_feature;gene=gidA

I want to use the FASTA file as a mapping file, but I need to convert the FASTA file to a BED file first. I tried:

bedtools getfasta -fi Fasta -bed gff -tab -fo Testing

It gives output like this:

NC_002929.2:2605-3403   ATGAAAAACATACCGCCCAGCAAGTCCGCCCGCGTGTTCTGCATCGCCAACCAGAAGGGCGGCGTCGGCAAGACCACCACCGCCATCAACCTTGCGGCTGGCCTGGCTACGCACAAGCAGCGGGTGCTGCTGGTCGATCTCGATCCGCAGGGCAACGCCACCATGGGCAGCGGCATCGACAAGAGTACGCTCGAATCCAACCTGTACCAGGTGCTCATCGGCGAGGCCGGTATCGAACAGACGCGCGTGCGTTCGGAGTCCGGCGGCTACGACGTATTGCCGGCCAACCGCGAACTGTCCGGCGCCGAGATCGACCTGGTGCAGATGGACGAGCGCGAGCGCCAGCTCAAGGCCGCCATCGACAAGATCGCCGGCGAATACGATTTCGTGCTGATCGATTGCCCGCCCACGCTGTCGCTGCTTACCCTTAACGGGCTGGCTGCCGCGCACGGCGTCATCATTCCGATGCAGTGCGAGTACTTTGCGCTCGAAGGCCTGTCCGACCTGGTAAACACCATCAAGCGCGTGCATCGCAATATCAACAACGAACTCCGTGTCATCGGTTTGTTGCGCGTGATGTTCGACCCGCGCATGACCTTGCAGCAGCAGGTGTCGGCCCAGCTCGAATCCCACTTCGGCGACAAGGTCTTCACCACGGTGGTGCCACGCAATGTGCGGTTGGCCGAGGCGCCCAGCTATGGCATGCCGGGCGTGGTGTATGACCGCGCGTCGCGCGGCGCGCAGGCCTATATTGCATTTGGCGCGGAAATGATAGAACGCGTCAAAGAGCTGGATTGA
NC_002929.2:2605-3403   ATGAAAAACATACCGCCCAGCAAGTCCGCCCGCGTGTTCTGCATCGCCAACCAGAAGGGCGGCGTCGGCAAGACCACCACCGCCATCAACCTTGCGGCTGGCCTGGCTACGCACAAGCAGCGGGTGCTGCTGGTCGATCTCGATCCGCAGGGCAACGCCACCATGGGCAGCGGCATCGACAAGAGTACGCTCGAATCCAACCTGTACCAGGTGCTCATCGGCGAGGCCGGTATCGAACAGACGCGCGTGCGTTCGGAGTCCGGCGGCTACGACGTATTGCCGGCCAACCGCGAACTGTCCGGCGCCGAGATCGACCTGGTGCAGATGGACGAGCGCGAGCGCCAGCTCAAGGCCGCCATCGACAAGATCGCCGGCGAATACGATTTCGTGCTGATCGATTGCCCGCCCACGCTGTCGCTGCTTACCCTTAACGGGCTGGCTGCCGCGCACGGCGTCATCATTCCGATGCAGTGCGAGTACTTTGCGCTCGAAGGCCTGTCCGACCTGGTAAACACCATCAAGCGCGTGCATCGCAATATCAACAACGAACTCCGTGTCATCGGTTTGTTGCGCGTGATGTTCGACCCGCGCATGACCTTGCAGCAGCAGGTGTCGGCCCAGCTCGAATCCCACTTCGGCGACAAGGTCTTCACCACGGTGGTGCCACGCAATGTGCGGTTGGCCGAGGCGCCCAGCTATGGCATGCCGGGCGTGGTGTATGACCGCGCGTCGCGCGGCGCGCAGGCCTATATTGCATTTGGCGCGGAAATGATAGAACGCGTCAAAGAGCTGGATTGA

This is exactly what I need to start my analysis, but I also need the gene names alongside the gene locations in the Testing file. Any help on how to do this?


Did you try the -name option of bedtools getfasta? If that gives you the gene name in the first column, you could then add the start and end positions of the genes to the bedtools output yourself with awk.

written 2.0 years ago by mike-zx210
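
One way to act on this suggestion, sketched under some assumptions: bedtools getfasta can take the FASTA header from the name column of a BED file, so the gene rows of the GFF could first be rewritten as BED with the gene name and its location in that column. The function name gff_genes_to_bed, the file name genes.bed, and the "name|chrom:start-end" label format are made up for illustration. The sketch also assumes a tab-separated GFF3 whose gene rows carry a Name= attribute, as in the excerpt from the question.

```shell
# Hypothetical helper: turn the "gene" rows of a GFF3 file (tab-separated)
# into 4-column BED. The name column is "<gene name>|<chrom>:<start>-<end>",
# with start and end kept as the original 1-based GFF coordinates.
gff_genes_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                       # skip header and pragma lines
         $3 == "gene" {
             name = "."                      # fallback when no Name= is present
             n = split($9, attrs, ";")
             for (i = 1; i <= n; i++)
                 if (attrs[i] ~ /^Name=/) name = substr(attrs[i], 6)
             # GFF is 1-based inclusive; BED is 0-based half-open, so start - 1
             print $1, $4 - 1, $5, name "|" $1 ":" $4 "-" $5
         }' "$@"
}

# Assumed usage with the files from the question (needs bedtools installed):
#   gff_genes_to_bed gff > genes.bed
#   bedtools getfasta -fi Fasta -bed genes.bed -name -tab -fo Testing
```

Depending on the bedtools version, -name may also append the BED coordinates to each header, so it is worth checking the first few lines of Testing before going further.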