Question: Extract specific genes from a FASTA file.
0
gravatar for alexandru.bologa.marian
5 weeks ago by

Hello,

I want to extract a specific gene sequence from a multi-FASTA file using bash commands like awk,sed,grep etc. My FASTA file (3Lchr.fasta) comes from FlyBase and every gene in the chromosome has the following annotation type:

> FBgn0010431 type=gene; loc=3L:8406112..8407203; ID=FBgn0010431; name=mtrm; dbxref=FlyBase:FBgn0010431,FlyBase:FBan0018543,FlyBase_Annotation_IDs:CG18543,GB_protein:AAF50422,GB:AI546359,GB:AW943933,GB:AY061524,GB_protein:AAL29072,GB:U03288,GB_protein:AAA03463,UniProt/Swiss-Prot:Q23973,INTERPRO:IPR001660,INTERPRO:IPR011510,OrthoDB7_Drosophila:EOG7MHCJK,EntrezGene:38958,INTERPRO:IPR013761,BDGP_clone:LD47919,InterologFinder:38958,BIOGRID:64371,DPIM:FBgn0010431,DroID:FBgn0010431,DRSC:FBgn0010431,FLIGHT:FBgn0010431,FlyAtlas:CG18543-RA;CG7127-RA,FlyMine:FBgn0010431,Fly-FISH:CG18543,FLYGUT:FBgn0010431,GenomeRNAi:38958,INTERACTIVEFLY:/genebrief/matrimony.htm,modMine:FBgn0010431; derived_computed_cyto=66C11-66C11%3B Limits computationally determined from genome sequence between @P{lacW}l(3)L0139<up>L0139</up>@ and @P{PZ}Gug<up>03928</up>@%26@P{EP}CG6745<up>EP595b</up>@; gbunit=AE014296; MD5=ed5a26fca514a130e2b45aa01727082c; length=1092; release=r5.57; species=Dmel; 

GCCATTTGCCAGCTTGAATTCGCAAAGAACGCATCATTTTCTGTGTCACA
GTATCTAAAAAAAATGGAGAATTCTCGCACGCCCACGAACAAGACCAAAA
TTACGCTTAATCGCACGCCAACGCTAAAGGAGCGCAGATGGAACACCCTG
AAGGTGAACACCTCCAACGTGCGATGCTCTACTCCGATCTTTGGCAACTT
CCGTTCGCCCAATCTCTCGCCCATCGAGAATATGGGCACGAAGGGGAAGA
GTCCAGTGTCGCCCATGCGGTTCGCTACCTTCAAGAAAGTGCCAACGAAG
GTGCATCCCAAGCAGCAGCAGCAGCAGCAGCATCAGCACTGCCATCGCAC
TCAGCTTAAGCCCCCGCCATTCGTGCTGCCCAAGCCGCAGGAGGAGATCA
TCGAGCCGGAGCGAGAAATAAAGAGCTGCAGCAGCCCGGATACCTGTTCG
GATGACTCGAATATGGAGACCTCACTGGCCTTGGAGTCGCGTCGTCGTTC
CATCAAAGCATCGAACCACTCGTACGTGGTTAACCATGCCGCCAATGTGG
AACAGATTCTCATGCACATGGGCCTGGAGAACTATGTGACCAATTTCGAA
GAGGCTCACATCGATCTGGTGGAACTGGCATCCTTGGAGCGTGCTGATCT
TGTTAAAATCGGCCTAAATACCGATGAGGATTGCAACCGTATCATGGATG
TGCTCCACACTCTTTAAGTGAATCGCTGGCATGTCCATCTTTTATCATAC
ACTAAGATTTTAAAATGTTTTATGTGAAGAGAAACTTTGTGATTTTAATA
TATGTGGATAATTAAGTAGGTATTCTAGAATTATTGTAAAAGACCTGGTG
TCTTAAAGCTCGTCGTTTTTAGCCTTAAATAATAAACTCTAGTACACTTT
AAGAATCTATGTAATTCCTAAAATGTTGCTCCGAAGCTCATTTACAAATT
TAAATTTATTTCGTAAGCTAGCAAATTGCATAATACTTTTTGTCAAAAAA
AAAAAAAAAAATTGGGGTATAAACGCTTTTGTAAAACACACATTTTTAAA
ATTAATAAACTCATAAGAATCAAATGCATATTAAAAGGTTAC

I want to extract genes by gene names, in this case the name of the gene is mtrm. Reading some related posts, I managed to successfully extract the gene by its name in 2 steps. Firstly, I edited each of the headliners of the multi FASTA with the following pipeline:

awk -F '[/^>;,]' 'NF>1{print ">" $5} {print $1}' 3Lchr.fasta | awk NF | sed 's/ name=//' > 3Lchr_edit.fasta

Thus, each headline remains only with gene name, in my example >mtrm. This works only for FlyBase annotation type, and if you want to use it for a different annotation, you have to adjust the pipe for your case. Now, all the genes in the FASTA look like this:

>mtrm

GCCATTTGCCAGCTTGAATTCGCAAAGAACGCATCATTTTCTGTGTCACA
GTATCTAAAAAAAATGGAGAATTCTCGCACGCCCACGAACAAGACCAAAA
TTACGCTTAATCGCACGCCAACGCTAAAGGAGCGCAGATGGAACACCCTG

Now, I can extract any gene by its gene name and save it in another FASTA with awk:

awk -v seq="mtrm" -v RS='>' '$1 == seq {print RS $0}' 3Lchr_edit.fasta > mtrm.fasta

I hope these commands helps some of you because for me it was pretty hard to find a solution. If anyone knows/finds an easier or simplified solution in bash, I would appreciate sharing.

bash sequence tutorial fasta gene • 214 views
ADD COMMENTlink modified 4 weeks ago by Asaf6.1k • written 5 weeks ago by alexandru.bologa.marian0

crossposted on BIoinfomatics.SE

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by kamiljaron120
3
gravatar for brian.fristensky
5 weeks ago by
Canada/University of Manitoba
brian.fristensky80 wrote:

A more general solution is to use by Wei Shen. SeqKit is a program that does quite a large number of convenient things with fasta/fastq files, and is efficient for big files.

Example: If I have a fasta file containing chitinase sequences and I want to extract sequences whose names contain the regular expression AHCHIT.*, the command would be

seqkit grep -r -p "AHCHIT.*" chitinase.wrp
>AHCHIT - A.hypogaea mRNA for chitinase (chit 2).
atctcgttcaagtcggcgctctggttgtggatgacagagcagaaaccaaaaccttcttgc
cacaacgtcatggttgggaattacgtgccaacagcatctgatagagcagcaaatagaacc
ttagggtttgggttggttacgaacatcatcaacggcggcctggactgcgg
>AHCHIT1A - A.hypogaea mRNA for chitinase (chit 1A).
atctcgttcaagtcggcgctctggttgtggatgacagcacaaggaaacaagccatcaagc
catgatgtcatcaccggaagatggacaccatcggccgcggacagggcggccggccgagtc
tccggatttggagtgatcacgaacatcatcaacggcggcctggactgcgg
>AHCHIT4 - A.hypogaea mRNA for chitinase (chit 4).
atctcgttcaagtcggcgctctggttgtggatgacaccacaagggaataagccctcgtgc
cacgacgtcatcaccaatgcatggaggccaaccgccactgactcggcggcaggtcgagcc
ccaggctacggtgtcatcacgaacatcatcaacggcggcctggactgcgg

SeqKit is also included as part of the BIRCH system, which ties together hundreds of popular bioinformatics programs in an intuitive graphic interface. An example using seqkit stat to generate statistics from NGS read files can be seen in the video https://www.youtube.com/watch?v=56T05sOcODI

ADD COMMENTlink modified 5 weeks ago by RamRS23k • written 5 weeks ago by brian.fristensky80

That website looks stuck in the early 90s with its glaring colors and random odd boxes. You seem to be associated with the tool, maybe hire a web designer to clean up the website a little?

ADD REPLYlink written 4 weeks ago by RamRS23k

I agree, the site uses pretty basic HTML. In fact, if I had it in the budget, I would probably do exactly that. The ToDo list is very long, and like a lot of things in bioinformatics, new features and bug fixes tend to get priority.

ADD REPLYlink written 4 weeks ago by brian.fristensky80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1690 users visited in the last hour