Question: Extract specific genes from a FASTA file.
0
gravatar for alexandru.bologa.marian
11 months ago by

Hello,

I want to extract a specific gene sequence from a multi-FASTA file using bash commands like awk,sed,grep etc. My FASTA file (3Lchr.fasta) comes from FlyBase and every gene in the chromosome has the following annotation type:

> FBgn0010431 type=gene; loc=3L:8406112..8407203; ID=FBgn0010431; name=mtrm; dbxref=FlyBase:FBgn0010431,FlyBase:FBan0018543,FlyBase_Annotation_IDs:CG18543,GB_protein:AAF50422,GB:AI546359,GB:AW943933,GB:AY061524,GB_protein:AAL29072,GB:U03288,GB_protein:AAA03463,UniProt/Swiss-Prot:Q23973,INTERPRO:IPR001660,INTERPRO:IPR011510,OrthoDB7_Drosophila:EOG7MHCJK,EntrezGene:38958,INTERPRO:IPR013761,BDGP_clone:LD47919,InterologFinder:38958,BIOGRID:64371,DPIM:FBgn0010431,DroID:FBgn0010431,DRSC:FBgn0010431,FLIGHT:FBgn0010431,FlyAtlas:CG18543-RA;CG7127-RA,FlyMine:FBgn0010431,Fly-FISH:CG18543,FLYGUT:FBgn0010431,GenomeRNAi:38958,INTERACTIVEFLY:/genebrief/matrimony.htm,modMine:FBgn0010431; derived_computed_cyto=66C11-66C11%3B Limits computationally determined from genome sequence between @P{lacW}l(3)L0139<up>L0139</up>@ and @P{PZ}Gug<up>03928</up>@%26@P{EP}CG6745<up>EP595b</up>@; gbunit=AE014296; MD5=ed5a26fca514a130e2b45aa01727082c; length=1092; release=r5.57; species=Dmel; 

GCCATTTGCCAGCTTGAATTCGCAAAGAACGCATCATTTTCTGTGTCACA
GTATCTAAAAAAAATGGAGAATTCTCGCACGCCCACGAACAAGACCAAAA
TTACGCTTAATCGCACGCCAACGCTAAAGGAGCGCAGATGGAACACCCTG
AAGGTGAACACCTCCAACGTGCGATGCTCTACTCCGATCTTTGGCAACTT
CCGTTCGCCCAATCTCTCGCCCATCGAGAATATGGGCACGAAGGGGAAGA
GTCCAGTGTCGCCCATGCGGTTCGCTACCTTCAAGAAAGTGCCAACGAAG
GTGCATCCCAAGCAGCAGCAGCAGCAGCAGCATCAGCACTGCCATCGCAC
TCAGCTTAAGCCCCCGCCATTCGTGCTGCCCAAGCCGCAGGAGGAGATCA
TCGAGCCGGAGCGAGAAATAAAGAGCTGCAGCAGCCCGGATACCTGTTCG
GATGACTCGAATATGGAGACCTCACTGGCCTTGGAGTCGCGTCGTCGTTC
CATCAAAGCATCGAACCACTCGTACGTGGTTAACCATGCCGCCAATGTGG
AACAGATTCTCATGCACATGGGCCTGGAGAACTATGTGACCAATTTCGAA
GAGGCTCACATCGATCTGGTGGAACTGGCATCCTTGGAGCGTGCTGATCT
TGTTAAAATCGGCCTAAATACCGATGAGGATTGCAACCGTATCATGGATG
TGCTCCACACTCTTTAAGTGAATCGCTGGCATGTCCATCTTTTATCATAC
ACTAAGATTTTAAAATGTTTTATGTGAAGAGAAACTTTGTGATTTTAATA
TATGTGGATAATTAAGTAGGTATTCTAGAATTATTGTAAAAGACCTGGTG
TCTTAAAGCTCGTCGTTTTTAGCCTTAAATAATAAACTCTAGTACACTTT
AAGAATCTATGTAATTCCTAAAATGTTGCTCCGAAGCTCATTTACAAATT
TAAATTTATTTCGTAAGCTAGCAAATTGCATAATACTTTTTGTCAAAAAA
AAAAAAAAAAATTGGGGTATAAACGCTTTTGTAAAACACACATTTTTAAA
ATTAATAAACTCATAAGAATCAAATGCATATTAAAAGGTTAC

I want to extract genes by gene names, in this case the name of the gene is mtrm. Reading some related posts, I managed to successfully extract the gene by its name in 2 steps. Firstly, I edited each of the headliners of the multi FASTA with the following pipeline:

awk -F '[/^>;,]' 'NF>1{print ">" $5} {print $1}' 3Lchr.fasta | awk NF | sed 's/ name=//' > 3Lchr_edit.fasta

Thus, each headline remains only with gene name, in my example >mtrm. This works only for FlyBase annotation type, and if you want to use it for a different annotation, you have to adjust the pipe for your case. Now, all the genes in the FASTA look like this:

>mtrm

GCCATTTGCCAGCTTGAATTCGCAAAGAACGCATCATTTTCTGTGTCACA
GTATCTAAAAAAAATGGAGAATTCTCGCACGCCCACGAACAAGACCAAAA
TTACGCTTAATCGCACGCCAACGCTAAAGGAGCGCAGATGGAACACCCTG

Now, I can extract any gene by its gene name and save it in another FASTA with awk:

awk -v seq="mtrm" -v RS='>' '$1 == seq {print RS $0}' 3Lchr_edit.fasta > mtrm.fasta

I hope these commands helps some of you because for me it was pretty hard to find a solution. If anyone knows/finds an easier or simplified solution in bash, I would appreciate sharing.

bash sequence tutorial fasta gene • 633 views
ADD COMMENTlink modified 11 months ago by Asaf8.0k • written 11 months ago by alexandru.bologa.marian0

crossposted on BIoinfomatics.SE

ADD REPLYlink modified 11 months ago • written 11 months ago by kamiljaron140
3
gravatar for brian.fristensky
11 months ago by
Canada/University of Manitoba
brian.fristensky90 wrote:

A more general solution is to use by Wei Shen. SeqKit is a program that does quite a large number of convenient things with fasta/fastq files, and is efficient for big files.

Example: If I have a fasta file containing chitinase sequences and I want to extract sequences whose names contain the regular expression AHCHIT.*, the command would be

seqkit grep -r -p "AHCHIT.*" chitinase.wrp
>AHCHIT - A.hypogaea mRNA for chitinase (chit 2).
atctcgttcaagtcggcgctctggttgtggatgacagagcagaaaccaaaaccttcttgc
cacaacgtcatggttgggaattacgtgccaacagcatctgatagagcagcaaatagaacc
ttagggtttgggttggttacgaacatcatcaacggcggcctggactgcgg
>AHCHIT1A - A.hypogaea mRNA for chitinase (chit 1A).
atctcgttcaagtcggcgctctggttgtggatgacagcacaaggaaacaagccatcaagc
catgatgtcatcaccggaagatggacaccatcggccgcggacagggcggccggccgagtc
tccggatttggagtgatcacgaacatcatcaacggcggcctggactgcgg
>AHCHIT4 - A.hypogaea mRNA for chitinase (chit 4).
atctcgttcaagtcggcgctctggttgtggatgacaccacaagggaataagccctcgtgc
cacgacgtcatcaccaatgcatggaggccaaccgccactgactcggcggcaggtcgagcc
ccaggctacggtgtcatcacgaacatcatcaacggcggcctggactgcgg

SeqKit is also included as part of the BIRCH system, which ties together hundreds of popular bioinformatics programs in an intuitive graphic interface. An example using seqkit stat to generate statistics from NGS read files can be seen in the video https://www.youtube.com/watch?v=56T05sOcODI

ADD COMMENTlink modified 11 months ago by RamRS27k • written 11 months ago by brian.fristensky90

That website looks stuck in the early 90s with its glaring colors and random odd boxes. You seem to be associated with the tool, maybe hire a web designer to clean up the website a little?

ADD REPLYlink written 11 months ago by RamRS27k

I agree, the site uses pretty basic HTML. In fact, if I had it in the budget, I would probably do exactly that. The ToDo list is very long, and like a lot of things in bioinformatics, new features and bug fixes tend to get priority.

ADD REPLYlink written 11 months ago by brian.fristensky90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 785 users visited in the last hour