Question: bedtools getfasta (how to get only cds coding sequence)
0
gravatar for benjamin
12 months ago by
benjamin0
benjamin0 wrote:

Hi all, I'm actually using bedtools to extracte my cds in fasta format from a gff3 file. The command is : bedtools getfasta -fi scaf0_test.fa -bed run_augustus_0035.gff.out -fo fasta.out -s

But the problem is that it gives me not only the coding part (without stop codon inside but also the non-coding part)

For exemple, here is my protein sequence from de gff file:

MIAALVPREPPVGPPGRRAEGRGGIHCDATNNHTYTSPHIHSRHVHAVADRMSTLGKNLIGAPWRGVTTRGTTVSSVL
RAILQDSDKPESEHSGHACEQCGKLYKTVRTLYSHIMLKHPKDKEEVQCSVCDKKYSSALSLQKHVRYMHRYEHRCK
TCYRTFATSEMLVCHRESCYNNVSPCPVCNKIFDSRLALRNHINYNHPRSEESSVQERRQCNVCSRMFTSSRSLLNHM
AAIHPVGTTDCNLCGRTFNSMPAFRSHFLYKHGDHGVHCTKCHKLFATDTSLRRHMEKVHGKNNKPGFLCGICQTYFYK
SSDLVQHILANHKETSSD

And here is my dna sequence comming from the outfile (traducted):

   MIAALVPREPPVGPPGRRAEGRGGIHCDATNNHTYTSPHIHSRHVHAVADRMSTLGKNLIGAPWRGVTTRGTTVR*YVTNTG
   VDSEEIKDRSCKHLQYRAPLYARA*FALIRHCADATTLPNITAFGIVTLPV**YKIN*THYCGVICAQS*TALHVMTC*RDECFRFQI
    RRSLTLITMVVLIIMRTPRTIYT*TNPRNHILSP*SLRCFLTYIHVYKIYKNRSRTIRRGQVRSQYIECYIYTL*HPCLQSLSNCMRQ
    YLCYRSQKQVCIFIEDPACLSETAIFRIRCKRKRHIFTQIFTSNLLNLPTLKFQRNKYPFKFIYF**ISTNLFNCKLSTKISCKKKEIF
    FSSSSFFFSTDIYQLRRVCTYITENETRKDVNIINTLQNEC*NRYC*YRSKHSSVGN*YR*GTFAFFQFRIARHSTRLRQTGERA
    FRARL*AMRQTL*NSEDALFAYHAETPQR*RGGAVFRLR*KVLVGTESTETRALHAPL*APM*NVLQDFCNV*NASLSQGKLLQ
    QRVPMPRVQQDI*LPTCTSESYKL*SSKK*RKFGSRKEAMQRL*PHVY*FS*PPQPHGGYTPGRHD*LQFVRPHF*FHASLSKP
    FPLQAR*PRRTLHKVS*IIRNRYKSSPAHGESTRQK**TRFLVRYMPNLFL*IE*PGAAHIGKP*GDIE*L

As you can see, there are a lot of stop codon, do you know why I do not get the exact same protein sequence? The fasta dna sequence should contain two cds concatened (one with 74 bases and the other should have 256), then the total length of this sequence should be 330.

Here is an exemple of my gff file:

# gffread v0.9.9
##gff-version 3
scaffold_0  AUGUSTUS    mRNA    42655   44668   .   -   .   ID=g1.t1;geneID=g1
scaffold_0  AUGUSTUS    CDS 42655   43423   0.94    -   1   Parent=g1.t1
scaffold_0  AUGUSTUS    CDS 44445   44668   0.82    -   0   Parent=g1.t1
scaffold_0  AUGUSTUS    mRNA    51102   55274   .   +   .   ID=g2.t1;geneID=g2
scaffold_0  AUGUSTUS    CDS 51102   51192   0.60    +   0   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 51310   51528   0.80    +   2   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 52816   52845   0.64    +   2   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 53114   53223   0.50    +   2   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 53333   53633   0.91    +   0   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 53981   54296   0.64    +   2   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 54559   54581   0.94    +   1   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 54783   54975   0.89    +   2   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 55184   55274   0.66    +   1   Parent=g2.t1
gff fasta mac • 846 views
ADD COMMENTlink modified 12 months ago • written 12 months ago by benjamin0

Have you tried gffread - part of gff utilities (http://ccb.jhu.edu/software/stringtie/gff.shtml)?

ADD REPLYlink written 12 months ago by jean.elbers980

Hi, thanks for you help. How could i run bedtools geftasta with a cds entrie? I cannot find this argument on the commande line?

ADD REPLYlink written 12 months ago by benjamin0

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLYlink written 12 months ago by genomax65k

Ok, it is done, thank you.

ADD REPLYlink written 12 months ago by benjamin0
1
gravatar for Carlo Yague
12 months ago by
Carlo Yague4.4k
Belgium
Carlo Yague4.4k wrote:

I think that bedtools getfasta will not parse your gff3 file based on [cds -> mRNA] relationship. Therefore you get either the sequence cds by cds, or from start to end of the mRNA (including introns). The frame shift caused by the introns would explain the number of stop codons you find.

This would make sense based on the example you provide:

Length of g1.t1 first intron = 44668-44445 = 223
number of a.a encode in the first intron = 223/3 ~= 74
number of consequently matching a.a bewteen theoretical sequence and the outfile = 75 a.a
( MIAALVPREPPVGPPGRRAEGRGGIHCDATNNHTYTSPHIHSRHVHAVADRMSTLGKNLIGAPWRGVTTRGTTV )

A solution would be to run bedtools getfasta your cds entries only, then to merge them by ID.

ADD COMMENTlink written 12 months ago by Carlo Yague4.4k

Hi, thanks for you help. How could i run bedtools geftasta with a cds entrie? I cannot find this argument on the commande line?

ADD REPLYlink written 12 months ago by benjamin0

As @Carlo said you will have to get CDS entries individually and then merge them for individual ID's. But the easier option would likely be using the gffread utility that was posted above by @jean.elbers.

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax65k

just create the file yourself. Something like:

awk '$3 == "CDS" {print $0}'  run_augustus_0035.gff.out > run_augustus_0035_cds_only.gff
ADD REPLYlink modified 12 months ago • written 12 months ago by Carlo Yague4.4k

Thanks you, it works now but now I have another issue beacause it works well for my sequences starting by ATG (0 reading frame), but when I have a CDS which has a reading frame 1 or 2, my sequences are not in the good reading frame and I have to delete 2 or 1 base for each sequence. Does someone have an idea how I could from my gff file, see the readind frame and if it is 1, delete one base or 2 delete 2 bases to get the correct amino acide sequence when I translate the dna sequence?

ADD REPLYlink modified 12 months ago • written 12 months ago by benjamin0

You need to merge the DNA sequence of your cds before translating it, it will fix the reading phase problem.

ADD REPLYlink written 12 months ago by Carlo Yague4.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1048 users visited in the last hour