Question: Get cds using bedtools
0
gravatar for User000
2.8 years ago by
User000270
User000270 wrote:

Hello, I have a gff3 file. I'm using the following command line to extract sequences for each CDS:

bedtools getfasta -fi pseudomolecules.fasta -bed all.gff3 -fo CDS.fa

chr1A   rnaseq  CDS 992  1945   .   +   0   mRNA_1
chr1B   rnaseq  CDS 72477  25295    .   +   0   mRNA_7
chr2D   rnaseq  CDS 13  725     .   +   0   mRNA_6

1)Is there a way to get the last lines (mRNA_1) added to the header as well? my header is >chr1A:992-1945 and my wanted header is >chr1A:992-1945:mRNA_1

2) Is this command line going to take me the coordinates from each chromosome, because I have one fasta file containing all chromosomes.

bedtools • 2.0k views
ADD COMMENTlink modified 21 months ago by molly7710 • written 2.8 years ago by User000270

Hi,

I am also trying to use bedtools to extract CDS from a genome fasta. However, I don't undertstand what the bed file is? Do I have to modify my gff file to only contain the hits of CDS?

ADD REPLYlink written 21 months ago by molly7710

Do I have to modify my gff file to extract only CDS from my genome fasta? The bedtools getfasta manual is not informative to me.

ADD REPLYlink written 21 months ago by molly7710

If you only need CDS then yes but it looks like @Dario's answer below is doing that already. Your gff file will need to have CDS entries in column 3 for that to work.

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax69k
3
gravatar for dariober
2.8 years ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

Here's my suggestion:

  • Convert gff to bed where the name field (4th column) has the information you want

  • Run getfasta with -name option to assign fasta names from the name field

Not tested:

awk -v OFS='\t' -v FS='\t' '$3 == "CDS" {print $1, $4-1, $5, $1 ":" $4 "-" $5 ":" $9}' all.gff3 \
| bedtools getfasta -name -fi pseudomolecules.fasta -bed - -fo CDS.fa

Turning gff to bed should be just a matter extracting columns 1, 4, 5 and subtracting 1 from the 4th column since bed is 0-based while gff is 1-based.

ADD COMMENTlink written 2.8 years ago by dariober10k

Seems to work, thanks! I have a Q, in the command line I used it give me headers like this i.e. for the 1st line (chr1A rnaseq CDS 992 1945 . + 0 mRNA_1) chr1A:991-1945 instead of 992, in your command line it is giving me chr1A:992-1945:mRNA_1. So it takes 1st nucleotide as 0, but in urs as 1, why is this difference? This is not so important as both cases are ok, I wanted just to understand.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by User000270
2

That difference is explained in his last sentence. See also: Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems

ADD REPLYlink written 2.7 years ago by WouterDeCoster40k

you are right! thanks

ADD REPLYlink written 2.7 years ago by User000270
2
gravatar for venu
2.8 years ago by
venu6.2k
Germany
venu6.2k wrote:

1.After generating fasta file with getfasta, create another file with only mRNA field, mRNA.txt.

Linearize fasta | paste - - | paste - mRNA.txt | awk -F'\t' '{print $1":"$3"\n"$2}' | fold -w 60 > newFasta.fa

2.Would you be more specific?

P.S: Linearize fasta

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by venu6.2k

with only mRNA field using getbasta? how to do it? or just using a simple list of mRNA's? could you explain better Linearize fasta?

ADD REPLYlink written 2.8 years ago by User000270
1

fasta lineraization A: Useful Bash Commands To Handle Fasta Files or here

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Medhat8.4k
0
gravatar for biocyberman
2.8 years ago by
biocyberman770
Denmark
biocyberman770 wrote:

1) Checkout the help for the command below. I think you can experiment with -name or -fullHeader to see if you get what you want.

bedtools getfasta -h                                                                               

Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.26.0
Summary: Extract DNA sequences from a fasta file based on feature coordinates.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

Options: 
        -fi     Input FASTA file
        -bed    BED/GFF/VCF file of ranges to extract from -fi
        -name   Use the name field for the FASTA header
        -split  given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
        -tab    Write output in TAB delimited format.
                - Default is FASTA format.

        -s      Force strandedness. If the feature occupies the antisense,
                strand, the sequence will be reverse complemented.
                - By default, strand information is ignored.

        -fullHeader     Use full fasta header.
                - By default, only the word before the first space or tab is used.

2) Not sure what you are asking, but the coordinates, for example chr1A:992-1945 is specific to chr1A.

ADD COMMENTlink written 2.8 years ago by biocyberman770

How can I experiment...? is not clear at all..

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by User000270
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1616 users visited in the last hour