Question: How to recompute coding sequences from a psl format ?
0
gravatar for Thomas B.
22 months ago by
Thomas B.10
Thomas B.10 wrote:

Hi everyone,

I mapped coding sequences (CDSs) onto a genome using GMAP and got a report in psl format.

Is there a readily available tool / way to extract the genome sequences that matched - I mean those that correspond to blocks - then recompute coding sequences ?

cds gmap sequence alignment genome • 758 views
ADD COMMENTlink modified 15 months ago by Biostar ♦♦ 20 • written 22 months ago by Thomas B.10
1

You could convert your PSL file to BED using BEDOPS: http://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/psl2bed.html

Then, convert the BED to GTF or GFF and extract FASTA sequence over these regions using gffread: http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread

ADD REPLYlink written 22 months ago by Kevin Blighe47k

Hi Kevin,

Thanks for your reply!

I converted my PSL file to BED using psl2bed as you suggested.

psl2bed < my_report.psl > my_report.bed

I then used the online bed_to_gff_converter tool from Galaxy to convert the BED file to GFF.

I finally used gffread to extract FASTA sequences as you also suggested.

./gffread my_report.gff -g my_ref_genome.fa -w my_transcripts.fa

But there is no sequence in the final output file.

Do you know what may be wrong? I would also like to ask you how you convert BED to GTF or GFF?

ADD REPLYlink written 22 months ago by Thomas B.10

Hey Thomas,

Have you looked at the contents of your output files at each step in order to see if the format is okay? The BED-to-GTF format conversion is difficult, but you may want to take a look here: How To Convert Bed Format To Gtf?

Here too: http://onetipperday.sterding.com/2012/08/convert-bed-to-gtf.html

Kevin

ADD REPLYlink written 22 months ago by Kevin Blighe47k
1

Oh wait, there is a nice --exons=genomic option in GMAP that does what I want. However the output should be processed (something like cat my_report.txt | grep -v '<' > my_report.fasta) in order to obtain a true FASTA file. It seems also that if there are multiple matches for a same query, the default behaviour is to keep only the best one.

Now back to what you suggested. It is also possible to convert a BED file directly into FASTA using bedtools: http://bedtools.readthedocs.io/en/latest/

My command was bedtools getfasta -fi reference_genome.fasta -bed my_report.bed -split -fo my_report.fasta. The -split option aims to extract and concatenate the sequences from the BED blocks. Here each match is retained, even multiple matches for a same query.

Thanks for your help!

ADD REPLYlink written 22 months ago by Thomas B.10

Hi Thomas, great that you got it all sorted out.

ADD REPLYlink written 22 months ago by Kevin Blighe47k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1555 users visited in the last hour