Question: How To Extract Sequence From Overlapping Regions From Fasta Files With Perl
0
gravatar for speedgummi
7.0 years ago by
speedgummi0
speedgummi0 wrote:

ok guys,

here is our problem:

we have a modified FASTA file: T-->U (RNA) in head.5.fa

to extract only the sequence we parsed this one liner and put output in new file:

perl -ne 'next if /^\>/; print if /^[AUGC]+/' head.5.fa > oryza_just_NTs_transcript

and now we parsed it to get ID, chromosome, Startpoint end Endpoint of sequence, Sequence:

perl -ne 'chomp; ($ID,$chrom,$start,$end,$strand) = ($1,$2,$3,$4,$5) if /^\>(\S+)\s+\w+\:.+\:\S+\:(\d+)\:(\d+)\:(\d+)\:(\S+)\s.*/; $seq = $_ if /^[AUGC]+/; print "$ID $chrom $start $end $strand $seq\n"' head.5.fa > fasta_id_chrom_start_end_strand

with RNAfold ( a program looking for subelements in rna to form secondary structures in the sequence) we parsed the file to exract this:

perl -ne '$ID=$1, next if /^\>(\S+)\s/; ($str,$mfe,$start,$z) = ($1,$2,$3,$4) if /^([.()]+)\s+\(\s?(\-\d+\.\d+)\)\s+(\d+)\s+z\=\s?(\-\d+\.\d+)/; $length=length($str); $end=$start+$length-1; print "$ID $start $end $length $z $mfe\n"' head.9.d2.noLP.z.L150.lfold | less

so we got the start,end, length of sequence where secondary structures occur. but not the sequence itself what piece of code is nececessary to obtain the sequence of the secondary structures?

perl fasta • 2.7k views
ADD COMMENTlink modified 7.0 years ago by Daniel3.8k • written 7.0 years ago by speedgummi0

I just want to make it clear: you have coordinates and want to get fasta from them?

ADD REPLYlink written 7.0 years ago by PoGibas4.8k

exactly!

we have:

-) the modified FASTA file (contains only sequence)

-) the file from RNAfold (containing the coordinates)

we need the code to provide the right sequence for the corresponding coordinates ($start, $end, $length)

ADD REPLYlink modified 7.0 years ago • written 7.0 years ago by speedgummi0
2

This is what I would do: use bedtools getfasta to extract sequence from "modified FASTA file":
1. Unmodify FASTA (create dummy identifier);
2. Create BED file from RNAfold output (add dummy identifier as <chrom>) -- you already have <start> <end>;
3. Use getfasta

ADD REPLYlink modified 7.0 years ago • written 7.0 years ago by PoGibas4.8k

the bedtools/getfasta support page tells me:

"The headers in the input FASTA file must exactly match the chromosome column in the BED file."

I think it should look like this >chr1 AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

but it looks like this: >Syng_TIGR_043 dna:scaffold scaffold:IRGSP-1.0:Syng_TIGR_043:1:4236:1 GGATAGAGCTTCGGCACGTGCCCAGACGGGATAACGCGGTCGCCGACGAACTCTCAAGGC TCGCTTCCTCGCGAGCCCAGACCCCACCGGGCGCCTTTGAAGAAAGGCTTGCCCAGCCGT CGGCGCGACCCGAACCCCCTAGGGGAGACGGACGCGCCTGAACGGCCCCCGAGGCCCGTC

how do i get rid of the "crap information" that getfasta can read this?

you told me stgh about the dummy identifier.... could you explain what you mean? sorrry, i'm quite new to this...

ADD REPLYlink modified 7.0 years ago • written 7.0 years ago by speedgummi0
1

What you need:

  • FASTA file

    >whatever_identifier_you_prefer  
    ACGT
    
  • BED file

    whatever_identifier_you_prefer   1  10
    

Please read about these file types (FASTA, BED)

You should fix your example from >chr1 AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG into

>chr1   
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

If your identifier looks like this >Syng_TIGR_043 dna:scaffold scaffold:IRGSP-1.0:Syng_TIGR_043:1:4236:1 you must have the same in the first BED file column.

ADD REPLYlink modified 7.0 years ago • written 7.0 years ago by PoGibas4.8k
0
gravatar for Daniel
7.0 years ago by
Daniel3.8k
Cardiff University
Daniel3.8k wrote:

The website www.pythonforbiologists.com has a section on exactly this (ctrl+f down to the section "Extracting part of a string")

Example:

1 protein = "vlspadktnv"
2 # print positions three to five
3 print(protein[3:5])
4 # positions start at zero, not one
5 print(protein[0:6])

So what you want to do is loop over your fasta file and with your co-ordinates (untested puedo-code):

I can't think of fasta parsing off the top of my head so I'm imagining that your fasta file is laid out like so, but either parse properly or bioperl based on preference: 
HEADER    SEQUENCE. 
I'm imagining your co-ords file is as so:
HEADER    START    END    LENGTH

...
for a in fasta_file:
    fasta = split('\t', a)

    for b in coords_file:
        coords = split('\t', b)

        if (fasta[0] == coords[0]):
            cut_fasta = fasta[1]([coords[1]:coords[2])
            print fasta[0]\tcut_fasta

Something like that would do the job. It's probably not very efficient but it's simple.

ADD COMMENTlink written 7.0 years ago by Daniel3.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2505 users visited in the last hour
_