Question

How To Extract Sequence From Overlapping Regions From Fasta Files With Perl

0

Entering edit mode

11.4 years ago

speedgummi • 0

OK guys, here is our problem:

We have a modified FASTA file: T-->U (RNA) in head.5.fa

To extract only the sequence we parsed this one liner and put output in new file:

perl -ne 'next if /^\>/; print if /^[AUGC]+/' head.5.fa > oryza_just_NTs_transcript

and now we parsed it to get ID, chromosome, Startpoint end Endpoint of sequence, Sequence:

perl -ne 'chomp;
    ($ID,$chrom,$start,$end,$strand) = ($1,$2,$3,$4,$5) if /^\>(\S+)\s+\w+\:.+\:\S+\:(\d+)\:(\d+)\:(\d+)\:(\S+)\s.*/;
    $seq = $_ if /^[AUGC]+/;
    print "$ID $chrom $start $end $strand $seq\n"' head.5.fa > fasta_id_chrom_start_end_strand

with RNAfold (a program looking for sub-elements in RNA to form secondary structures in the sequence) we parsed the file to extract this:

perl -ne '$ID=$1, next if /^\>(\S+)\s/;
    ($str,$mfe,$start,$z) = ($1,$2,$3,$4) if /^([.()]+)\s+\(\s?(\-\d+\.\d+)\)\s+(\d+)\s+z\=\s?(\-\d+\.\d+)/;
    $length=length($str);
    $end=$start+$length-1;
    print "$ID $start $end $length $z $mfe\n"' head.9.d2.noLP.z.L150.lfold | less

so we got the start,end, length of sequence where secondary structures occur. but not the sequence itself what piece of code is necessary to obtain the sequence of the secondary structures?

fasta perl • 4.0k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 11.4 years ago by speedgummi • 0

0

Entering edit mode

I just want to make it clear: you have coordinates and want to get fasta from them?

ADD REPLY • link 11.4 years ago by PoGibas 5.1k

0

Entering edit mode

exactly!

we have:

-) the modified FASTA file (contains only sequence)

-) the file from RNAfold (containing the coordinates)

we need the code to provide the right sequence for the corresponding coordinates ($start, $end, $length)

ADD REPLY • link 11.4 years ago by speedgummi • 0

2

Entering edit mode

This is what I would do: use bedtools getfasta to extract sequence from "modified FASTA file":
1. Unmodify FASTA (create dummy identifier);
2. Create BED file from RNAfold output (add dummy identifier as <chrom>) -- you already have <start> <end>;
3. Use getfasta

ADD REPLY • link 11.4 years ago by PoGibas 5.1k

0

Entering edit mode

The bedtools/getfasta support page tells me:

The headers in the input FASTA file must exactly match the chromosome column in the BED file.

I think it should look like this

>chr1 
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

but it looks like this:

>Syng_TIGR_043 dna:scaffold scaffold:IRGSP-1.0:Syng_TIGR_043:1:4236:1
GGATAGAGCTTCGGCACGTGCCCAGACGGGATAACGCGGTCGCCGACGAACTCTCAAGGC
TCGCTTCCTCGCGAGCCCAGACCCCACCGGGCGCCTTTGAAGAAAGGCTTGCCCAGCCGT
CGGCGCGACCCGAACCCCCTAGGGGAGACGGACGCGCCTGAACGGCCCCCGAGGCCCGTC

how do I get rid of the "crap information" that getfasta can read this?

you told me something about the dummy identifier, could you explain what you mean? Sorry, I'm quite new to this.

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 11.4 years ago by speedgummi • 0

1

Entering edit mode

What you need:

FASTA file
```
>whatever_identifier_you_prefer  
ACGT
```
BED file
```
whatever_identifier_you_prefer   1  10
```

Please read about these file types (FASTA, BED)

You should fix your example from >chr1 AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG into

>chr1   
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

If your identifier looks like this >Syng_TIGR_043 dna:scaffold scaffold:IRGSP-1.0:Syng_TIGR_043:1:4236:1 you must have the same in the first BED file column.

ADD REPLY • link 11.4 years ago by PoGibas 5.1k

Ram · Answer 1 · 2014-01-31

The website www.pythonforbiologists.com has a section on exactly this (ctrl+f down to the section "Extracting part of a string")

Example:

1 protein = "vlspadktnv"
2 # print positions three to five
3 print(protein[3:5])
4 # positions start at zero, not one
5 print(protein[0:6])

So what you want to do is loop over your fasta file and with your co-ordinates (untested puedo-code):

I can't think of fasta parsing off the top of my head so I'm imagining that your fasta file is laid out like so, but either parse properly or bioperl based on preference:

HEADER    SEQUENCE.

I'm imagining your co-ords file is as so:

HEADER    START    END    LENGTH

..
for a in fasta_file:
    fasta = split('\t', a)

    for b in coords_file:
        coords = split('\t', b)

        if (fasta[0] == coords[0]):
            cut_fasta = fasta[1]([coords[1]:coords[2])
            print fasta[0]\tcut_fasta

Something like that would do the job. It's probably not very efficient but it's simple.