OK guys, here is our problem:
We have a modified FASTA file: T-->U (RNA) in head.5.fa
To extract only the sequence we parsed this one liner and put output in new file:
perl -ne 'next if /^\>/; print if /^[AUGC]+/' head.5.fa > oryza_just_NTs_transcript
and now we parsed it to get ID, chromosome, Startpoint end Endpoint of sequence, Sequence:
perl -ne 'chomp;
($ID,$chrom,$start,$end,$strand) = ($1,$2,$3,$4,$5) if /^\>(\S+)\s+\w+\:.+\:\S+\:(\d+)\:(\d+)\:(\d+)\:(\S+)\s.*/;
$seq = $_ if /^[AUGC]+/;
print "$ID $chrom $start $end $strand $seq\n"' head.5.fa > fasta_id_chrom_start_end_strand
with RNAfold (a program looking for sub-elements in RNA to form secondary structures in the sequence) we parsed the file to extract this:
perl -ne '$ID=$1, next if /^\>(\S+)\s/;
($str,$mfe,$start,$z) = ($1,$2,$3,$4) if /^([.()]+)\s+\(\s?(\-\d+\.\d+)\)\s+(\d+)\s+z\=\s?(\-\d+\.\d+)/;
$length=length($str);
$end=$start+$length-1;
print "$ID $start $end $length $z $mfe\n"' head.9.d2.noLP.z.L150.lfold | less
so we got the start,end, length of sequence where secondary structures occur. but not the sequence itself what piece of code is necessary to obtain the sequence of the secondary structures?
I just want to make it clear: you have coordinates and want to get fasta from them?
exactly!
we have:
-) the modified FASTA file (contains only sequence)
-) the file from RNAfold (containing the coordinates)
we need the code to provide the right sequence for the corresponding coordinates ($start, $end, $length)
This is what I would do: use bedtools getfasta to extract sequence from "modified FASTA file":
1. Unmodify FASTA (create dummy identifier);
2. Create BED file from RNAfold output (add dummy identifier as
<chrom>
) -- you already have<start> <end>
;3. Use getfasta
The bedtools/getfasta support page tells me:
I think it should look like this
but it looks like this:
how do I get rid of the "crap information" that getfasta can read this?
you told me something about the dummy identifier, could you explain what you mean? Sorry, I'm quite new to this.
What you need:
FASTA file
BED file
Please read about these file types (FASTA, BED)
You should fix your example from
>chr1 AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
intoIf your identifier looks like this
>Syng_TIGR_043 dna:scaffold scaffold:IRGSP-1.0:Syng_TIGR_043:1:4236:1
you must have the same in the first BED file column.