Question: Matching and extracting contents of two NCBI genome files
7.6 years ago by
Kiran70
Bangalore
Kiran70 wrote:

Hello friends, I am new to Perl programming and still need to practice regular expressions and NCBI file handling, but I have a task to do. I have done half of it; could anybody help with the rest? File 1:

Candida glabrata CBS 138 chromosome D, complete genome - 1..651701
283 proteins
Location    Strand    Length    PID        Gene    Synonym        Code    COG Product
17042..17914    -    290    50285983    -    CAGL0D00154g    -    -               
23693..25075    +    460    50285985    -    CAGL0D00176g    -    -    
27559..28710    +    383    50285987    -    CAGL0D00198g    -    -    
29345..29914    +    189    50285989    -    CAGL0D00220g    -    -

And so on, for 40 lines.

File 2 contains:

>ref|NC_006027.1|:c17914-17042 hypothetical protein [Candida glabrata CBS 138]
ATGGAAACAGAACATCAGGCAGACAAAAATGCGGAATTGGGTTATGACAGTGGATCAACCGTTGCTCCCC
CCAATAAATATAGTACATTACGCTCTAGGTTCAATTTAGGACCTGACACTATGAGAAATCATGTTATTGC
CTTTTTTGGGGAGTTGGTTGGCACATTCATGTTTTTATGGTGTGCCTATGTTATTGCAAATATTGCAAAT

>ref|NC_006027.1|:23693-25075 hypothetical protein [Candida glabrata CBS 138]
ATGTCTTCTCAAGTTAACGAACCAGAATTTCAACAAGCTTACCACGAAGTTGTTTCCTCTTTGAAGGACT
CTTCTTTGTTCGAAAAGCACCCAAAATATGCTAAGGTTCTTCCAGTTGTCTCTGTCCCAGAGAGAATCAT

And so on. The number of locations in File 1 is equal to the number of sequences in File 2.

Here is what I have to do: if a location in File 1, e.g. "17042..17914", matches the header of a record in File 2, e.g. "c17914-17042" (a match on either the upper or the lower limit is enough),

then the FASTA header in File 2 should be removed and replaced with ">CAGL0D00154g", i.e. the value in the Synonym column of File 1 for the corresponding location.

My output file should then look as follows:

File 3:

>CAGL0D00154g
ATGGAAACAGAACATCAGGCAGACAAAAATGCGGAATTGGGTTATGACAGTGGATCAACCGTTGCTCCCC
CCAATAAATATAGTACATTACGCTCTAGGTTCAATTTAGGACCTGACACTATGAGAAATCATGTTATTGC
CTTTTTTGGGGAGTTGGTTGGCACATTCATGTTTTTATGGTGTGCCTATGTTATTGCAAATATTGCAAAT

>CAGL0D00176g
ATGTCTTCTCAAGTTAACGAACCAGAATTTCAACAAGCTTACCACGAAGTTGTTTCCTCTTTGAAGGACT
CTTCTTTGTTCGAAAAGCACCCAAAATATGCTAAGGTTCTTCCAGTTGTCTCTGTCCCAGAGAGAATCAT

Here is what I have done:

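foreach my $line (@File1) {
    chomp($line);
    # Columns: Location, Strand, Length, PID, Gene, Synonym
    my ($f1, $f2, $f3, $f4, $f5, $f6) = split(/\t+/, $line);
    push(@F1, $f1);
    push(@F2, $f2);
    # ... and so on for the remaining columns
    push(@F6, $f6);
}

@F1 contains the Location column (17042..17914, ...) and @F6 contains the Synonym column (CAGL0D00176g, ...).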

In the same way, I collected all the upper limits of the locations in File 2 (17914, 25075, ...) into @B using:

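foreach my $line (@File2) {
    chomp $line;
    # Headers look like ">ref|NC_006027.1|:c17914-17042 ..."; a leading "c" marks the minus strand
    if ($line =~ /^>\S*:c?(\d+)-(\d+)/) {
        # In "c" headers the larger coordinate comes first, so keep the maximum
        push(@B, $1 > $2 ? $1 : $2);
    }
}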

So could anybody help me with, or write, code to get the output I specified above?

Looking forward to your code.

Thank you.

ncbi genome file • 1.5k views
ADD COMMENTlink modified 7.6 years ago by ALchEmiXt1.9k • written 7.6 years ago by Kiran70

Please can you reformat this question to make it readable. As it stands, no-one is likely to answer because it is almost unintelligible.

ADD REPLYlink written 7.6 years ago by Neilfws48k

...and improve the spelling, grammar, etc...

ADD REPLYlink written 7.6 years ago by Casey Bergman18k
7.6 years ago by
ALchEmiXt1.9k
The Netherlands
ALchEmiXt1.9k wrote:

You do not specify any file sizes, but if File 2 stays fairly small (or you have a 64-bit machine and lots of memory), I would suggest the easy solution:

Just load File 2 into a hash where the key is the FASTA header (either the original header or a processed one, such as the lookup fragment you captured). This allows an easy lookup in the hash for any value from File 1, and you can then write out the multi-FASTA with the modified headers quite easily on the fly.
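A minimal sketch of that hash-based idea in Perl, for illustration only. It assumes that the File 2 headers always carry coordinates in the form ":start-end" or ":cend-start", that the Synonym is the sixth whitespace-separated column of File 1, and that both files fit in memory. The subroutine names (header_to_key, index_fasta, relabel) are made up for this example, and the sequences in the demo data are shortened.

```perl
use strict;
use warnings;

# Turn a File 2 header such as ">ref|NC_006027.1|:c17914-17042 ..." into
# the key "17042..17914" (lower..upper), which is the form used in File 1.
sub header_to_key {
    my ($header) = @_;
    return unless $header =~ /:c?(\d+)-(\d+)/;
    my ($lo, $hi) = sort { $a <=> $b } ($1, $2);
    return "$lo..$hi";
}

# Build a hash: location key => sequence text (one line per "\n").
sub index_fasta {
    my ($lines) = @_;
    my (%seq_for, $key);
    for my $line (@$lines) {
        (my $text = $line) =~ s/\s+\z//;    # strip newline and trailing blanks
        if ($text =~ /^>/) {
            $key = header_to_key($text);    # undef if no coordinates were found
        }
        elsif (defined $key && length $text) {
            $seq_for{$key} .= "$text\n";
        }
    }
    return \%seq_for;
}

# Walk File 1 in order and emit ">Synonym" followed by the matching sequence.
sub relabel {
    my ($table_lines, $seq_for) = @_;
    my $out = '';
    for my $line (@$table_lines) {
        my @f = split ' ', $line;
        # Skip the title, protein count and column header rows.
        next unless @f >= 6 && $f[0] =~ /^\d+\.\.\d+$/;
        next unless exists $seq_for->{$f[0]};
        $out .= ">$f[5]\n" . $seq_for->{$f[0]} . "\n";
    }
    return $out;
}

# Tiny demo with shortened sequences; real data would be read from the two files.
my @file2 = (
    ">ref|NC_006027.1|:c17914-17042 hypothetical protein [Candida glabrata CBS 138]",
    "ATGGAAACAGAACATCAGGC",
    ">ref|NC_006027.1|:23693-25075 hypothetical protein [Candida glabrata CBS 138]",
    "ATGTCTTCTCAAGTTAACGA",
);
my @file1 = (
    "17042..17914\t-\t290\t50285983\t-\tCAGL0D00154g\t-\t-",
    "23693..25075\t+\t460\t50285985\t-\tCAGL0D00176g\t-\t-",
);
print relabel(\@file1, index_fasta(\@file2));
```

Because the key is normalised to lower..upper, minus-strand headers (the ones starting with "c") match their File 1 location on both limits, not just one. For real input, the two arrays would be filled by reading the files line by line instead of the inline demo data.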

ADD COMMENTlink written 7.6 years ago by ALchEmiXt1.9k