Question: scripting problem to replace matching strings in one file using another
5.8 years ago by rob234king (UK/Harpenden/Rothamsted Research)
rob234king wrote:

I have a map file and a GFF file. For each line in the map file, I want to find every occurrence of its first column in the GFF file and replace it with the second column from the map file. I've tried the map ID script from MAKER2, but it doesn't work beyond the mRNA field: the Parent of the CDS field is not changed, so it's not much use to me.

 

GFF file (tab-delimited):

Chromosome_1 Geneious mRNA 421 2049 . - . ID=12;Parent=12-gene
Chromosome_1 Geneious gene 421 2049 . - . ID=12-gene
Chromosome_1 Geneious CDS 421 2049 . - . ID=12:cds;Parent=12
Chromosome_1 Geneious CDS 3747 6569 . - . ID=rmRNA-1:cds;Parent=rmRNA-1
Chromosome_1 maker mRNA 3747 6569 . - . ID=rmRNA-1;Parent=gene1
Chromosome_1 maker gene 3747 6569 . - . ID=gene1
                 

Map file (tab-delimited):

12    B1
rmRNA-1    B2
 

 

Desired output:

Chromosome_1 Geneious mRNA 421 2049 . - . ID=B1;Parent=B1-gene
Chromosome_1 Geneious gene 421 2049 . - . ID=B1-gene
Chromosome_1 Geneious CDS 421 2049 . - . ID=B1:cds;Parent=B1
Chromosome_1 Geneious CDS 3747 6569 . - . ID=B2:cds;Parent=B2
Chromosome_1 maker mRNA 3747 6569 . - . ID=B2;Parent=gene1
Chromosome_1 maker gene 3747 6569 . - . ID=gene1
modified 5.8 years ago by alec_djinn • written 5.8 years ago by rob234king

Thanks for both responses; they accomplish exactly what I was hoping for. I was spending too much time contemplating doing it by hand, which would be insane, and was getting quite frustrated in the end, when it should be as simple as the answers given below.

written 5.8 years ago by rob234king
5.8 years ago by alolex (United States)
alolex wrote:

If you are in a Linux/Unix environment, you could use sed like this:

# read each old/new ID pair from the map file and apply it to the GFF file
while read -r old new
do
  sed "s/$old/$new/g" tmp.gff > tmp2.gff
  mv tmp2.gff tmp.gff
done < tmp.map

Just make sure you keep a copy of your original GFF file, because this code overwrites it. $old is the value in the first column of the map file and $new is the value in the second. The loop goes through each line of the map file in turn and does the replacement. However, this is naive: it will replace, for example, every 12 in the file with B1, including any 12 inside coordinates. Below I've added some pre- and post-processing using awk to avoid this issue and do the replacement only in column 9. Note that sed still matches substrings, so within column 9 a 12 would also match inside a longer ID such as 112. You can edit the code to include other columns if you need to:

# print column 9 to a file by itself (fields are tab-separated)
awk -F'\t' '{print $9}' tmp.gff > col9.txt

# print all other columns to a separate file, keeping the tab separators
awk -F'\t' -v OFS='\t' '{print $1,$2,$3,$4,$5,$6,$7,$8}' tmp.gff > other.txt

# run the replacement on column 9
while read -r old new
do
  sed "s/$old/$new/g" col9.txt > tmp.txt
  mv tmp.txt col9.txt
done < tmp.map

# paste the files back together (paste joins with a tab by default)
paste other.txt col9.txt > final.gff
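To make the difference between the two approaches concrete, here is a minimal Python sketch (not part of the shell answer above) that applies the same substitution once to a whole line and once to column 9 only. The function name replace_in_column9 is hypothetical, and the start coordinate 412 is made up for illustration so that it happens to contain the old ID 12.

```python
def replace_in_column9(line, id_map):
    """Apply every old -> new substitution to the attributes column only."""
    fields = line.rstrip("\n").split("\t")
    for old, new in id_map.items():
        fields[8] = fields[8].replace(old, new)  # column 9 is index 8
    return "\t".join(fields)


id_map = {"12": "B1"}
# Hypothetical GFF line: the start coordinate 412 contains the old ID "12".
line = "Chromosome_1\tGeneious\tCDS\t412\t2049\t.\t-\t.\tID=12:cds;Parent=12"

# Naive approach: substitute everywhere on the line.
naive = line
for old, new in id_map.items():
    naive = naive.replace(old, new)

print(naive)                             # start coordinate becomes 4B1
print(replace_in_column9(line, id_map))  # coordinates are left untouched
```

The column-restricted version leaves the first eight fields exactly as they were, which is the same effect the awk split and paste steps aim for.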

 

 

5.8 years ago by alec_djinn (European Union)
alec_djinn wrote:

Here is a short Python script that will do the job.

 

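gff_file = 'gff.txt'
map_file = 'map.txt'
out_file = 'output.txt'

# Load the map file: old ID (column 1) -> new ID (column 2)
my_map = dict()
with open(map_file, 'r') as f:
    for line in f:
        data = line.rstrip('\n').split('\t')
        if len(data) < 2:
            continue  # skip blank or malformed lines
        my_map[data[0]] = data[1].strip()

with open(out_file, 'w') as out:
    with open(gff_file, 'r') as f:
        for line in f:
            data = line.rstrip('\n').split('\t')

            # Only the last column (the attributes) is modified.
            # Note: this is a plain substring replacement.
            for key, value in my_map.items():
                if key in data[-1]:
                    data[-1] = data[-1].replace(key, value)

            # Re-join with tabs so that no trailing tab is written
            out.write('\t'.join(data) + '\n')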
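One caveat with a plain substring replacement is that a key such as 12 also matches inside a longer ID such as 112. Below is a minimal sketch of one way around that. It assumes IDs are made of letters, digits and underscores, with separators like the ones in the question (=, ;, :, -) around them. The map keys are compiled into a single regular expression that matches whole tokens only, and all substitutions happen in one pass, so a replaced value is never replaced a second time by a later key. The names build_replacer and replace_ids are hypothetical and not part of the script above.

```python
import re


def build_replacer(id_map):
    """Return a function that swaps whole-token IDs in a single pass."""
    if not id_map:
        return lambda text: text  # nothing to replace
    # Longest keys first, so a longer ID is tried before any shorter prefix.
    keys = sorted(id_map, key=len, reverse=True)
    # (?<!\w) and (?!\w): the match may not touch a letter, digit or
    # underscore on either side, so "12" does not match inside "112".
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return lambda text: pattern.sub(lambda m: id_map[m.group(0)], text)


id_map = {"12": "B1", "rmRNA-1": "B2"}
replace_ids = build_replacer(id_map)

print(replace_ids("ID=12;Parent=12-gene"))           # ID=B1;Parent=B1-gene
print(replace_ids("ID=rmRNA-1:cds;Parent=rmRNA-1"))  # ID=B2:cds;Parent=B2
print(replace_ids("ID=112;Parent=112-gene"))         # unchanged
```

To use it in the script, data[-1] = replace_ids(data[-1]) could stand in for the inner loop over my_map.items(). Whether the \w boundary rule is right depends on what characters your real IDs contain, so treat it as a starting point.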