Question

Labelling sequences within fasta files according to sample name.

6.6 years ago

Mitra • 0

Hello everybody, I have multiple FASTA files from multiple samples, and I am trying to add the sample name to every sequence header within each file.

One of my files looks like this:

>M03691:51:000000000-BD94Y:1:1101:14841:1381 1:N:0:1
ACTGGGTGTAAAGGGCGTGTAGGCGGAGAAGCAAGTCAGAAGTGAAATCCATGGGCTTAACCCATGAACTGCTTTTGAAACTGTTTCCCTTGAGTATCGGAGAGGCAGGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCCTGCTGGACGACAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCGGT
>M03691:51:000000000-BD94Y:1:1101:15960:1389 1:N:0:1
TACTGGGGTATCTAATCCTATTTGCTCCCCACGCTTTCGGGACTGAGCGTCAGTTATGCGCCAGATCGTCGCCTTCGCCACTGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACGATCCTCTCACACACTCTAGCTCTACGGTTTCCATGGCTTACCGAAGTTAAGCTTCGATCTTTCACCACAGACCCTTAGTGCCGCCTGCTCCCTCTTTACACCCAGT
>M03691:51:000000000-BD94Y:1:1101:15662:1415 1:N:0:1
ACTGGGTGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCTGCGCCGGGTACGGGCGGGCTGGAGTGCGGTAGGGGAGACTGGAATTCCCGGTGTAACGGTGGAATGTGTAGATATCGGGAAGAACACCAATGGCGAAGGCAGGTCTCTGGGCCGTTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCCCGTA

For example, I want to add Sample1 in front of every sequence header in this file, keeping everything else as it is.

I found an old, very similar post on Biostars (C: Renaming Entries In A Fasta File), but it renames the header completely. I want to keep everything and only add the sample name after >.

In addition, I want to do this for a batch of files, so I assume I need to run some kind of loop?

I am trying the code below, obviously without success. I run it from the folder that contains all the FASTA files.

  for f in *.fasta ; do
     bname=`basename $f`
      pref=${bname%%.fasta}
       awk '/^>/{print ">bname" ++i next; next}{print}' < $f > ${pref}_new.fasta; 
    done

Can anybody please help me with this? Thanks,

Mitra

sequencing next-gen awk • 3.9k views

Are you doing this for the purpose of adding read groups to the samples?

Reply • 6.6 years ago by arfesta ▴ 40

I need to add sample names to all sequences because I later want to concatenate all the FASTA files and use the result for OTU picking in QIIME. Thanks.

Reply • 6.6 years ago by Mitra • 0

QIIME has accessory programs to do this sort of thing. Are you not following their workflow?

Reply • 6.6 years ago by GenoMax 141k

Could you please let me know whether QIIME has a direct way to do this? I couldn't find one. The only thing I found says that I need to pass labelled files when working with demultiplexed data, but it does not say how to label them. That is why I was trying the code above.

Reply • 6.6 years ago by Mitra • 0

6.6 years ago

Brian Bushnell 20k

BBMap has a tool called rename.sh which I use for this purpose:

rename.sh in=file.fa prefix=sample1 out=renamed.fa addprefix

There's also a related tool, "muxbyname.sh", which is great for bulk operations (renaming sequences from many files based on their origin file and outputting them into a single file), but not quite applicable in this case.
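
For many files, the same command can be generated once per file. Below is a minimal Python sketch, assuming that each file name minus its extension is the sample name and that rename.sh is on the PATH. The helper names (build_rename_commands, run_all), the *.fasta pattern and the 'renamed' output sub-folder are illustrative choices, not part of BBMap; only the in=, prefix=, out= and addprefix arguments come from the command above.

```python
import subprocess
from pathlib import Path


def build_rename_commands(folder, pattern="*.fasta", out_subdir="renamed"):
    """Return one rename.sh argument list per FASTA file found in `folder`.

    Assumption: the file name without its extension is the sample name.
    """
    folder = Path(folder)
    out_dir = folder / out_subdir  # hypothetical output location
    commands = []
    for fasta in sorted(folder.glob(pattern)):
        sample = fasta.stem  # e.g. "Sample1" for "Sample1.fasta"
        commands.append([
            "rename.sh",
            f"in={fasta}",
            f"prefix={sample}",
            f"out={out_dir / fasta.name}",
            "addprefix",
        ])
    return commands


def run_all(folder, pattern="*.fasta", out_subdir="renamed"):
    """Create the output folder and run rename.sh once per input file."""
    (Path(folder) / out_subdir).mkdir(exist_ok=True)
    for cmd in build_rename_commands(folder, pattern, out_subdir):
        # check=True stops at the first file that rename.sh fails on.
        subprocess.run(cmd, check=True)
```

Calling run_all("path/to/fasta_folder") would then process every matching file in turn. Writing the results to a sub-folder keeps them from being matched by the pattern if the loop is run a second time.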

Hi Brian, should I run this in a for loop for multiple input files? Can you please suggest how? Thanks, Mitra

Reply • 6.6 years ago by Mitra • 0

Is there a correlation between existing file names and the sample prefix that could be leveraged to create a loop? Do your files still have the barcodes at the beginning of the reads?

Reply • 6.6 years ago by GenoMax 141k

Yes, there is a correlation: each file name is the same as the sample name that I want to insert after > for every sequence in that multi-FASTA file.

These files are already demultiplexed, with adapters and barcodes removed. I then stitched the reads using fastq-join (within QIIME 1.1.9) and converted them from FASTQ to FASTA using fastx-toolkit. So now I have one multi-FASTA file per sample.

Reply • 6.6 years ago by Mitra • 0

score 1 · Accepted Answer · 2017-09-07

6.6 years ago

shoujun.gu ▴ 380

I modified the script. You could try:

1. Save the code in a file named 'biostar.py' (or any other name).
2. Move all your FASTA files into a new folder named 'newfolder' (or any other name).
3. In a shell (make sure the Python version is 3.5 or later), run: python3 biostar.py newfolder
4. The output files are written to the same folder, with '_out' appended to the original file name. You can change this in the script.

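import os
import sys

# Folder containing the FASTA files, given as the first command-line argument.
folder = sys.argv[1]
os.chdir(folder)

# List the files directly instead of calling `ls`; skip sub-folders and the
# outputs of a previous run so they are not labelled a second time.
filelist = sorted(n for n in os.listdir('.')
                  if os.path.isfile(n) and not n.endswith('_out'))

for name in filelist:
    output = name + '_out'
    fa = []
    with open(name, 'r') as file:
        for line in file:
            if line[0] == '>':
                # Insert the file name right after '>' and keep the rest of the header.
                line = '>' + name + line[1:]
            fa.append(line)
    with open(output, 'w') as out:
        out.writelines(fa)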

hope the result is what you want
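
A variant of the same idea, written as reusable functions, may be easier to adapt. This is only a sketch: it streams each file line by line instead of holding it in memory, uses the file name without its extension as the label, and writes to a separate folder so that a second run does not pick up its own output. The function names, the '_' separator and the *.fasta pattern are assumptions for illustration; adjust them to whatever the downstream tool expects.

```python
from pathlib import Path


def label_fasta(in_path, out_path, sample=None, sep="_"):
    """Copy a FASTA file, inserting `sample` + `sep` right after each '>'.

    Returns the number of headers that were labelled. If `sample` is not
    given, the input file name without its extension is used (assumption).
    """
    in_path = Path(in_path)
    if sample is None:
        sample = in_path.stem
    labelled = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Keep the original header text after the new label.
                line = f">{sample}{sep}{line[1:]}"
                labelled += 1
            dst.write(line)
    return labelled


def label_folder(in_dir, out_dir, pattern="*.fasta"):
    """Label every matching file in `in_dir`, writing copies to `out_dir`."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    if in_dir.resolve() == out_dir.resolve():
        raise ValueError("out_dir must differ from in_dir")
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    for fasta in sorted(in_dir.glob(pattern)):
        counts[fasta.name] = label_fasta(fasta, out_dir / fasta.name)
    return counts
```

For example, label_folder("newfolder", "labelled") would write one labelled copy per input file and return a dictionary mapping each file name to the number of headers changed, which gives a quick check that every file contained sequences.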

Hi shoujun.gu, should I run this in a for loop for multiple input files? I have over 100 FASTA files to process. Can you please suggest how? Thanks, Mitra

Reply • 6.6 years ago by Mitra • 0

I modified the previous answer. Hope it works.

Reply • 6.6 years ago by shoujun.gu ▴ 380

Thank you very, very much, shoujun.gu, you saved my day :) A BIG thank you!

Reply • 6.6 years ago by Mitra • 0