Question: Labelling sequences within fasta files according to sample name.
2.5 years ago by
Mitra0
Mitra0 wrote:

Hello everybody, I have multiple FASTA files from multiple samples, and I am trying to add the sample name to every sequence header within each file.

One of my files looks like this:

>M03691:51:000000000-BD94Y:1:1101:14841:1381 1:N:0:1
ACTGGGTGTAAAGGGCGTGTAGGCGGAGAAGCAAGTCAGAAGTGAAATCCATGGGCTTAACCCATGAACTGCTTTTGAAACTGTTTCCCTTGAGTATCGGAGAGGCAGGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCCTGCTGGACGACAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCGGT
>M03691:51:000000000-BD94Y:1:1101:15960:1389 1:N:0:1
TACTGGGGTATCTAATCCTATTTGCTCCCCACGCTTTCGGGACTGAGCGTCAGTTATGCGCCAGATCGTCGCCTTCGCCACTGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACGATCCTCTCACACACTCTAGCTCTACGGTTTCCATGGCTTACCGAAGTTAAGCTTCGATCTTTCACCACAGACCCTTAGTGCCGCCTGCTCCCTCTTTACACCCAGT
>M03691:51:000000000-BD94Y:1:1101:15662:1415 1:N:0:1
ACTGGGTGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCTGCGCCGGGTACGGGCGGGCTGGAGTGCGGTAGGGGAGACTGGAATTCCCGGTGTAACGGTGGAATGTGTAGATATCGGGAAGAACACCAATGGCGAAGGCAGGTCTCTGGGCCGTTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCCCGTA

For example, I want to add Sample1 at the start of every sequence header in this file, keeping everything else as it is.

I found an old, very similar Biostars post, "C: Renaming Entries In A Fasta File", but it renames the header completely. I want to keep everything and only add the sample name after the >.

In addition, I want to do this for a batch of files, so I assume I need to run some kind of loop?

I am trying the code below, without success. From the folder that contains all the FASTA files, I run:

  for f in *.fasta ; do
     bname=`basename $f`
      pref=${bname%%.fasta}
       awk '/^>/{print ">bname" ++i next; next}{print}' < $f > ${pref}_new.fasta; 
    done

Can anybody please help me with this? Thanks,

Mitra

awk next-gen sequencing • 1.6k views
ADD COMMENTlink modified 2.5 years ago by Brian Bushnell17k • written 2.5 years ago by Mitra0

Are you doing this for the purpose of adding read groups to the samples?

ADD REPLYlink written 2.5 years ago by arfesta30

I need to add sample names to all sequences because I later want to concatenate all the FASTA files and feed the result to OTU picking in QIIME. Thanks.

ADD REPLYlink written 2.5 years ago by Mitra0

QIIME has accessory programs to do this sort of thing. Are you not following their workflow?

ADD REPLYlink written 2.5 years ago by genomax78k

Please let me know if QIIME has a direct way to do this; I couldn't find one. There it only says that I need to pass labelled files if I work with demultiplexed data, but not how to label them. That is why I was trying the approach above.

ADD REPLYlink written 2.5 years ago by Mitra0
2.5 years ago by
shoujun.gu370
Rockville/MD
shoujun.gu370 wrote:

I modified the script. You could try:

  1. Save the code in a file named 'biostar.py' (or any other name).
  2. Move all your fasta files into a new folder named 'newfolder' (or any other name).
  3. In a shell (make sure the Python version is 3.5 or later), run: python3 biostar.py newfolder
  4. The output files are written to the same folder, with '_out' appended to the original file name. You can change this in the script.

import os
import sys

# Folder containing the fasta files, passed as the first command-line argument.
folder = sys.argv[1]
os.chdir(folder)

# Regular files only; skip outputs left over from a previous run.
filelist = [f for f in sorted(os.listdir('.'))
            if os.path.isfile(f) and not f.endswith('_out')]

for name in filelist:
    output = name + '_out'
    fa = []
    with open(name, 'r') as file:
        for line in file:
            # Header lines start with '>': insert the file name right after it.
            if line.startswith('>'):
                line = '>' + name + line[1:]
            fa.append(line)
    with open(output, 'w') as out:
        out.writelines(fa)

Hope the result is what you want.

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by shoujun.gu370
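
The script above prefixes each header with the full file name, extension included, and with no separator. If the label should be just the sample name, a variant along the following lines might work. This is only a sketch: the function names `label_headers` and `label_folder`, the `_labelled` output suffix, and the underscore separator are all illustrative assumptions, and the `*.fasta` pattern is taken from the loop in the question.

```python
from pathlib import Path


def label_headers(lines, sample, sep="_"):
    """Yield FASTA lines with `sample` + `sep` inserted right after '>' on header lines."""
    for line in lines:
        if line.startswith(">"):
            line = ">" + sample + sep + line[1:]
        yield line


def label_folder(folder, pattern="*.fasta", suffix="_labelled", sep="_"):
    """Write a labelled copy of every file matching `pattern` in `folder`.

    The sample name is assumed to be the file name without its extension.
    """
    written = []
    for path in sorted(Path(folder).glob(pattern)):
        if path.stem.endswith(suffix):
            continue  # skip outputs from an earlier run
        out_path = path.with_name(path.stem + suffix + path.suffix)
        with open(path) as src, open(out_path, "w") as dst:
            dst.writelines(label_headers(src, path.stem, sep))
        written.append(out_path)
    return written
```

Calling `label_folder("newfolder")` would then turn `Sample1.fasta` into `Sample1_labelled.fasta`, with headers such as `>Sample1_M03691:51:...`. Sequence lines are copied unchanged.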

Hi shoujun.gu, should I run this in a for loop for multiple input files? I have over 100 fasta files to process. Can you please suggest how? Thanks, Mitra

ADD REPLYlink written 2.5 years ago by Mitra0

I modified the previous answer. Hope it works.

ADD REPLYlink written 2.5 years ago by shoujun.gu370

Thank you very very much shoujun.gu you saved my day :) A BIG Thank you

ADD REPLYlink written 2.5 years ago by Mitra0
2.5 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

BBMap has a tool called rename.sh which I use for this purpose:

rename.sh in=file.fa prefix=sample1 out=renamed.fa addprefix

There's also a related tool, "muxbyname.sh", which is great for bulk operations (renaming sequences from many files based on their origin file and outputting them into a single file), but not quite applicable in this case.

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Brian Bushnell17k
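
The bulk operation described above (label each sequence by its file of origin and write everything to a single file) can also be sketched in plain Python. This is not how muxbyname.sh is implemented, nor its naming scheme: the function name `combine_labelled` and the `>{sample}_{n} {original header}` format are assumptions chosen for illustration, so check what the downstream tool actually expects before relying on it.

```python
from pathlib import Path


def combine_labelled(folder, out_path, pattern="*.fasta"):
    """Concatenate every FASTA file in `folder` into `out_path`, labelling each header.

    Each header becomes '>{sample}_{n} {original header}', where `sample` is the
    file name without its extension and `n` counts sequences within that sample.
    Returns a dict mapping sample name to number of sequences written.
    """
    out_path = Path(out_path)
    counts = {}
    with open(out_path, "w") as dst:
        for path in sorted(Path(folder).glob(pattern)):
            if path.resolve() == out_path.resolve():
                continue  # never read the combined file back in
            n = 0
            with open(path) as src:
                for line in src:
                    if line.startswith(">"):
                        n += 1
                        line = f">{path.stem}_{n} {line[1:]}"
                    if not line.endswith("\n"):
                        line += "\n"  # keep the next file's header on its own line
                    dst.write(line)
            counts[path.stem] = n
    return counts
```

For example, `combine_labelled("newfolder", "combined.fna")` would write one combined file and return the per-sample sequence counts, which are handy for a quick sanity check against the inputs.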

Hi Brian, should I run this in a for loop for multiple input files? Can you please suggest how? Thanks, Mitra

ADD REPLYlink written 2.5 years ago by Mitra0

Is there a correlation between existing file names and the sample prefix that could be leveraged to create a loop? Do your files still have the barcodes at the beginning of the reads?

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax78k

Yes, there is a correlation: the file names are the same as the sample names that I want to insert after the > of each sequence in each multifasta file.

These files are already demultiplexed, with adapters and barcodes removed. I then stitched them using fastq join (within qiime 1.1.9) and converted them from fastq to fasta using fastx-toolkit. So now I have one multifasta file per sample.

ADD REPLYlink written 2.5 years ago by Mitra0
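
Since the file names match the sample names, the rename.sh command shown earlier can be driven from a short loop. The sketch below only builds and runs that same command once per file; it assumes rename.sh is on the PATH and that the files end in `.fasta`. The helper names `rename_commands` and `run_all`, and the `_renamed` output suffix, are made up for this example.

```python
import subprocess
from pathlib import Path


def rename_commands(folder, pattern="*.fasta", suffix="_renamed"):
    """Build one rename.sh command per FASTA file, using the file stem as the prefix."""
    commands = []
    for path in sorted(Path(folder).glob(pattern)):
        if path.stem.endswith(suffix):
            continue  # skip outputs from an earlier run
        out_path = path.with_name(path.stem + suffix + path.suffix)
        commands.append([
            "rename.sh",
            f"in={path}",
            f"prefix={path.stem}",
            f"out={out_path}",
            "addprefix",
        ])
    return commands


def run_all(commands):
    """Run each command in turn; raises CalledProcessError at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Printing `rename_commands("newfolder")` first is a cheap way to inspect the commands before executing them with `run_all(...)`.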