Question: Add filename to fasta headers of multiple fasta files inside loop
0
gravatar for bioinfo8
2.2 years ago by
bioinfo8120
bioinfo8120 wrote:

Hi,

I have 10 fasta files (each file with 20 gene sequences from each of the 10 samples). I would like to create 20 files, specific to each gene from 10 samples. I proceeded as follows to extract genes with the file_name in header:

pyfasta extract --header --fasta test.fasta gene_name1 | awk '/^>/ {$0=$0 "_sample1"}1' > gene_name1.fasta

Output:

>gene_name1_sample1
ATGC

I am successful in creating multiple gene fasta files for each gene from each sample (a part from loop):

 pyfasta extract --header --fasta $sample.fasta gene_name1 >> gene_name1.fasta 
 pyfasta extract --header --fasta $sample.fasta gene_name2 >> gene_name2.fasta

But, I am unable to add file_name to the header of files in loop (but can do for 1 file as mentioned in the beginning).

Kindly guide.

Thanks.

bash gene header pyfasta fasta • 2.1k views
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by bioinfo8120

Can you post an example of some of the headers from one of the sample fasta? Is each sample fasta a multi-line or single line fasta? Is each header, per gene, identical across sample fastas?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by st.ph.n2.5k

Header from sample1.fasta file

 >gene_name1
 ATGC..............................max upto 120 characters per line
 TTTG..............................................................
 >gene_name2
 ATGG..............................................................

 ....

Like this upto 20 gene names and their sequences. Each sample fasta files are multi-line. Gene name is same across sample fastas.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by bioinfo8120
2
gravatar for st.ph.n
2.2 years ago by
st.ph.n2.5k
Philadelphia, PA
st.ph.n2.5k wrote:

The question is pretty generalized with the use of terms. From what I understand, this should work:

If each sample file are not single line fasta, linearize, on commandline:

for file in *.fasta; do awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < $file > "`basename $file .fasta`_single-line.fasta"; done

Then run this python script:

#!/usr/bin/env python

import glob
from collections import defaultdict

genes = defaultdict(list)

for file in glob.glob('*_single-line.fasta'):
    with open(file, 'r') as f:
        for line in f:
            if line.startswith(">"):
                # Gene ID : (sample name, sequence)
                genes[line.strip().split('>')[1]].append((file.split('.fasta')[0], next(f).strip()))

for g in genes:
    # open fasta per gene name
    with open(g + '.fasta', 'w') as out:
        # for each sample name and sequence
        for n in range(len(genes[g])):
            # print to file, >gene_sample, sequence
            out.write('>' + g + '_' + genes[g][n][0] + '\n' + genes[g][n][1])
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by st.ph.n2.5k

Thanks. But, I want to combine with pyfasta so that I can extract gene from each sample file + add sample name (file name) to header in one step. Can you suggest some easier way, such as awk, sed etc. As awk is working for one file and I did not do any linerization eventhough that file is multi-line fasta.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by bioinfo8120

I want to combine with pyfasta so that I can extract gene from each sample file + add sample name (file name) to header in one step

OK, so this is clearly a two step process, however the code can be converted to not have to use the awk statement first, and simply input the multi-line fasta.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by st.ph.n2.5k

@st.ph.n I explained my query in more detail in above comment. I hope its more clear now.

ADD REPLYlink written 2.2 years ago by bioinfo8120
1

Let's first check if the above does what you need.

Put this into a bash script, run.sh:

#!/usr/bin/bash

for file in *.fasta; do 
    awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < $file > "`basename $file .fasta`_single-line.fasta"
done

python combine.py

And the above python code into, combine.py.

Run the pipeline as bash combine.py in the directory with your sample fastas. This will run in one step.

ADD REPLYlink written 2.2 years ago by st.ph.n2.5k

Thanks @st.ph.n. run.sh worked. :)

combine.py also worked but file name did not appear near to 1st sequence and no sequence for last. Also, all files are of same size, but I did not understand that all files are in order except file 8 and 9.

#gene1.fasta
>gene1_file10
ATGC...............................>gene1_file8
ATGC...............................>gene1_file9
NTGC...............................>gene1_file7
ATGC...............................>gene1_file6
NNGC...............................>gene1_file5
ATGC...............................>gene1_file4
ATGC...............................>gene1_file3
ATGC...............................>gene1_file2
ATGC...............................>gene1_file1
ATGC...............................

#gene2.fasta
Same for gene2 and other genes
ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by bioinfo8120

What are you using to view the new fasta? I'm confused about what the issue is.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by st.ph.n2.5k

Notepad++ editor. I mean file names are shifted.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by bioinfo8120
1

What do you need to do with the new files? I recommend vi, or nano on the CLI. Notepad sometimes has issues with newlines, in my experience. Word typically handles these okay.

ADD REPLYlink written 2.2 years ago by st.ph.n2.5k

Same issue with vi, vim also.

ADD REPLYlink written 2.2 years ago by bioinfo8120

I don't see why the file is appearing like you posted. I ran a test on this script, and it appears fine. What python version are you using? python --version

try changing this line:

out.write('>' + g + '_' + genes[g][n][0] + '\n' + genes[g][n][1])

to this, to print to stdout and file:

print '>' + g + '_" + genes[g][n][0], '\n', genes[g][n][1]
print >> out, '>' + g + '_' + genes[g][n][0], '\n', genes[g][n][1]
ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by st.ph.n2.5k
1

You just need to add a '\n' at the end of the string using the write method.

ADD REPLYlink written 2.2 years ago by Matt Shirley9.2k

I've never had an issue with how it's formatted in the past since within a loop, each print line should print to a new line for the header and another for the sequence, but adding the newline character to the end should fix it.

ADD REPLYlink written 2.2 years ago by st.ph.n2.5k

Thanks Matt Shirley

ADD REPLYlink written 2.2 years ago by bioinfo8120

@st.ph.n

Python 2.7.12

I edited combine.py and ran again, but it showed following error:

  File "combine.py", line 25
    print  '>'  + g + '_" + genes[g][n][0], '\n', genes[g][n][1]
                                                              ^
SyntaxError: unexpected character after line continuation character
ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by bioinfo8120
1

just add the '\n' as Matt suggested.

out.write('>' + g + '_' + genes[g][n][0] + '\n' + genes[g][n][1] + '\n')
ADD REPLYlink written 2.2 years ago by st.ph.n2.5k

Thank you very much st.ph.n, it worked :-)
Would you please provide more explanation for the script, if possible?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by bioinfo8120

There are some comment lines. What needs further explanation?

ADD REPLYlink written 2.2 years ago by st.ph.n2.5k
genes[line.strip().split('>')[1]].append((file.split('.fasta')[0], next(f).strip()))

The meaning of [1] and [0]?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by bioinfo8120
1

[int] specifies in index, in this case line.strip().split('>')[1] is splitting the header into '>' and 'gene name', and python starts at 0 so we're taking the gene name. Sae with splitting the file name into parts, 'file name' and '.fasta', and we're taking the file name.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by st.ph.n2.5k
1
gravatar for Matt Shirley
2.2 years ago by
Matt Shirley9.2k
Cambridge, MA
Matt Shirley9.2k wrote:

You can use pyfaidx to do this in one step:

$ pip install pyfaidx
$ faidx -x -e "lambda x: x + '_sample1'" test.fasta

The faidx script is comparable to the pyfasta script, but has many more features. It also creates/uses the .fai index, just like samtools. The -x argument splits the input file into separate files, named using the header. The -e argument takes a python lambda function that returns a modified header name, allowing you to do complex transformations. If you don't pass any region positional arguments, faidx will operate over all sequences in the input FASTA file, so you don't need a bash loop.

ADD COMMENTlink written 2.2 years ago by Matt Shirley9.2k

@Matt Shirley Thanks, it will take only one fasta file at a time. My aim is to extract the genes with similar gene name from all the fasta files and make gene specific fasta files with updated header including gene name and file name (so that I should know from which file that gene came) + append the gene sequences in the file with that gene name. Here are the sample input and output files:

Input files:
#file1.fasta

>gene1
ATGC
>gene2
ATGA
>gene3
ATGTTT

#file2.fasta

>gene1
ATGG
>gene2
ATGC
>gene3
ATGTT

Expected output files:

#gene1.fasta
>gene1_file1
ATGC
>gene1_file2
ATGG

#gene2.fasta
>gene2_file1
ATGA
>gene2_file2
ATGC
ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by bioinfo8120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1524 users visited in the last hour