Question: The efficient way to generate fasta and bed files
0
gravatar for elisheva
2.5 years ago by
elisheva100
Israel
elisheva100 wrote:

Hi everyone!
I have big fasta files (about 1-2 GB ) received by bedtools getfasta command each file has a header with information (chromosome, coordinates, and strand) and a sequence of 10 nucleotides.
I have to select only the sequences which have pyrimidines in 4-5 columns and generate fasta and bed files for them.
I wrote a python script for it, but the problem is it takes a lot of time. here is my code:

#This script generates new files which contain only pyrimidines.
from Bio import SeqIO
import sys
import traceback
import warnings
import argparse
PYRIMIDINES = ["TT","TC","CT","CC"]

def bedFormat(Id):
    """This function manipulates the fasta file header for getting appropriate bed file """
    chromosome = Id[Id.find(':') +2:Id.rfind(':')]
    start = Id[Id.rfind(':') +1:Id.find('-')]
    end = Id[Id.find('-') +1:Id.find('()')]
    strand = Id[Id.find('>') +1:Id.find(':')]
    lineBed = (chromosome,start,end,strand)
    line = "\t".join(lineBed) + "\n"
    return line


def writeFiles(path_to_directory,path_to_file,name):
""" This function gets fasta file and writes new filtered bed and fasta files according to the original file"""
    bed_file = (open(path_to_directory + "/filtered_bed/" + name + ".bed", "wb+")) #filtered bed file.
    fasta_file = (open(path_to_directory + "/filtered_fasta/" + name, "wb+")) #filtered fasta file.
    with open(path_to_file, mode='r') as handle:
        for record in SeqIO.parse(handle, 'fasta'):
            seq = record.seq.upper()
            if seq[3:5] in PYRIMIDINES: #If there are pyrimidines on the 4 and 5 columns. 
                bed_file.writebedFormatrecord.id))
                SeqIO.write(record, fasta_file, "fasta")
    bed_file.close()
    fasta_file.close()


def main():
    path_to_directory = sys.argv[1] #The directory path
    path_to_file = sys.argv[2] #The file's path
    name = sys.argv[3] #The file's name
    writeFiles(path_to_directory,path_to_file,name)

if __name__ == '__main__':
    main()

For example, if this is the input file:

>+::chrY:59351341-59351351()
tttcctagcc
>+::chrY:59352450-59352460()
agctttagta
>+::chrY:59352832-59352842()
ggttttaggg
>-::chrY:59358570-59358580()
GCTGAGAGCC
>-::chrY:59363409-59363419()
ggttaggggt

the desired output fasta file is:

>+::chrY:59351341-59351351()
tttcctagcc
>+::chrY:59352450-59352460()
agctttagta
>+::chrY:59352832-59352842()
ggttttaggg

without the sequences:

>-::chrY:59358570-59358580()
GCTGAGAGCC
>-::chrY:59363409-59363419()
ggttaggggt

which have GA or TA in 4,5 columns.

and the matching desired output bed file - contains only the sequences in the fasta file is:

chrY    59351341        59351351        +
chrY    59352450        59352460        +
chrY    59352832        59352842        +

Can anyone suggest a better solution for this goal?

bed python fasta • 893 views
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by elisheva100
1

please, comment or validate your previous questions:

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

Upvote|Bookmark|Accept

ADD REPLYlink written 2.5 years ago by Pierre Lindenbaum131k

show us an example of input and of desired output.

ADD REPLYlink written 2.5 years ago by Pierre Lindenbaum131k

I don't understand why you want to convert a fasta to a bed file ? Fasta files contain only the primary DNA sequence not genomic coordinates ? Could you explain a little bit more ? Are the genomic informations in the fasta header ?

ADD REPLYlink written 2.5 years ago by Nicolas Rosewick9.1k

what's the difference between input.fasta and output.fasta ? you're not explaining us what you're trying to do, we're not medium.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Pierre Lindenbaum131k
3
gravatar for Pierre Lindenbaum
2.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

assuming two lines per fata record:

cat input.fa  |\
paste - -  |\
awk 'toupper(substr($2,4,2)) ~ /^(TT|TC|CT|CC)$/' |\
tr "\t" "\n" |\
tee output.fa |\
awk -F '[>(:-]' '/^>/ {printf("%s\t%s\t%s\t%s\n",$4,$5,$6,$2);}'


chrY    59351341    59351351    +
chrY    59352450    59352460    +
chrY    59352832    59352842    +

$ cat output.fa

>+::chrY:59351341-59351351()
tttcctagcc
>+::chrY:59352450-59352460()
agctttagta
>+::chrY:59352832-59352842()
ggttttaggg
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Pierre Lindenbaum131k
0
gravatar for Bastien Hervé
2.5 years ago by
Bastien Hervé4.8k
Karolinska Institutet, Sweden
Bastien Hervé4.8k wrote:

As said in comment, it's hard to unsterstand what you want to achieve, but given your examples :

I would say that your writeFiles is well wrote, except this line bed_file.writebedFormatrecord.id)) and the path finding

I assume that you want to pass your record.id to your bedFormat function, then to write the result in a file :

from Bio import SeqIO
import sys
import traceback
import warnings
import argparse
import re
PYRIMIDINES = ["TT","TC","CT","CC"]

def bedFormat(Id):
    """This function manipulates the fasta file header for getting appropriate bed file """
    splitted_line = re.split(':|-',Id[:-2])
    splitted_line.remove('')
    return "\t".join([splitted_line[i] for i in [1,2,3,0]]) + "\n"

def writeFiles(path_to_directory,path_to_file,name):
    """ This function gets fasta file and writes new filtered bed and fasta files according to the original file"""
    bed_file = open(path_to_directory + "/filtered_bed/" + name + ".bed", "a") #filtered bed file.
    fasta_file = open(path_to_directory + "/filtered_fasta/" + name + ".fasta", "wb+") #filtered fasta file.

    with open(path_to_file+"/"+name+ ".fasta", mode='r') as handle:

        for record in SeqIO.parse(handle, 'fasta'):
            seq = record.seq.upper()

            if seq[3:5] in PYRIMIDINES: #If there are pyrimidines on the 4 and 5 columns. 
                bed_line = bedFormat( record.id) ###There is no space between "(" and "r" but biostars eat my "(" without adding a space
                bed_file.write(bed_line)
                SeqIO.write(record, fasta_file, "fasta")
    bed_file.close()
    fasta_file.close()

def main():
    path_to_directory = sys.argv[1] #The directory path
    path_to_file = sys.argv[2] #The file's path
    name = sys.argv[3] #The file's name
    writeFiles(path_to_directory,path_to_file,name)

if __name__ == '__main__':
    main()

I guess that all these rfind and find take to much time. Do you win time with my bedFormat function ?

How many reads do you have ?

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Bastien Hervé4.8k

Thank a lot for the solution, but for some reason, it took even much time

ADD REPLYlink written 2.5 years ago by elisheva100

what is the size of your input file ?

ADD REPLYlink written 2.5 years ago by Nicolas Rosewick9.1k

Try to supply more informations about your input files and try to investigate the running time of each function in the script above.

ADD REPLYlink written 2.5 years ago by Bastien Hervé4.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1839 users visited in the last hour