To create a multi fasta file which contains each gene's exons
2
0
Entering edit mode
6 months ago
Berk • 0

Can you do this , this is beyond my learnings.

orf Python fasta • 977 views
3
Entering edit mode

I was gonna post 'sounds like a homework assignment' ... but this is beyond obvious apparently :)

0
Entering edit mode

Yeah , good instinct.😂

0
Entering edit mode

So the question is ,can someone do it? İ cant do this with my knowledge 😁

0
Entering edit mode

perhaps one of your team mates ? (I see it's a group assignment ;) )

1
Entering edit mode

Hi Berk

this is beyond my learnings

I guess they are the objectives of the assignment itself;

I'd also think that -as good start for a fruitful discussion- you could describe up to where your learnings brings you along the task.

1
Entering edit mode
6 months ago

The bed2faidxsta.pl script will convert your ORFs to FASTA. This gets you halfway there; you'll need to stitch exons together afterwards:

You'll need to install samtools and likely use the --fastaIsUncompressed option, i.e.:

$bed2faidxsta.pl --fastaIsUncompressed < ORFs.bed > ORFs.fa  You'll need a file called NC_007605.1.fa in the same directory as ORFs.bed, which contains the sequence data for that chromosome. You'll also need to edit line 67 to get FASTA headers to look the way you want. I won't explain how, but it should be straightforward from the code. As a further hint, how you process $id can help you stitch exons together further down the line.

Note: In the future, try to avoid posting graphics to questions. Graphics can't be indexed for search engines, the way text can. This question may get deleted by a moderator, for that reason.

0
Entering edit mode

I tried this on python after installing samtools ,tested samtools for if its working and its ok. But code errored for some reason , even i edited line67.How can i fix it ?

1
Entering edit mode
#!/usr/bin/env perl


is not python

0
Entering edit mode
6 months ago
Berk • 0
handle = open("new_file.txt", "w")
handle.write("My FIRST line of text.\n")
handle.write("My SECOND line of text.\n")
handle.write("My THIRD line of text.\n")

handle.close()
handle = open("new_file.txt","r")
print(saved_line+ "is my favorite.")
print(saved_line.strip() + "is my favorite.")
for line in handle:

print (line.strip())

handle.close()
import sys

GeneBedfile = sys.argv[1]

with open(GeneBedfile) as bfile:
for line in bfile:
start = int(line.strip().split("\t")[1])
end = int(line.strip().split("\t")[2])
geneName = str(line.strip().split("\t")[3])
print(geneName, "gene", "starts at", start, "ends at", end)

1
Entering edit mode

fyi we have markdown support. Select the code and press the 10101 button or simply use the markdown backticks like below to highlight your code:

python
python...code



Or alternatively paste the code into a Gist, and then simply copy/paste the link here as Alex Reynolds does above, the code will then be displayed automatically.

0
Entering edit mode

İ think i didint understand it , can you explain it again , i need to do what alex did on python 😅.If you guys make me do this correctly , you ll do a great thing for.someone

1
Entering edit mode

You won't learn much if people write out the full solution for you. You have been given about 80% of the solution, both in Python and Perl. Work out the rest.

1
Entering edit mode

You'll need the chromosome as well. If you want to use Python, instead of Perl, here's one thought:

with open(geneBedFilename, 'r') as bedFileHandle:
for line in bedFileHandle:
(chromosome, start, end, geneName) = line.rstrip().split('\t')
start = int(start)
end = int(end)


Once you have this, you can build a samtools query with the values of chromosome, start, and end. You could then run that query with the subprocess Python library. You then take the output sequence and write that out with the header, via geneName. Remember that you still need to stitch exons, but this gets you most of the way there.

Another way to do this — without using samtools — is to read the genome FASTA into memory, into a Python dictionary.

In a dictionary key-value pair, the chromosome name is the key and the chromosome's sequence is the value:

currentChromosome = ''
genomeFasta = {}
with open(genomeFastaFilename, 'r') as genomeFastaFileHandle:
for line in genomeFastaFileHandle:
line = line.rstrip()
if line.startswith('>'):
currentChromosome = line.lstrip('>').split(' ')[0]
genomeFasta[currentChromosome] = ''
else:
genomeFasta[currentChromosome] += line


Once you have read the viral genome into memory, you can then use the start and end values as offsets into the chromosome's sequence:

with open(geneBedFilename, 'r') as bedFileHandle:
for line in bedFileHandle:
(chromosome, start, end, geneName) = line.rstrip().split('\t')
start = int(start)
end = int(end)
sequence = genomeFasta[chromosome][start:end-1]
sys.stdout.write('>{}\n{}\n'.format(geneName, sequence))


This approach is okay for a viral genome as they don't tend to be very long (the longest known virus is about 1.3Mb). But this approach is probably not so practical for larger genomes, like human or mouse, where an indexed search is preferred. You may be better off learning how this is done via samtools.