Question: Compairing A 454 Assemblage With A Gene Expression Analysis In Python
2
gravatar for Nico
8.1 years ago by
Nico30
United States
Nico30 wrote:

HELP? i need to map 20bp sequences produced by a gene expression analysis (fastq)to the contigs produced by a 454 assemblage (fasta), here is the kicker i dont care about the tags mapped to the open reading frame (ORF) range of the contigs. i have a file with the ORF ranges for each contig in the assemblage i just can’t combine all 3 ideas, im a noob so sorry if this is super cakie i have an idea but am a poor programmer, match the names of the contigs in the assemblage file and the ORF finder file then delete all the sequence in the ORF range. taking this new ORF free assemblage and mapping the short tags to it, only allowing for one (20bp) tag to contig(ORF free) match .

so i want to locate the 20bp tags in the 454 assembalage out side of the ORF range.

example data in files. (files up to 2GB)

20bp tags

GATCCAGGATGGAGTCTTTG
GATCCTGAAGATGGAGTCTC
GATCCTTCCACACAAAGTGG
GATCACTGGACCGGCGGCCA
GATCTTTAGTCTAAAGTGTC

the 454 assemblage

GD4HFQN03F001A tcagacgagtgcgtGCGCACCGGACCAGAAAGGCCAGGCTGGGCCTGAGCAAGGGGAGGCAGTGGGTACGACCACGGCCagcggcgcggccaattccgggaccgagccaaggcagcggcgggaggcggctttggtcacgggacgtcttcctgcggaggtgggcccgtcgacagtcatggaaatgacctcttccccaccggaggaggacgagaccgacgggtggcggcggat
GD4HFQN03F00HK tcagacgagtgcgtCCATTACGGCCGAAAACCTCACCTTACCTCAACATAAAACAATACAACACAGCAACACTCAAACCCCACACCCTCTCAATAACCAACAACAAATCTTGAAAAACATGTCTTCCGGCCTCACCAACGAAAAGGGCCAAGGCGCCTCGCACGCAACCCAGTCCTCCGTCCCCGGCAAAGTCCAGGACAAGGCCCCCAAGTCC
GD4HFQN03F00IW tcagacgagtgcgtGCTCTCGCCGGATCTAGATTGGTGGTTCCATGGTCCGCCGTCACACCAGTTCTTGATGCGATTGCTCACCGGTGTTTCATCACCAGACGACGTCGATTTCCAATTTCAGCCCCAATCCTTCGCCTCATTCGGACCCACCGTCCTTGTCGAAGGCTGCGATTCCGATCTGTCCATAGCTTGGGTTCACGCCTGGACCGTGACGGATGGGATAATTACCCAGGTTAGGGAGTACTTCAACACTTCCCTCACCGTCACCCGCCTCGGCCCGTTACGTAGCGTATCGTTGacctgagactgccaaggcacacaggggataggn

the ORF ranges

GD4HFQN03F001A 176,445
GD4HFQN03F00HK 119,409
GD4HFQN03F00IW 70,303

______________________________________NEW_______________________________________________

the 454 assembly with the ORF number appended at end csv format

Fiesta accessions,GHA sequence,Field3
>GD4HFQN03F001A,tcagacgagtgcgtGCGCACCGGACCAGAAAGGCCAGGCTGGGCCTGAGCAAGGGGAGGCAGTGGGTACGACCACGGCCagcggcgcggccaattccgggaccgagccaaggcagcggcgggaggcggctttggtcacgggacgtcttcctgcggaggtgggcccgtcgacagtcatggaaatgacctcttccccaccggaggaggacgagaccgacgggtggcggcggatcttcgcgtcctcgttggtcttcttgccttcggcgaggagggaggcttacggaactcctcgcagtcgccgaggacgaggttcgatgtggcggtcgaaggccgatgaacttgccgacaagctggcggaccatcctggatggtcgacccgccatgcggtagttgatgtactggaggcatcttggagctctctgagactgccaaggcacacaggggagtaggnnn,445
>GD4HFQN03F00HK,tcagacgagtgcgtCCATTACGGCCGAAAACCTCACCTTACCTCAACATAAAACAATACAACACAGCAACACTCAAACCCCACACCCTCTCAATAACCAACAACAAATCTTGAAAAACATGTCTTCCGGCCTCACCAACGAAAAGGGCCAAGGCGCCTCGCACGCAACCCAGTCCTCCGTCCCCGGCAAAGTCCAGGACAAGGCCCCCAAGTCCGTCGAAGACAAGCTCCCCGACGCCGTCCACGACGCCGGCAGCAACCCTCACACCGGCAAGGTTTCGCACGCCACCGGCGACTCCAAGGTGCCGCAGACGCTGCAGGAGGGCCTGCCGGAGAAGGTCGAGAAGATCGTGCCCAACAAGCTGCACGACACCAAGGGCGCGACTTTCAGCGATGGATCTGTTGGCAAGTAAAAAGACCAATTCCATTCGGTGGGTTTAGTATGACTATGATTTGTATGATAGAAATTATgacttgaaatggaagatttgacgaacgactgagactgccaaggcacacaggggataggnnn,409

__________________________FINAL CODES THAT WORK___________________________________________________ the code that deleted all sequence but 5'UTRs

import csv
gseq=0;orfn=0
lines = csv.reader(open('test2.1GHA'))
lines.next()
for line in lines:
    ID=line[0]
    gseq=line[1]
    orfn=line[2]
    lseq=list(gseq)
    utrstart=int(orfn)
    del lseq [:utrstart]
    utrseq=''.join(lseq).upper()
    utrseq.upper()
    print ID, '\n', utrseq,'\n', orfn
    #i then copy and pasted results into a  Notepad txt file
    #i need help with the output to a file, seemed to only write one line when i tried

this finds the short reads in the 5'UTRs

from Bio import SeqIO               # Bio Python
fh=open('test2GHA.txt')
for record in SeqIO.parse(fh, 'fasta'):
    id=record.id
    seq=record.seq
    useq=seq.upper()
    tfh=open('test5tags.txt')
    taglines=tfh.readlines()
    for line in taglines:
        tag= line.upper()
        lntag=tag.replace('\n',',')
        seplntag=lntag.split(',')
        for x in seplntag:
            sonetag=''.join(x)
            if useq.find(sonetag) >=1:
                print id, sonetag
            else:
                pass

i hope this might help some other newbies out there.

thanks again guys, for the help

FINAL TWEAK i need it to output the position of the tag in the UTR (index dosent work) or just find the tag closest to the end of the UTR sequence instead of all of them.
suggestions?

got it sorry..

   if useq.find(sonetag) >=1:
                ftag=useq.find(sonetag)
                print id, sonetag, ftag+1  ##position would be nice
            else:
                pass
gene python orf fasta • 2.5k views
ADD COMMENTlink modified 3.4 years ago by Biostar ♦♦ 20 • written 8.1 years ago by Nico30
1

Hi Nico. One pre-requisite of doing anything in Python or any other language is to actually know that language. I suggest you decide which language would be beneficial for you and that you take time to learn it. People on a forum can be extremely helpful, but they won't do the work for you, only make suggestions to help you get moving. Cheers

ADD REPLYlink written 8.1 years ago by Eric Normandeau10k
1

Nico, not at all. If you have code already, it would be a very helpful starting point for answers; no worries about how presentable it is. Learning to program is hard work; having people look at your code and offer suggestions is a valuable way to get better.

ADD REPLYlink written 8.1 years ago by Brad Chapman9.3k
1

Your 454 'assemblage' looks a lot like individual reads. The read ID is typical of 454 reads, the first four bases (tcag) are the key sequence and the next ten lower case bases are one of the MIDs from 454...

ADD REPLYlink written 8.1 years ago by lexnederbragt1.2k

Your approach is correct: trim the 454 contigs by removing the ORF regions; then map the tags to these trimmed contigs with a short read aligner. Can you include the code you've written so far to do the trimming? This will help someone provide you with a useful answer.

ADD REPLYlink written 8.1 years ago by Brad Chapman9.3k

I apologize for coming off this way in my post, I have been attempting to learn python, I started with “Bucky’s” youtube tutorial. I then read Allen Downey’s book. Yet, I still think like a human biochemist, ha! Then I found Paulo Nuin’s blog and I started Sebastian Bassi’s book. I am feeling some pressure from my committee and was discouraged, maybe a little desperate when I posted this. I meant no disrespect to this form. I have some code written up and when I feel like it is more presentable I will post it. I look forward to your suggestions. Thanks, nico

ADD REPLYlink written 8.1 years ago by Nico30

Hi. From what you describe, I guess you have already covered much ground in your learning of Python. Do show us the code! :)

ADD REPLYlink written 8.1 years ago by Eric Normandeau10k

Ok, after a little work I have added some code to my post. I found that I only need the 5'UTR, so I appended the highest ORF number to the 454 assembly in a csv format (using an make table Access Query, sorry…) then I ran the first code I posted. This output the 454 assembly 5’UTR only. I was pumped!(needs work on the out) As for the second code that finds the 20bp tags in the newly modified assembly, I wrote a version of this that works when tags are entered one at a time. I have 6milloin-ish so I’ve tried to modify it to take my file, to no prevail, and help would be very much appreciated!

ADD REPLYlink written 8.1 years ago by Nico30

Ok, after a little work I have added some code to my post. I found that I only need the 5'UTR, so I appended the highest ORF number to the 454 assembly in a csv format (using an make table Access Query, sorry…) then I ran the first code I posted. This output the 454 assembly 5’UTR only. I was pumped! As for the second code that finds the 20bp tags in the newly modified assembly, I wrote a version of this that works when tags are entered one at a time. I have 6milloin-ish so I’ve tried to modify it to take my file, to no prevail gonna keep at it let me know if you guys have any ideas. cheers

ADD REPLYlink written 8.1 years ago by Nico30

Ok, after a little work I have added some code to my post. I found that I only need the 5'UTR, so I appended the highest ORF number to the 454 assembly in a csv format then I ran the first code I posted. This output the 454 assembly 5’UTR only. I was pumped! As for the second code that finds the 20bp tags in the newly modified assembly, I wrote a version of this that works when tags are entered one at a time. I have 6milloin-ish so I’ve tried to modify it to take my files, to no prevail.. im going to keep at it, but let me know if you guys have any ideas. cheers

ADD REPLYlink written 8.1 years ago by Nico30

I’m having a problem now with several tags showing up in one UTR, I need only the one nearest to the end of the UTR sequence, and .index doesn’t work because Seq (embedded in biopython) doesn’t have attribute .index any suggestions? Possibly just an addition to the print call?

ADD REPLYlink written 8.1 years ago by Nico30

nevermind got it sorry

ADD REPLYlink written 8.1 years ago by Nico30
5
gravatar for Brad Chapman
8.1 years ago by
Brad Chapman9.3k
Boston, MA
Brad Chapman9.3k wrote:

Nico, thanks for posting your code. I didn't have a chance to go through the code in detail, but for your current 'yep/nope' issue, you want to change:

if tag in useq:

to

if useq.find(tag) >= 0:

The 'a in b' syntax tests whether the item a is contained within the list b. If I understand your intention correctly, you want to determine if the tag is found within the larger useq. For this, use the string.find(c) syntax to search through the string for any instances of c.

ADD COMMENTlink written 8.1 years ago by Brad Chapman9.3k

This seems to be a step in the right direction, output didn’t change but I’m looking further in to this change. (tag) represents a file of~2million tags

ADD REPLYlink written 8.1 years ago by Nico30

Nico, glad this helps. Beyond the actual code, it would be worth trying a different approach to solve your problem. It sounds like you are re-writing a short-read aligner. Once you have your ORFs prepared as a FASTA file, use an aligner like bowtie (http://bowtie-bio.sourceforge.net/) or bwa (http://bio-bwa.sourceforge.net/) to map the reads to the genome. These you can use Python with pysam (http://code.google.com/p/pysam/) to get the information you want from the alignments.

ADD REPLYlink written 8.1 years ago by Brad Chapman9.3k

Yes but far less complex, the 454 data is not a genome ie. It has no chromosomal data with it, I am going to explore these suggestions Thanks so much for your input so far

ADD REPLYlink written 8.1 years ago by Nico30

Short-read aligners are designed to rapidly search a large number of small sequences against a base sequence. The base does not have to be "traditional" chromosome assemblies to take advantage of this. I'm just mentioning this because you will find Python slow for searching millions of tags against lots of base sequences.

ADD REPLYlink written 8.1 years ago by Brad Chapman9.3k

I got it!! Thanks a so much for your direction, I did rewrite my own short-read aligner but being as newbish as I am it was actually easier for me the manipulating someone else’s code. I have posted the code, still would be nice to have one scrip to run it all and the needs some work on the output but these things will come.

ADD REPLYlink written 8.1 years ago by Nico30

I got it!! Thanks a so much for your direction, I did rewrite my own short-read aligner but being as newbish as I am it was actually easier for me then manipulating someone else’s code. I have posted the code, still would be nice to have one scrip to run it all and the needs some work on the output but these things will come.

ADD REPLYlink written 8.1 years ago by Nico30

I’m having a problem now with several tags showing up in one UTR, I need only the one nearest to the end of the UTR sequence, and .index doesn’t work because Seq (embedded in biopython) doesn’t have attribute .index any suggestions? Possibly just an addition to the print call?

ADD REPLYlink written 8.1 years ago by Nico30

got it sorry...

ADD REPLYlink written 8.1 years ago by Nico30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 736 users visited in the last hour