Manipulation Of Large Numbers Of Sequences Using Python Or Matlab
10.5 years ago
zeropoint1 ▴ 10

Hello,

I have millions of lines in the following format:

number sequence

The number is how many times the sequence appears in my library, and it is followed by the sequence itself. A typical line might look like:

25565 AGTGCATTTTGGTTTAGGCATGA

Thus, this particular and fictitious sequence shows up 25565 times in my library.

I need to manipulate this data in the following way:

1) Confirm that the final 5 letters are correct (in this case, CATGA) and if not, remove the line.

and then

2) Remove the final 5 letters from all of the sequences on every line.

I have been trying to figure out how to load this information either into Python as a dictionary or directly into MATLAB.

It would be very helpful to know whether this would be best approached with MATLAB, Python, or something else. Also, what is the best way to load the data from the text file into a dictionary in Python?

Thanks!

python matlab ngs • 3.9k views
10.5 years ago

If you have millions of lines, it is better to stream through the file one line at a time than to read the entire file into a data structure (Python dictionary, array...). You can do this with something like:

with open('inputFile.txt') as inFile:  # the file is closed automatically
    for line in inFile:
        data = line.strip().split()
        count = int(data[0])
        sequence = data[1]
        # do something with your count and sequence variables
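
If a dictionary is still wanted at the end, it can be filled while streaming. A minimal sketch, assuming every line is `count sequence` separated by whitespace; `load_counts` is a hypothetical helper name, and summing the counts when a sequence appears more than once is my assumption about what you would want:

```python
from collections import defaultdict


def load_counts(lines):
    """Build {sequence: count} from an iterable of 'count sequence' lines."""
    counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines (assumption)
        count, sequence = fields
        counts[sequence] += int(count)  # sum counts if a sequence repeats
    return dict(counts)


# Any iterable of lines works, including an open file object.
example = ["25565 AGTGCATTTTGGTTTAGGCATGA\n", "\n", "12 ACGTACGTACATGA\n"]
print(load_counts(example))
```

With a real file this would be `with open('inputFile.txt') as f: counts = load_counts(f)`. Keep in mind that with millions of distinct sequences the dictionary itself can take a lot of memory, which is the reason for streaming in the first place.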

I don't understand what you mean by "Confirm that the final 5 letters are correct". What do you mean by correct?


I guess they might be looking for a primer or tag sequence.


If the last 5 letters don't match an expected string, then the line must be discarded. A mismatch means that the sequence was misread and should not be considered. Thank you for your advice.


If s is a string in Python, then the last five letters are just s[-5:]. So:

if sequence[-5:] == 'CATGA':
    pass  # do something
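
Putting the check and the trimming together, here is a minimal sketch. The function name `filter_and_trim` and the default tag 'CATGA' are only for illustration, and it assumes each line is `count sequence` separated by whitespace:

```python
def filter_and_trim(lines, tag="CATGA"):
    """Yield (count, trimmed_sequence) for lines whose sequence ends with tag."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines (assumption)
        count, sequence = int(fields[0]), fields[1]
        if sequence.endswith(tag):
            # Keep everything except the trailing tag.
            yield count, sequence[: len(sequence) - len(tag)]


example = ["25565 AGTGCATTTTGGTTTAGGCATGA", "7 AGTGCATTTTGGTTTAGGCATGC"]
for count, sequence in filter_and_trim(example):
    print(count, sequence)
```

Because it is a generator, it can be fed an open file directly and never holds more than one line in memory.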
10.5 years ago
Song Qiang ▴ 40

You can use a sed one-liner. If the input file is in.txt and the output file is out.txt, run:

sed -n '/CATGA$/ s/CATGA$//p' < in.txt > out.txt
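
Here -n turns off sed's default printing, /CATGA$/ selects only lines that end in CATGA, and s/CATGA$//p removes that ending and prints the result. For anyone who would rather stay in Python, the following is a rough equivalent. It is a sketch, not a drop-in replacement, and `trim_file` is a made-up name. Like the sed command, it looks at the whole line, so a line with trailing spaces would fail the check.

```python
def trim_file(in_path, out_path, tag="CATGA"):
    """Copy lines ending with tag from in_path to out_path, minus the tag."""
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.rstrip("\n")
            if line.endswith(tag):
                # Write the line without its trailing tag.
                dst.write(line[: len(line) - len(tag)] + "\n")
                kept += 1
    return kept  # number of lines written
```

Calling `trim_file('in.txt', 'out.txt')` would then do the same job as the command above and return how many lines were kept.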


Very nice sed one-liner!
