Question: Help with randomly sampling from a fasta file
0
gravatar for tpaisie
2.1 years ago by
tpaisie70
University of Florida
tpaisie70 wrote:

Hey guys, I'm trying to make a script that will take a fasta file with many sequences from one patient at different time points and then randomly sample one sequence from each time point. For example here are some sequence title names:

01P03Pr01
01P03Pr02
01P03Pr03
01P03Pr04
01P03Kr01
01P03Kr02
01P03Kr03
09P03Pr01
09P03Pr02
09P03Pr03
09P03Kr01
09P03Kr05

Then these are the random sequences that were taken out of the larger fasta file and put in a new one:

01P03Pr02
01P03Kr01
09P03Pr03
09P03Kr05

Hopefully that makes sense. I'm a beginner with coding in python and want to improve so I would appreciate a nudge or some help. I'm using random.sample in my script and i'm not really sure where to start in terms of whether or not to make a dictionary or index, or none of those. Any help would be appreciated!!!

sequencing python fasta • 1.6k views
ADD COMMENTlink modified 2.1 years ago by genomax65k • written 2.1 years ago by tpaisie70
2

Have a try seqtk sample

ADD REPLYlink written 2.1 years ago by pengchy410

Are all the sequences in a single FASTA file? If so, you may explain the structure of sequence names

01P03Pr04

What are the time points and patient ID and body parts? 01, P03Rr, 04??

If not, just sampling every file.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by shenwei3564.5k

so for the sample you wrote its:

01 = month of sampled P03 = Patient ID P = kind of tissue r = RNA 04 = sequence number

I need a script that prompts me for how many sequences I would like from each tissue. The example I put in the OP is asking for one randomly sampled sequence per tissue per time point.

ADD REPLYlink written 2.1 years ago by tpaisie70

read sequences using Biopython, parse the name to get month and tissue information, save sequences in dict with structure like seqs[month][tissue]=list of sequences, and sample for per tissue per time point.

ADD REPLYlink written 2.1 years ago by shenwei3564.5k

Thanks for the tip! I will try to write a script for this and if it doesn't work i'll post my script.

ADD REPLYlink written 2.1 years ago by tpaisie70
0
gravatar for genomax
2.1 years ago by
genomax65k
United States
genomax65k wrote:

Use reformat.sh from BBMap suite.

To sample 10% of reads

reformat.sh in=reads.fa out=sampled.fa samplerate=0.1

and for exact sampling (100 reads):

reformat.sh in=reads.fa out=sampled.fa samplereadstarget=100

ADD COMMENTlink written 2.1 years ago by genomax65k

So I don't want to just sample 10% of reads nor are these fastq files, its a MSA fasta file. I'll have 100 sequences in my fasta file from one patient, already aligned, and 25 sequences will be from the same patient from the same body part at different time points. So that means there would be 4 different sets of 25 sequences from the same patient each from a different body part taken at different time points. I want to randomly sample 1 sequence from each set of 25 and end up with 4 sequences in my output fasta file. Does that make sense?

ADD REPLYlink written 2.1 years ago by tpaisie70
1

If you noticed in the example I provided above I am using fasta (.fa) files. Since aligned fasta files are not a standard format (and depending on how they are broken e.g. line wrapping) reformat.sh may (or may not) work. Won't be that much effort to try since BBMap does not need any install or dependencies (besides java). If you have 4 sets of 25 sequences then you should split them into 4 files and then sample 1 sequence out of the 4 files independently reformat.sh in=reads_1.fa out=sampled_1.fa samplereadstarget=1 I don't recollect if BBMap will append a file but you can always cat those 4 sampled files together.

ADD REPLYlink written 2.1 years ago by genomax65k

Yeah I know i can do that, but the point is for me to write a python script for doing what I want to do from one fasta file.

ADD REPLYlink written 2.1 years ago by tpaisie70
2

If you are writing a script yourself then it is generally customary to include that as well in your post, if you are running into difficulties. Also make it clear in the original post (by editing it) that you want to do this yourself and don't want ready made/other solutions.

We are generally suspicious when solutions based on a specific technology (e.g. python) are requested when others could be easily used. But learning is always encouraged on Biostars and as long as help is not being requested for homework assignments people will chip-in to help.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by genomax65k

How can we recognize the times points and body parts from the sequence names, you should tell us all.

ADD REPLYlink written 2.1 years ago by shenwei3564.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 798 users visited in the last hour