random sampling genome.fa
1
0
Entering edit mode
3.6 years ago
Chironex ▴ 50

I have a genome of about 2 gb composed by scaffolds I would random sample the genome.

I used reformat.sh but the output was only a scaffold. I need 1/3 of the total genome... I report only some scaffolds as example:

>LGKD01000001.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold4_contig_1, whole genome shotgun sequence
GAACAGCATGAATGTTAAAACtgaaatggatgatgatgatgatgatgatgatgatgatggcagcaacAGCCatgattatatttaatatgttgttagttataatcataataatgatgataatgttgataacaaTAATGGTTGCAATAATG
>KQ415657.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 unplaced genomic scaffold Scaffold5, whole genome shotgun sequence
tatatatatatagtcaattcgagGATGTTAGATCGACAATGGGGATTATAGAATCCCACAAAAAATTCCACTGGT
>LGKD01000032.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold12_contig_1, whole genome shotgun sequence
GAAGTGGTAAAGAGTgcgatgcgctgaaaaaagagagaacagtacttgaaatGTGGTTTCATTCTagtagtaaat
>LGKD01000033.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold16_contig_1, whole genome shotgun sequence
ctgaTCAACAGAatagggccaatcattcttcatgacaatgctcgaccacacgttttaCTAATGA
>LGKD01000034.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold22_contig_1, whole genome shotgun sequence
TTATCTATATACGagaatattatctatatataaaggaataccaaaaaaacaagaacaacgggtcattcggaattttcttt


There is a script able to do this?

genome fasta • 1.5k views
0
Entering edit mode

You would need to collapse the entire genome sequence into one long fasta and then sample if you truly need 1/3 of total.

Edit: Can you provide exact command line you used?

0
Entering edit mode
reformat.sh in=my file.fa ou=newfile.fa samplerate=0.3

0
Entering edit mode
3.6 years ago

Something like this would do:

$NSAMPLE=2 # Number of sequences you want$ cat genome.fa  | tr '\n' '\t' | tr '>' '\n' | grep -v "^$" | sort -R | head -n$NSAMPLE | sed -e 's|^|>|g'  | tr '\t' '\n' | grep -v "^$" >LGKD01000001.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold4_contig_1, whole genome shotgun sequence GAACAGCATGAATGTTAAAACtgaaatggatgatgatgatgatgatgatgatgatgatggcagcaacAGCCatgattatatttaatatgttgttagttataatcataataatgatgataatgttgataacaaTAATGGTTGCAATAATG >LGKD01000034.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold22_contig_1, whole genome shotgun sequence TTATCTATATACGagaatattatctatatataaaggaataccaaaaaaacaagaacaacgggtcattcggaattttcttt  The idea is to convert every sequence as one line including the header and the sequence itself, separated by tabs. sort -R gives you a random order, and head -n returns a certain number of sequences (i.e. samples since your input is shuffled) and the rest just converts it back to the format it had before (convert tabs back to new lines.) You can figure out the NSAMPLES to get 1/3 of the total by doing grep -c ">" genome.fa and dividing that number by 3. ADD COMMENT 0 Entering edit mode I have a feeling OP is not looking to get a third of fasta sequence entries but rather a third of the actual sequence. It is kind of an odd request if that is true. Perhaps OP will clarify. ADD REPLY 0 Entering edit mode Oh interesting. Possible solution for something like that then: # Randomly get 1/3 of each non-header sequence awk ' BEGIN{ srand(); }; /^[^>]/ {third=(length($1)/3); offset=(rand() * (third * 2) ); $1=substr($1,offset, offset + third) } 1' genome.fa

0
Entering edit mode

Yes, I need only a part of the fasta file. Infact I'm using RepeatScout to find ripetitive elemnts, but the genome is 2 gb and it doesn't work with such a big file. So, the idea is to obtain only a part of the genome because ripetitive sequences are interspersed in the genome

0
Entering edit mode

Yes, I need only a part of the fasta file.

I don't think that is right. You seem to have a multi-fasta file so you need as many entries as needed to make roughly third of total (e.g. if you had 15 fasta entries then you would make 5 x 3 files). That is going to be complicated by fact that the sequences are not of equal length, correct?

Why not approximately cut the file into three (or how many ever) pieces?

0
Entering edit mode

Hi genomax, I'm the user fripeki but I could not post anymore with that account (reaching max limit of 5 post for day). I Appreciated your idea. Can you suggest me a command to cut the file into 3 approximately equal parts? Sorry I'm learning bioinformatic so I don't know all the solutions

1
Entering edit mode

faSplit utility from Jim Kent will also work (linux version linked, do chmod a+x faSplit after downloading). This will be a clean solution compared to split but it may not make the files exactly equal in size (depending on what you have in your files). Change number 3 to desired pieces.

$./faSplit sequence genome.fa 3 splt_  this will generate files splt_0.fa splt_1.fa splt_2.fa  ADD REPLY 0 Entering edit mode ok! thank you very much for your help! ADD REPLY 0 Entering edit mode You can use the unix command split to split the files. Adjust the number 3 to suit your purpose. # Following will split the file into three equal pieces. Split files will get names with prefix (spl_)$ split -n 3 genome.fa spl_

# this will result in three equal size pieces

spl_aa
spl_ab
spl_ac


This will most likely split the file in a way that the second and third files will not start with a valid fasta header. You can look at the file 2 and 3 by

$head -3 splt_ab  to confirm this. You can insert a valid fasta header by doing the following for file 2 (and 3 etc) $ sed -i '1 i\>begin_file2' spl_ab
\$ sed -i '1 i\>begin_file3' spl_ac