Question: Sample Sam File
9
gravatar for Steffi
7.3 years ago by
Steffi560
Germany
Steffi560 wrote:

Hi,

I have a sam file which contain alignments of about 30 million reads. I just would like to randomly sample e.g. 1 million alignments out of my sam file. Whats the best way to do this? Anything already available in any kind of toolkit? Or bash scripting? Or Perl?

thanks a lot, Steffi

sam random • 13k views
ADD COMMENTlink modified 5.1 years ago by Biostar ♦♦ 20 • written 7.3 years ago by Steffi560
12
gravatar for Fabio Marroni
6.1 years ago by
Fabio Marroni2.3k
Italy
Fabio Marroni2.3k wrote:

even easier and faster:

samtools view -b -s 0.1 infile.bam > ten_percent_of_bam_file.bam

See here https://groups.google.com/forum/#!topic/bedtools-discuss/gf0KeAJN2Cw

ADD COMMENTlink written 6.1 years ago by Fabio Marroni2.3k

Excellent! I wish I had known this before!

ADD REPLYlink written 5.8 years ago by Erik Garrison2.2k

I could not find the -s option on samtools. When I tried using, gave me an error 'invalid option'. Would you know the reason. I am using samtools-0.1.16

ADD REPLYlink written 5.6 years ago by sheetalshetty1360

Version is important. I just found out -s option is not available on version 0.1.16 Also option is cryptic: so I had to write it as -s -1.5 ( to get 50% of reads) -s -1.7 ( to get 70% of reads) 1 is the seed and .5/.7 the percent of reads for sub sampling

ADD REPLYlink written 5.6 years ago by sheetalshetty1360
1

I think I found a problem with samtools -s. 

In version 0.1.19 if you extract two samples using the same random seed you always obtain the same reads.

So if you have several samples to extract, remember to change the random seed. As @sheetalshetty13 pointed out, the random seed is the part of the number before the point. So 0.5 will extract 50% with random seed "0" and 1.5 50% of the reads with random seed "1". 

samtools view -b -s 1.001 delPN01_deb_PU.bam > tmp.bam

samtools view -b -s 1.001 delPN01_deb_PU.bam > tmp2.bam

samtools view -b -s 11.001 delPN01_deb_PU.bam > tmp3.bam

samtools view tmp.bam | head

chr1_22009_22554_0:0:0_0:0:0_94d2    99    chr1    22009    60    100M    =    22455    546    CCTCTCAAAATCTGGGGATTGGAGGCCTAGTAGTAATGGCCTCATTTTGAAGGAGTTGGGAGAAGGAGTGGCCAGCAACCTGGAAGTGATGTTCTCTGAG    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_22009_22554_0:0:0_0:0:0_94d2    147    chr1    22455    60    100M    =    22009    -546    TTCTGAACGCCGTTCTTATTGCTAACGAAACCCTTGATTCTAGATTGAAAGACAACAAACCGGGTCTCCTTCTCAAGATGGACATTGAGAAAGCTTTTAA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_28461_29062_0:0:0_0:0:0_ac8c    163    chr1    28461    60    100M    =    28963    602    ATTGATGACTAGGTTCTCTTTCTCTAAGTGAATAGAAAGAATAGACTGTCAAAAGTGAAGTCCAAACCCCTAAAAAAGGTTTATATAAGCTTTTAGTAGG    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_28461_29062_0:0:0_0:0:0_ac8c    83    chr1    28963    60    100M    =    28461    -602    CATTAAATTTCCCTCAATAATCACTTTTGATAAAACAATAAATTTGATTCTTTTGAATGATTAGGTGAAACAATTCTAATTATAAAAATATGAAAAATAT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_42712_43232_0:0:0_0:0:0_e728b    99    chr1    42712    60    100M    =    43133    521    TATAATTGTATTACATGTGCATTTGGGGTGTAAATGAACTAGATTTCCTTCTGTTTCAGATTATAGATTTGGCATTTCAACTTTTCATTTCTAATATCTG    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_42712_43232_0:0:0_0:0:0_e728b    147    chr1    43133    60    100M    =    42712    -521    TATAAGTATGTACGTAATGCGCCTCACAAACTGAGCCCATCATTCATAGAAACCAGGCAAGATCAACTTTGGTGCAAAGCAGCCCGATCACTTAAATTAG    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_117328_117920_0:0:0_0:0:0_35e4    99    chr1    117328    60    100M    =    117821    593    ACTATAGGTCTTTCACTGTTTGAGTTGTAGGATTTTCTATGAAGGCCACAAATGGCAGCGAAAATGCCAAATCATGACTGGCAGATTTTTCCTTAATACA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_117328_117920_0:0:0_0:0:0_35e4    147    chr1    117821    60    100M    =    117328    -593    GTAATGTATGTATTAGTCTCACCTACGCACTGAAGTGAACGAATTATTTCTCATAATGTATCCCTCCCCATTTTTCTGCAGTATGTACAAATATTAGGCA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_134044_134539_0:0:0_0:0:0_4010f    99    chr1    134044    60    100M    =    134440    496    GCAATCGTTTCTAATCAAATTGTGTTAAGTTTTTGTGTTTTTGATAGCTTTTAGCTACACCAAAGCAATCCAAGAATGAAGGAGAGCTGTATATAGTTCA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:23    AM:i:23    X0:i:1    X1:i:1    XM:i:0    XO:i:0    XG:i:0    MD:Z:100    XA:Z:chr13,-17163169,100M,1;
chr1_134044_134539_0:0:0_0:0:0_4010f    147    chr1    134440    60    100M    =    134044    -496    ATCCAGCTCATTTCCATCCGGGTAGGAATTCCATCCGGGTAGAAATGCCATCCGGGTAGGATTTCCATCCGGGTAGAAATTCCATCCGGGTAGAAATTCC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:23    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100

samtools view tmp2.bam | head
chr1_22009_22554_0:0:0_0:0:0_94d2    99    chr1    22009    60    100M    =    22455    546    CCTCTCAAAATCTGGGGATTGGAGGCCTAGTAGTAATGGCCTCATTTTGAAGGAGTTGGGAGAAGGAGTGGCCAGCAACCTGGAAGTGATGTTCTCTGAG    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_22009_22554_0:0:0_0:0:0_94d2    147    chr1    22455    60    100M    =    22009    -546    TTCTGAACGCCGTTCTTATTGCTAACGAAACCCTTGATTCTAGATTGAAAGACAACAAACCGGGTCTCCTTCTCAAGATGGACATTGAGAAAGCTTTTAA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_28461_29062_0:0:0_0:0:0_ac8c    163    chr1    28461    60    100M    =    28963    602    ATTGATGACTAGGTTCTCTTTCTCTAAGTGAATAGAAAGAATAGACTGTCAAAAGTGAAGTCCAAACCCCTAAAAAAGGTTTATATAAGCTTTTAGTAGG    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_28461_29062_0:0:0_0:0:0_ac8c    83    chr1    28963    60    100M    =    28461    -602    CATTAAATTTCCCTCAATAATCACTTTTGATAAAACAATAAATTTGATTCTTTTGAATGATTAGGTGAAACAATTCTAATTATAAAAATATGAAAAATAT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_42712_43232_0:0:0_0:0:0_e728b    99    chr1    42712    60    100M    =    43133    521    TATAATTGTATTACATGTGCATTTGGGGTGTAAATGAACTAGATTTCCTTCTGTTTCAGATTATAGATTTGGCATTTCAACTTTTCATTTCTAATATCTG    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_42712_43232_0:0:0_0:0:0_e728b    147    chr1    43133    60    100M    =    42712    -521    TATAAGTATGTACGTAATGCGCCTCACAAACTGAGCCCATCATTCATAGAAACCAGGCAAGATCAACTTTGGTGCAAAGCAGCCCGATCACTTAAATTAG    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_117328_117920_0:0:0_0:0:0_35e4    99    chr1    117328    60    100M    =    117821    593    ACTATAGGTCTTTCACTGTTTGAGTTGTAGGATTTTCTATGAAGGCCACAAATGGCAGCGAAAATGCCAAATCATGACTGGCAGATTTTTCCTTAATACA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_117328_117920_0:0:0_0:0:0_35e4    147    chr1    117821    60    100M    =    117328    -593    GTAATGTATGTATTAGTCTCACCTACGCACTGAAGTGAACGAATTATTTCTCATAATGTATCCCTCCCCATTTTTCTGCAGTATGTACAAATATTAGGCA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_134044_134539_0:0:0_0:0:0_4010f    99    chr1    134044    60    100M    =    134440    496    GCAATCGTTTCTAATCAAATTGTGTTAAGTTTTTGTGTTTTTGATAGCTTTTAGCTACACCAAAGCAATCCAAGAATGAAGGAGAGCTGTATATAGTTCA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:23    AM:i:23    X0:i:1    X1:i:1    XM:i:0    XO:i:0    XG:i:0    MD:Z:100    XA:Z:chr13,-17163169,100M,1;
chr1_134044_134539_0:0:0_0:0:0_4010f    147    chr1    134440    60    100M    =    134044    -496    ATCCAGCTCATTTCCATCCGGGTAGGAATTCCATCCGGGTAGAAATGCCATCCGGGTAGGATTTCCATCCGGGTAGAAATTCCATCCGGGTAGAAATTCC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:23    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100

samtools view tmp3.bam | head
chr1_16789_17289_0:0:0_0:0:0_f3e5e    163    chr1    16789    60    100M    =    17190    501    CCTTAAATTAACTTTAAGAAGTTTAGAATAAAGGAAAAATATCAAAACAAATTTACATAATCACCCACAAATTTATTTTTACAAAAATATTTCAAGCGTC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_16789_17289_0:0:0_0:0:0_f3e5e    83    chr1    17190    60    100M    =    16789    -501    CTTTCATAATCGTAAGGAACCTTATGCGAACAAAAAATTACTCCTTTATTTGGAGTTATGAATTTTTATTCATGTAAACCCGAAAATCAAAAACCTGTTT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_32504_32881_0:0:0_0:0:0_dd53a    163    chr1    32504    60    100M    =    32782    378    GCTCTGTTTGCCGGCTCGCATTTCCGGTTCTGGATTGTTCTTGATTTTCTTTGCTTTGGGCATTTCGTGGAGGTGGGGATTTCAGGGTCTTGTAATTTTG    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:1    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:1    XO:i:0    XG:i:0    MD:Z:86A13
chr1_32504_32881_0:0:0_0:0:0_dd53a    83    chr1    32782    60    100M    =    32504    -378    TCCTCTGCTTTTGTCCATATGGTTTCAATAGCTTGATATGGATCATGGTCCCCAAATCCCCATCACATAAAAACCTCTTTTTGTCGTGGTTTAGGTTTCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_43520_44108_0:0:0_0:0:0_43024    163    chr1    43520    60    100M    =    44009    589    TTAGAAAATTAACCTGGAAAAAGGATGTCAAAATGCATGCGGTTATATAATTTCAGTTTAAATTTAAAATTTAATCATCAGACCAGATTTAAGGGAGAGT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_43520_44108_0:0:0_0:0:0_43024    83    chr1    44009    60    100M    =    43520    -589    CACGGACATGACTTATTCCATGACTGAGAAAAGGAACCAATTTAATTTGAACATCTTGACTGTGGTGTAAAATATTAATACTTCACGAGTGTTCACAATC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_110675_111174_0:0:0_0:0:0_48993    163    chr1    110675    60    100M    =    111075    500    CTATCTGCTCATCCACAGCAGACCTGTGTGCTTCAGTGCCCCACATTACACTGTAAATGAAGGGGTTTTTTTTGGCAATTTTGCAGAAAACCCAACGCAT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_110675_111174_0:0:0_0:0:0_48993    83    chr1    111075    60    100M    =    110675    -500    TAAAACTTCCGGTATCATAACATGAATGATGTTTGACACATCCATTAGTTTAACCATCCATCAATAATTCCAGCTAATGGTTGGCACTAAAATAACAAAA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_213719_214207_0:0:0_0:0:0_627bb    163    chr1    213719    60    100M    =    214108    489    TTTTAATTTTGTTCCTTTCATTCACTTTTCTTCACATCAATAAACATTGTTACCATAATGTGATCAAGGTGAAATTTTCGCTTATATATAAATCTTAAAC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_213719_214207_0:0:0_0:0:0_627bb    83    chr1    214108    60    100M    =    213719    -489    TTAAAAATATTTATAATTTCTTTTATGAAATGCAAATCAATGGCCTATTAGCTCAGTTGGTTAGAGCGTCGTGCTAATAACGCGAAGGTCGCAGGTTCGA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100

 

Interestingly, I noticed that if you use two digits random seeds you obtain different sample sizes (using the same probability).

 ls -l tmp*
-rwxrwx--- 1 marroni Domain Users 124458 Apr 15 11:52 tmp.bam
-rwxrwx--- 1 marroni Domain Users 124458 Apr 15 11:52 tmp2.bam
-rwxrwx--- 1 marroni Domain Users 119221 Apr 15 11:54 tmp3.bam

If you run samtools view -s on a file on which samtools view -s was already used you have unpredictable results

samtools view -b -s 1.5 tmp.bam > small.bam

samtools view -b -s 2.5 tmp.bam > small2.bam

-rwxrwx--- 1 marroni Domain Users   124458 Apr 15 12:18 small.bam
-rwxrwx--- 1 marroni Domain Users   124458 Apr 15 12:20 small2.bam
-rwxrwx--- 1 marroni Domain Users   124458 Apr 15 11:52 tmp.bam
-rwxrwx--- 1 marroni Domain Users   124458 Apr 15 11:52 tmp2.bam
-rwxrwx--- 1 marroni Domain Users   119221 Apr 15 11:54 tmp3.bam

 

I am currently not using samtools view -s.

 

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Fabio Marroni2.3k

Hi Fabio, 

I'm experiencing the same problem. You said you were not using samtools. 

Do you use picard DownsampleSam or some custom script to downsample a sam/bam file?

Thank you!

ADD REPLYlink written 4.6 years ago by biola10

In that situation I used a completely different approach. Since the data were simulated I just simulated fastq files of different sizes (I had to align all of them, but better than having wrong results...). 

DownsampleSam might be an option (I know that several people use it and they are happy with their results. I remember it being slow...)

The approaches suggested in the following answers are very good. I like the shuffle very much (you can remove the header and save it somewhere else and then add it back).

 

 

 

ADD REPLYlink written 4.6 years ago by Fabio Marroni2.3k

Hi!you said 1 is the seed, and what the seed means?

ADD REPLYlink written 4.6 years ago by 51278852220
5
gravatar for Sukhdeep Singh
7.3 years ago by
Sukhdeep Singh9.8k
Netherlands
Sukhdeep Singh9.8k wrote:

Hi, lets start using shell

###remove header and save the sam file head

sed 1,23d file.sam > file_noHeader.sam
head -n 23 file.sam > head

###randomly select one Million reads and save them (I took the one liner from here)

awk 'BEGIN {srand()} {printf "%05.0f %s \n",rand()*9999999, $0; }' file_noHeader.sam | sort -n | head - 1000000| sed 's/^[0-9]* //' > randomReads.tmp

###join the header back to have randomly sampled million reads subset of original file

cat head randomReads.tmp > randomReads.sam

###remove the tmp files

rm  file_noHeader.sam randomReads.tmp

Ofcourse it can be more efficient and automated using pipes.

Also, you can save the script in a file, and replace the file name with $1, like

awk 'BEGIN {srand()} {printf "%05.0f %s \n",rand()*9999999, $0; }' $1 | sort -n | head - 1000000| sed 's/^[0-9]* //' > $1.tmp

and then call it like sh rand.sh file.sam, it will produce file.tmp

Cheers

ADD COMMENTlink modified 7.3 years ago • written 7.3 years ago by Sukhdeep Singh9.8k
2
gravatar for Wen.Huang
7.3 years ago by
Wen.Huang1.2k
Wen.Huang1.2k wrote:

how about

shuf your.sam | head -n 1000000
ADD COMMENTlink written 7.3 years ago by Wen.Huang1.2k
3

SAM files have a header section, as well as a body section, thus simply shuffling the file would mess up the overall structure. You could break the file into parts, as Sukhdeep suggests, and then shuf the body part.

ADD REPLYlink written 7.3 years ago by seidel6.8k

You can do like that from a BAM file:

samtools view -H your.bam > header.txt

samtools view your.bam > your.sam

shuf your.sam | head -n 1000000 > small.sam

(Because I don't know if you can pipe to shuf, maybe you can and we do not need to create the file small.sam) 

 

ADD REPLYlink written 4.6 years ago by Fabio Marroni2.3k

actually, I just found you can use:

shuf your.sam -n 1000000 > small.sam

The option to limit the output is already given by shuf.

ADD REPLYlink written 2.9 years ago by Fabio Marroni2.3k
2
gravatar for Matt Shirley
5.9 years ago by
Matt Shirley9.1k
Cambridge, MA
Matt Shirley9.1k wrote:

Have you taken a look at Picard DownsampleSam? You just set PROBABILITY=n and you should get near n * 100 % of the original number of reads.

ADD COMMENTlink written 5.9 years ago by Matt Shirley9.1k
0
gravatar for Istvan Albert
7.3 years ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

A simple workaround would be to sort your sam file by read names see the -n flag to samtools sort, then take the first, second etc 1 million to get a random sample without replacement.

In all honesty this approach most likely has some limitations since the read naming probably correlates to flow cell location that in turn may introduce some biases. I still think it could be useful and it is really easy to do.

ADD COMMENTlink written 7.3 years ago by Istvan Albert ♦♦ 81k
0
gravatar for swbarnes2
5.9 years ago by
swbarnes26.1k
United States
swbarnes26.1k wrote:

One more possibility; assuming you have Illumina reads, use grep to grep out just the reads from a particular tile or two, preferably tiles not on the edge. That subset would have a slightly higher quality than the whole dataset, which includes reads on the edge of the flowcell, but that's probably a good thing.

ADD COMMENTlink written 5.9 years ago by swbarnes26.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1391 users visited in the last hour