Question: Split a BAM file containing multiple samples
martyferr90 (30) wrote 5 days ago:

Hi all! I have a little problem: I have 1 BAM file (about 44 GB) and it contains the reads from 11 different samples. I also have 2 txt files with sample names and a lot of tab-delimited numbers.

How can I split this single BAM file into 11 different BAM files?

Is it correct to use the following code? samtools view -bhR readids_for_sample_A.txt File.bam > File_A.bam
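
A note on the command above: in samtools view, -R expects a file of read-group IDs (one per line), not read names; as far as I know, newer samtools releases provide -N for filtering by read name. A minimal sketch of a sanity check, assuming the list is meant to hold read-group IDs and that samtools is on the PATH. The helper name missing_rgs is hypothetical; it prints every entry of the list that is not declared as an @RG ID in the header it reads from stdin.

```shell
# Print entries of an ID list (one per line) that are NOT declared as @RG IDs
# in the SAM header supplied on stdin. No output means every entry matches
# a read group, so the list is plausible input for "samtools view -R".
missing_rgs() {
  list=$1
  awk -F'\t' '
    # First input (the list file): remember every non-empty entry.
    FILENAME == ARGV[1] { if ($0 != "") want[$0] = 1; next }
    # Second input (header on stdin): drop entries that match an @RG ID.
    /^@RG/ {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^ID:/) delete want[substr($i, 4)]
    }
    END { for (id in want) print id }' "$list" -
}

# Usage (file names taken from the question):
#   samtools view -H File.bam | missing_rgs readids_for_sample_A.txt
```

If this prints anything, the file probably holds read names (or something else) rather than read-group IDs, and -R would select no reads for those entries.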

modified 4 days ago • written 5 days ago by martyferr90 (30)

Do you have @RG tags in your BAM? See: Split a multisample bam using RG tag information

written 5 days ago by genomax (39k)

I'm not sure whether my txt files contain the RG tags; this is the information in my files:

sample1.sorted.bam    sample2.sorted.bam    (and 9 others)
14578326  10905678  9856227   14119725  12330675  13952835  12191130  13570563  43751694  6531804   10925343
14551023  10883187  9835887   14095128  12308196  13921506  12160806  13543377  43670814  6517218   10904049
1176495   887865    802050    1150821   1008891   1140798   988629    1106853   3577497   529143    882285
1236009   929736    839184    1204938   1046034   1187097   1032639   1157844   3716544   559122    929523
1025331   762015    688068    988872    859437    982050    851715    952365    3046440   464532    763785
944853    701523    631413    901983    785562    897543    778584    864981    2760462   423342    700155
912696    683217    619527    883587    767157    875331    761220    852915    2718768   413766    686181
878742    657891    596466    852810    743709    848145    737712    819984    2622840   401040    663999
766929    569241    515904    740352    644226    736128    638103    712884    2287110   341982    574176

Any ideas? Could these work like RG tags? How can I produce 11 separate files?

modified 4 days ago by genomax (39k) • written 4 days ago by martyferr90 (30)

Could you run samtools view input.bam | head -1 and post the results?

written 4 days ago by Gabriel R. (2.2k)

This is the output:

L7IZC:01332:11594   4   *   0   0   *   *   0   0TAGAGAGTACGATCTCAGGTTTCAGGGTTATTTGACTACTACCTAGCTCAAGTCTTGAGCCACCATTACGTGTGCTAGAAAGGGTTACTAACCTCTGCCGAAGGGCTATAATGCTTACTGTAGAATTCTACTTGTCTATAGGATAAAGCATGATAATGGATGGTGGTAATTGCTCA   <;;<?<::::;;;;;?>>6==4<;;;2=5:;;/98::;;;7/*//////7499::7<=;<655.486899:88//.7755/59/616955.56;;:::5996::39998:6:=985::99998848599::;7<<<<<;;:7;<@@3///:::<A5<<5;;;5;:578/6278744    ZP:B:f,0.0106096,0.00442751,0.00104526  ZG:i:317    ZB:i:30 ZC:B:i,317,317,1,0  ZA:i:176    ZM:B:s,318,-8,300,-16,-14,306,-4,268,260,24,30,34,284,254,28,262,50,-14,272,228,64,-2,208,80,200,-14,-2,220,250,10,236,16,-12,240,4,22,222,-2,288,-20,250,246,44,2,242,478,-18,50,736,242,10,224,40,32,746,4,450,36,10,270,16,-16,6,-2,768,48,10,180,24,214,208,8,242,-16,8,-14,220,-14,218,-2,266,2,8,248,62,560,26,-2,242,-2,18,290,36,14,134,160,174,44,198,2,-10,500,-8,220,240,222,462,240,282,252,478,244,52,538,12,168,466,32,46,236,-2,52,26,28,326,36,238,84,184,72,6,216,278,2,74,222,16,306,196,32,200,204,12,684,14,10,646,82,372,46,-2,254,16,34,30,36,254,18,48,-14,288,560,446,84,268,126,236,8,258,2,-20,184,102,34,486,12,84,26,176,464,46,42,640,54,74,36,26,42,302,-12,-4,6,242,334,46,90,248,414,104,26,232,14,42,182,82,114,206,120,460,30,28,228,50,192,40,26,228,186,192,206,32,-10,178,28,112,454,94,34,414,52,202,76,280,0,36,18,272,52,210,40,488,58,184,6,212,202,42,84,246,32,12,258,48,22,88,10,240,264,60,372,84,244,14,54,202,26,32,-16,624,186,98,178,276,26,162,274,234,-22,94,444,270,458,6,276,-4,30,100,50,208,108,44,426,260,28,6,396,214,18,8,28,356,62,-16,84,488,42,204,144,134,236,148,74,214,154,44,40,196,292,72,2,282,94,202,32,108,212,450,206,88,112,22,28,334    ZF:i:8  RG:Z:L7IZC  PG:Z:bc
modified 4 days ago by genomax (39k) • written 4 days ago by martyferr90 (30)

ok, so for this read, the read group is l7izc? How come rg is not capitalized? How was demultiplexing done?

modified 4 days ago • written 4 days ago by Gabriel R. (2.2k)

I tried converting my BAM file to a SAM file to understand it better. The complete header is:

@HD VN:1.4  GO:none SO:coordinate
@RG ID:L7IZC    PL:IONTORRENT   PU:s5/540   FO:TACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGADS:IONA Test S5 Run  for use with Traceability Worksheet DT:2017-12-02T15:32:17+0100 SM:Sample_1 KS:TCAG CN:S5TorrentServerVM/S5-0318
@RG ID:L7IZC.1  PL:IONTORRENT   PU:s5/540   FO:TACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGADS:IONA Test S5 Run  for use with Traceability Worksheet DT:2017-12-02T16:17:34+0100 SM:Sample_1 KS:TCAG CN:S5TorrentServerVM/S5-0318
@RG ID:L7IZC.10 PL:IONTORRENT   PU:s5/540   FO:TACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGADS:IONA Test S5 Run  for use with Traceability Worksheet DT:2017-12-02T21:44:54+0100 SM:Sample_1 KS:TCAG CN:S5TorrentServerVM/S5-0318

[...]

So, as I understand it, I have 94 distinct IDs (from the complete SAM file), all from 1 sample (SM, right?), but I know that I have only 11 samples. Am I right?
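
The counts mentioned here (94 IDs, one SM value) can be checked directly from the header rather than by eye. A minimal sketch, assuming samtools is on the PATH and that header fields are tab-separated, as they are in a real SAM header; rg_summary is a hypothetical helper name and input.bam is a placeholder.

```shell
# Summarise the @RG lines of a SAM header read from stdin:
# the number of read groups and how many of them declare each SM (sample) value.
rg_summary() {
  awk -F'\t' '
    /^@RG/ {
      n++
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SM:/) sm[substr($i, 4)]++
    }
    END {
      printf "read_groups\t%d\n", n
      for (s in sm) printf "SM\t%s\t%d\n", s, sm[s]
    }'
}

# Usage:
#   samtools view -H input.bam | rg_summary
```

If only a single SM line comes back, the header itself carries no per-sample information, whatever the number of read groups.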

written 4 days ago by martyferr90 (30)

Are those chromosomes in the ID by any chance? i.e. samples split into chromosomes?

written 4 days ago by genomax (39k)

If so, should I not have 24*11 IDs?

written 4 days ago by martyferr90 (30)

So one would think. Have you looked through the collection of 94 to see if there is a pattern consistent with all?

written 4 days ago by genomax (39k)

Every RG line looks like this:

@RG ID:L7IZC.1  PL:IONTORRENT   PU:s5/540   FO:TACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGA DS:IONA Test S5 Run  for use with Traceability WorksheetDT:2017-12-02T16:17:34+0100 SM:Sample_1 KS:TCAG`    CN:S5TorrentServerVM/S5-0318

The only thing that changes is the ID. The HD line is:

@HD VN:1.4  GO:none SO:coordinate

Any idea?

modified 4 days ago • written 4 days ago by martyferr90 (30)

Then the program I mentioned should work and give you 1 bam file per sample while going through the bam file once.

written 4 days ago by Gabriel R. (2.2k)

OK, now I have 94 BAM files, but I have 11 samples. Any idea how I can get 1 file per sample?

modified 4 days ago • written 4 days ago by martyferr90 (30)

The program creates one file per RG in the header. Can you run 'ls -al' in your directory?

written 4 days ago by Gabriel R. (2.2k)

This is the 'ls -al' output.

-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.10.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.11.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.12.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.13.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.9.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.A.bam
-rw-r--r-- 1 user user 46076502142 dic  7 16:49 out.L7IZC.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.B.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.C.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.D.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.E.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.F.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.G.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.H.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.I.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.J.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.K.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.L.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.M.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.N.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.O.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.P.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.Q.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.R.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.S.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.T.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.U.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.V.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.W.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.X.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.Y.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.Z.bam
-rw-r--r-- 1 user user 46076511182 dic  6 10:25 R_2017_12_02_08_20_47_user_S5-0318-32-IONA_Test_S5_-_Traceability_Worksheet_-01DIC2017_Sample_1_Auto_user_S5-0318-32-IONA_Test_S5_-_Traceability_Worksheet_-01DIC2017_Sample_1_183.basecaller.bam
-rw-r--r-- 1 user user 25809747968 dic  7 11:44 R_2017_12_02_08_20_47_user_S5-0318-32-IONA_Test_S5_-_Traceability_Worksheet_-01DIC2017_Sample_1_Auto_user_S5-0318-32-IONA_Test_S5_-_Traceability_Worksheet_-01DIC2017_Sample_1_183.basecaller.sam
written 4 days ago by martyferr90 (30)

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

Use Submit Answers only for new answers to original question.

modified 4 days ago • written 4 days ago by genomax (39k)

Those do not look like Chromosome names after all and it does not look like the samples were split.
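
The 'ls -al' listing points the same way: one output file is about 46 GB and every other one is 3092 bytes, which presumably means header only. One way to confirm that nearly all reads carry the same RG tag would be to tally records per tag in a single pass. A minimal sketch, assuming samtools is on the PATH; count_by_rg is a hypothetical helper name and input.bam is a placeholder.

```shell
# Count SAM records per RG:Z: tag. Records are read from stdin, one per line,
# with tab-separated fields; optional tags start at field 12.
# Records without an RG tag are counted under "(none)".
count_by_rg() {
  awk -F'\t' '
    {
      rg = "(none)"
      for (i = 12; i <= NF; i++)
        if ($i ~ /^RG:Z:/) { rg = substr($i, 6); break }
      count[rg]++
    }
    END { for (r in count) printf "%s\t%d\n", r, count[r] }' | LC_ALL=C sort
}

# Usage (one full pass over the file, so expect it to take a while on 44 GB):
#   samtools view input.bam | count_by_rg
```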

modified 4 days ago • written 4 days ago by genomax (39k)

Yep, you're right! It's a strange output: I have 1 big BAM file and the others are very small. This is all the information I have. Is there any way to split this file?

written 4 days ago by martyferr90 (30)

Was this data produced by Torrent Suite? Perhaps you can export individual samples from there?

written 4 days ago by genomax (39k)

Yes, this data was produced by Torrent Suite. Honestly, I don't know; I only know that this file comes from a specific analysis with a specific workflow. I was asked whether I could analyze this file, but my analysis needs 1 file per sample. I tried to analyze the whole file as a single unit, but I can't. Is there perhaps a script, R package or similar for analyzing a file like this, in particular for aneuploidy research?

written 3 days ago by martyferr90 (30)
Hussain Ather (500), National Institutes of Health, Bethesda, MD, wrote 5 days ago:

What you've written should work.

written 5 days ago by Hussain Ather (500)
Gabriel R. (2.2k), Center for Geogenetik Københavns Universitet, wrote 5 days ago:

If you want one BAM file per sample and have RG tags, you can use my little program here: https://github.com/grenaud/libbam/blob/master/splitByRG.cpp

Otherwise, you can just iterate over each RG using samtools view -r [rg], but then you go over every record 11 times (once per RG).
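
The per-RG iteration could look like the sketch below. It assumes samtools is on the PATH; rg_ids and split_by_rg are hypothetical helper names, and the out.<RG>.bam naming is only an assumption chosen for illustration. The whole input is read once per read group, so this is the slow route on a large file.

```shell
# Print the ID of every @RG line in a SAM header read from stdin.
rg_ids() {
  awk -F'\t' '/^@RG/ {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^ID:/) print substr($i, 4)
  }'
}

# Write one BAM per read group declared in the header of the given BAM.
# Each samtools call scans the entire input file again.
split_by_rg() {
  bam=$1
  samtools view -H "$bam" | rg_ids | while IFS= read -r rg; do
    samtools view -b -r "$rg" -o "out.$rg.bam" "$bam"
  done
}

# Usage:
#   split_by_rg input.bam
```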

written 5 days ago by Gabriel R. (2.2k)