Question: split reads for different lanes in BAM files
3.1 years ago by
SOHAIL280
Beijing Institute of Genomics, CAS.
SOHAIL280 wrote:

Hi everyone, I have per-sample aligned BAM files for already published genomes of human populations. In the header section I saw an RG record like this:

               @RG     ID:LP6005441-DNA_A09    SM:LP6005441-DNA_A09

It carries only the RG ID and SM fields. But to reproduce the results with the GATK Best Practices, I have to assign the RG information correctly.

Each BAM I have represents a single sample from a single library prep, but the library was run on multiple lanes, as indicated by the read names, e.g.:

               HS2000-630_102:4:2115:1889:70619
               HS2000-630_102:3:2311:13151:38215
               HS2000-630_102:2:2315:18670:41735

So, to assign RG information that is unique to the group of reads from each lane, I want to split each per-sample BAM file into multiple BAMs by flowcell lane. Then I can replace the RG information and apply MarkDuplicates and BQSR correctly.
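A minimal sketch of how a lane-level read group could be derived from read names like the ones above. It assumes the five-field layout `run:lane:tile:x:y` seen in this post; newer Illumina names have seven fields and would need a different index. The function names, the `ID`/`PU` naming scheme, and the use of the run prefix as a stand-in for the flowcell barcode are illustrative assumptions, not something the BAM header provides.

```python
def lane_key(read_name):
    """Return (run_prefix, lane) from a name like 'HS2000-630_102:4:2115:1889:70619'.

    Assumes the five-field layout run:lane:tile:x:y shown in the question.
    """
    fields = read_name.split(":")
    if len(fields) != 5:
        raise ValueError(f"unexpected read name layout: {read_name!r}")
    return fields[0], fields[1]


def read_group_for(read_name, sample, library=None, platform="ILLUMINA"):
    """Build a hypothetical per-lane @RG record as a dict of SAM tag -> value."""
    run, lane = lane_key(read_name)
    return {
        "ID": f"{sample}.{run}.{lane}",   # unique per sample and lane
        "PU": f"{run}.{lane}",            # run prefix stands in for the flowcell barcode
        "SM": sample,
        "LB": library or sample,          # single library prep per sample (assumption)
        "PL": platform,
    }


if __name__ == "__main__":
    names = [
        "HS2000-630_102:4:2115:1889:70619",
        "HS2000-630_102:3:2311:13151:38215",
        "HS2000-630_102:2:2315:18670:41735",
    ]
    for name in names:
        rg = read_group_for(name, "LP6005441-DNA_A09")
        print("@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in rg.items()))
```

Keeping `SM` and `LB` identical across lanes while `ID` and `PU` differ is what lets duplicate marking work per library and recalibration work per lane.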

I am new to this. Could you please suggest a tool or script that can do this?

Thanks in advance!


Hi Pierre, does this work on BAM files?

ADD REPLYlink written 3.1 years ago by SOHAIL280

Yes, and it writes BAM too.

ADD REPLYlink written 3.1 years ago by Pierre Lindenbaum122k
3.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

I wrote a tool for a similar job in the thread "Advice On Adding Readgroups".

see https://github.com/lindenb/jvarkit/wiki/Biostar78400

$ cat input.sam 
@SQ SN:ref  LN:45
@SQ SN:ref2 LN:40
HS2000-1259_127:1:1210:15640:52255  163 ref 7   30  8M4I4M1D3M  =   37  39  TTAGATAAAGAGGATACTG *   XX:B:S,12561,2,20,112
HS2000-1259_128:2:1210:15640:52255  0   ref 9   30  1S2I6M1P1I1P1I4M2I  *   0   0   AAAAGATAAGGGATAAA   *

$ java -jar dist/biostar78400.jar \
    -x groups.xml \
    input.sam


@HD VN:1.4  SO:unsorted
@SQ SN:ref  LN:45
@SQ SN:ref2 LN:40
@RG ID:X1   PL:P1   PU:P1   LB:L1   DS:blabla   SM:S1   CN:C1
@RG ID:x2   PL:P2   PU:P2   LB:L2   DS:blabla   SM:S2   CN:C1
HS2000-1259_127:1:1210:15640:52255  163 ref 7   30  8M4I4M1D3M  =   37  39  TTAGATAAAGAGGATACTG *   RG:Z:X1 XX:B:S,12561,2,20,112
HS2000-1259_128:2:1210:15640:52255  0   ref 9   30  1S2I6M1P1I1P1I4M2I  *   0   0   AAAAGATAAGGGATAAA   *   RG:Z:x2
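
For illustration only, the same kind of rewrite can be sketched on plain SAM text in Python. This is not the jvarkit implementation and does not read `groups.xml`; it is a standalone sketch that assumes tab-separated SAM lines, read names whose first two colon-separated fields are the run prefix and the lane, and a hypothetical `sample.run.lane` naming scheme for the read group ID.

```python
def tag_sam_by_lane(lines, sample, platform="ILLUMINA"):
    """Return SAM lines with one @RG per lane and an RG:Z tag on every read.

    Existing @RG header lines and RG:Z tags are replaced. All records are
    held in memory, so this suits a small test file rather than a whole genome.
    """
    header, records, units = [], [], {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Keep every header line except the old read groups.
            if not line.startswith("@RG\t"):
                header.append(line)
            continue
        cols = line.split("\t")
        # Assumed read name layout: run:lane:tile:x:y
        run, lane = cols[0].split(":")[:2]
        rg_id = f"{sample}.{run}.{lane}"
        units[rg_id] = f"{run}.{lane}"
        # The first 11 columns are mandatory; drop any old RG tag after them.
        optional = [c for c in cols[11:] if not c.startswith("RG:Z:")]
        records.append("\t".join(cols[:11] + optional + [f"RG:Z:{rg_id}"]))
    rg_lines = [
        f"@RG\tID:{rg_id}\tPL:{platform}\tPU:{unit}\tLB:{sample}\tSM:{sample}"
        for rg_id, unit in sorted(units.items())
    ]
    return header + rg_lines + records


if __name__ == "__main__":
    sam = [
        "@SQ\tSN:ref\tLN:45",
        "@RG\tID:S1\tSM:S1",
        "HS2000-630_102:4:2115:1889:70619\t0\tref\t9\t30\t8M\t*\t0\t0\tAAAAGATA\t*\tRG:Z:S1",
        "HS2000-630_102:3:2311:13151:38215\t0\tref\t7\t30\t8M\t*\t0\t0\tTTAGATAA\t*",
    ]
    for out in tag_sam_by_lane(sam, "S1"):
        print(out)
```

Once every read carries a lane-specific RG tag, the reads can stay in one file, since downstream tools work from the tag rather than from the file a read sits in.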
3.1 years ago by
SOHAIL280
Beijing Institute of Genomics, CAS.
SOHAIL280 wrote:

Problem solved! Thanks, Pierre, for your support on GitHub and here as well. :)
