Preprocessing the BAM for GATK
10.6 years ago

The GATK often requires that the BAM header contain some extra information (read group ID, read group library, SO:coordinate, ...).

Which commands should I run after bwa so that my SAM/BAM files (one sample per file) are valid for the GATK?

For example, I know I can use Picard AddOrReplaceReadGroups to fix some fields in the SAM header, but I would rather insert an awk command into the pipeline after bwa sampe.

Thanks


Don't you trust the Picard tools? They usually do a good job! Or do you want to speed things up?

If I am not mistaken, you would need to change the RG tag of each read as well.

10.6 years ago
lh3 33k

Try:

bwa sampe -r '@RG\tID:foo\tSM:bar\tLB:foobar'


For "SO:coordinate", you may sort with Picard, or sort with samtools and reheader later.
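
Since an awk step was asked for, here is a minimal sketch of the header part of "reheader later". It only edits header text: it rewrites (or creates) the @HD line so that it says SO:coordinate. The function name set_coordinate_sorted is hypothetical, and the VN:1.0 value used when no @HD line exists is an assumption. It is only appropriate once the reads really have been sorted by coordinate.

```shell
# Hypothetical helper: mark a SAM header (stdin -> stdout) as coordinate-sorted.
# It does NOT sort anything; run it only on the header of an already sorted file.
set_coordinate_sorted() {
    awk '
        BEGIN { FS = OFS = "\t" }
        # Existing @HD line: replace the SO field, or append one if it is missing.
        $1 == "@HD" {
            found = 0
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { $i = "SO:coordinate"; found = 1 }
            if (!found) $0 = $0 OFS "SO:coordinate"
            print
            next
        }
        # No @HD line at the top: create one (VN:1.0 is an assumed version).
        NR == 1 { print "@HD", "VN:1.0", "SO:coordinate" }
        # Every other header line passes through unchanged.
        { print }
    '
}
```

One possible use, with placeholder file names: samtools view -H sorted.bam | set_coordinate_sorted > header.sam, followed by samtools reheader header.sam sorted.bam > fixed.bam.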


Hi, is the '-r' option of the sampe command actually working? My command line looks like this:

bwa sampe -P ucsc.hg19 -r '@RG\tID:foo\tSM:bar' $FASTQPATH"MAP1.sai" $FASTQPATH"MAP2.sai" $TRIM1fq $TRIM2fq > $FASTQPATH"MAP.sam"

but I always get: sampe: invalid option -- 'r'


Use bwa 0.6.2 or 0.5.10.


Thanks! (+1) Will '-r' be supported in the future?

10.6 years ago
Rok ▴ 190

Using Picard is a good idea; the only problem seems to be that you need to create some temporary files, since it does not support Unix pipes.

Another possibility is to write a script to do it, but it is going to be more complex than a one-liner in awk. It also depends on how you want to assign your reads to read groups: do you want all the reads in the same read group, or do you want to add multiple groups based on additional information (lane, clusters)? Adding such information can prove useful in downstream analysis, for example if you use the Genome Analysis Toolkit for quality score recalibration.

For the script part, you first need to add a line to the header of the file. This line describes the read group, like this:

@RG ID:R-0  PL:Illumina PU:0    LB:R    SM:GM12878_GABP


Then, for each read belonging to this group, you need to append the following to the end of the line (be careful, fields are tab-delimited):

RG:Z:R-0


A simple script for this probably already exists somewhere, but writing it from scratch shouldn't be difficult.
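
For the simplest case (all reads in one read group), the two steps above could be sketched in awk roughly like this. The function name add_read_group and its three arguments are made up for illustration. It assumes a SAM stream on stdin whose header and reads contain no read group information yet; it does not check for or replace existing @RG lines or RG tags.

```shell
# Hypothetical helper: add a single read group to a SAM stream (stdin -> stdout).
# Arguments: $1 = read group ID, $2 = sample (SM), $3 = library (LB).
# Assumes the input has no @RG header lines and no RG:Z: tags yet.
add_read_group() {
    awk -v id="$1" -v sm="$2" -v lb="$3" '
        BEGIN { FS = OFS = "\t"; done = 0 }
        # Header lines start with "@": pass them through unchanged.
        /^@/ { print; next }
        # First alignment line: emit the @RG header line just before it.
        !done { print "@RG", "ID:" id, "SM:" sm, "LB:" lb; done = 1 }
        # Append the read group tag as an extra tab-delimited field.
        { print $0, "RG:Z:" id }
        # Header-only input: still emit the @RG line at the end of the header.
        END { if (!done) print "@RG", "ID:" id, "SM:" sm, "LB:" lb }
    '
}
```

In a pipeline this would sit right after the aligner, e.g. bwa sampe ... | add_read_group R-0 GM12878_GABP R, with the output going to whatever converts SAM to BAM. Splitting reads into several groups by lane would need extra logic to derive the group from each read.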
