Preprocessing The Bam For Gatk
2
2
Entering edit mode
9.4 years ago

The GATK often requires that the BAM contains some extra informations in the header ( Group ID, Group Library, SO:coordinate...).

Which commands I should invoke after bwa so my SAM/BAM files (1 sample/file) are valid for the GATK ?

For example, I know I can use picard AddOrReplaceReadGroups to fix some fields in the SAM/header, but I would rather like to insert a awk command in the pipeline after bwa sampe.

Thanks

gatk sam bam • 4.6k views
ADD COMMENT
1
Entering edit mode

Don't you trust the picard tools? They usually do a good job! Or do you want to speed up things?

ADD REPLY
0
Entering edit mode

AddOrReplaceReadGroups just adds an header isn't it ? why should I use it when I can insert some headers on the fly ?

ADD REPLY
0
Entering edit mode

If I am not mistaken, you would need to change the RG tag of each read as well.

ADD REPLY
4
Entering edit mode
9.4 years ago
lh3 32k

Try:

bwa sampe -r '@RG\tID:foo\tSM:bar\tLB:foobar'

For "SO:coordinate", you may sort with Picard, or sort with samtools and reheader later.

ADD COMMENT
0
Entering edit mode

Hi, is actually workimg the '-r' option in the sampe command ? My cmdline is like this: bwa sampe -P ucsc.hg19 -r '@RG\tID:foo\tSM:bar' $FASTQPATH"MAP1.sai" $FASTQPATH"MAP2.sai" $TRIM1fq $TRIM2fq > $FASTQPATH"MAP.sam" but I always get: sampe: invalid option -- 'r'

ADD REPLY
1
Entering edit mode

Use 0.6.2/0.5.10.

ADD REPLY
0
Entering edit mode

thanks ! (+1) '-r' will be supported in future ?

ADD REPLY
2
Entering edit mode
9.4 years ago
Rok ▴ 190

Using Picard is a good idea, the only problem it seems you need to create some temporary files since it does not support using unix pipes.

One of the possibilities is also writing a script to do it, but it is going to be more complex than just one liner in awk. It also depends on how you want to sort your reads, do you want all the reads in the same read group, or do you want to add multiple groups based on additional information (lane, clusters). Adding such information can prove useful in downstream analysis if you're using Genome Analysis Toolkit for quality score recalibration or something like that.

For the script part you first need to add an additional line to the header of the file. This line describes the read group like this:

@RG ID:R-0  PL:Illumina PU:0    LB:R    SM:GM12878_GABP

Than for each read belonging to this group you need to add to the end of line (be careful, cells are tab delimited):

RG:Z:R-0

Some easy script for this probably already exists somewhere, but also rewriting it anew shouldn't be a big difficulty.

ADD COMMENT
1
Entering edit mode
ADD REPLY

Login before adding your answer.

Traffic: 1383 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6