Question: Preprocessing The Bam For Gatk
2
gravatar for Pierre Lindenbaum
8.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum128k wrote:

The GATK often requires that the BAM contains some extra informations in the header ( Group ID, Group Library, SO:coordinate...).

Which commands I should invoke after bwa so my SAM/BAM files (1 sample/file) are valid for the GATK ?

For example, I know I can use picard AddOrReplaceReadGroups to fix some fields in the SAM/header, but I would rather like to insert a awk command in the pipeline after bwa sampe.

Thanks

gatk bam sam • 4.2k views
ADD COMMENTlink written 8.2 years ago by Pierre Lindenbaum128k
1

Don't you trust the picard tools? They usually do a good job! Or do you want to speed up things?

ADD REPLYlink written 8.2 years ago by Christof Winter990

AddOrReplaceReadGroups just adds an header isn't it ? why should I use it when I can insert some headers on the fly ?

ADD REPLYlink written 8.2 years ago by Pierre Lindenbaum128k

If I am not mistaken, you would need to change the RG tag of each read as well.

ADD REPLYlink written 8.2 years ago by Christof Winter990
4
gravatar for lh3
8.2 years ago by
lh332k
United States
lh332k wrote:

Try:

bwa sampe -r '@RG\tID:foo\tSM:bar\tLB:foobar'

For "SO:coordinate", you may sort with Picard, or sort with samtools and reheader later.

ADD COMMENTlink written 8.2 years ago by lh332k

Hi, is actually workimg the '-r' option in the sampe command ? My cmdline is like this: bwa sampe -P ucsc.hg19 -r '@RG\tID:foo\tSM:bar' $FASTQPATH"MAP1.sai" $FASTQPATH"MAP2.sai" $TRIM1fq $TRIM2fq > $FASTQPATH"MAP.sam" but I always get: sampe: invalid option -- 'r'

ADD REPLYlink written 7.2 years ago by ff.cc.cc1.3k
1

Use 0.6.2/0.5.10.

ADD REPLYlink written 7.2 years ago by lh332k

thanks ! (+1) '-r' will be supported in future ?

ADD REPLYlink written 7.2 years ago by ff.cc.cc1.3k
2
gravatar for Rok
8.2 years ago by
Rok190
Trondheim, Norway
Rok190 wrote:

Using Picard is a good idea, the only problem it seems you need to create some temporary files since it does not support using unix pipes.

One of the possibilities is also writing a script to do it, but it is going to be more complex than just one liner in awk. It also depends on how you want to sort your reads, do you want all the reads in the same read group, or do you want to add multiple groups based on additional information (lane, clusters). Adding such information can prove useful in downstream analysis if you're using Genome Analysis Toolkit for quality score recalibration or something like that.

For the script part you first need to add an additional line to the header of the file. This line describes the read group like this:

@RG ID:R-0  PL:Illumina PU:0    LB:R    SM:GM12878_GABP

Than for each read belonging to this group you need to add to the end of line (be careful, cells are tab delimited):

RG:Z:R-0

Some easy script for this probably already exists somewhere, but also rewriting it anew shouldn't be a big difficulty.

ADD COMMENTlink written 8.2 years ago by Rok190
1

FYI: Most picard modules do support pipes: use /dev/stdin and /dev/stdout. See http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page#Q:_Can_Picard_programs_read_from_stdin_and_write_to_stdout.3F

ADD REPLYlink written 8.0 years ago by Andreas2.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1521 users visited in the last hour