Question

How to extract specific tag values by key from a SAM/BAM file?

5.7 years ago

yech1990 ▴ 30

One line of the input file:

m54071222/4194368/0_197      4       *       0       255     *       *       0       0       AAGAGGAAGGGGGAGAGAGAGGAGGAGAGGGGGGAAGAGGTTGGGATGGAAAATAGGTGGTTAGAGGGAGAAAGG   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!         np:i:1   qe:i:197        qs:i:0  rq:f:0  BS:Z:ATGGCCAATTGCAGAA    BQ:Z:JJJKKKKKKKKLLLLL  zm:i:4194368    RG:Z:1f1bf15c   sc:A:L  sz:A:N

I want to extract certain tags (BS and BQ in this example) from each read of a SAM file and write them to a new file.

I can do that with pysam, but is it possible to do this with a one-line Linux command? A tool that reads from stdin and writes to stdout would be better.

There is a built-in getTag method in bamtools; however, it is only for filtering reads.

The expected output:

@m54071222/4194368/0_197
ATGGCCAATTGCAGAA 
+
JJJKKKKKKKKLLLLL

sam awk samtools linux bash • 9.9k views

score 2 · Answer 1 · 2018-09-01

5.7 years ago

Pierre Lindenbaum 161k

Using bioalcidaejdk (part of jvarkit): http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar dist/bioalcidaejdk.jar -e \
 'stream().forEach(R->println("@"+R.getReadName()+"\n"+R.getAttribute("BS")+"\n+\n"+R.getAttribute("BQ")));' in.bam

Wow, that is a powerful tool, although it looks like a one-line Python command that wraps pysam...

Reply • 5.7 years ago by yech1990 ▴ 30

that wrap pysam...

It wraps htsjdk, not pysam.

Reply • 5.7 years ago by Pierre Lindenbaum 161k

Still not perfect... If the tag does not exist, the output will be "null".

Reply • 5.1 years ago by yech1990 ▴ 30
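One way to avoid empty or "null" fields is to skip reads that lack either tag. The following is only a sketch in awk, not part of any tool mentioned in this thread: the function name tags_to_fastq is made up for illustration, and it assumes tab-separated SAM text on stdin with both tags stored as type Z.

```shell
# Hypothetical helper: read SAM text on stdin, write FASTQ-style records on stdout.
# Reads that lack BS or BQ are skipped instead of being printed with empty fields.
tags_to_fastq() {
  awk -F'\t' '
    /^@/ { next }                      # skip SAM header lines, if any
    {
      bs = ""; bq = ""; has_bs = 0; has_bq = 0
      for (i = 12; i <= NF; ++i) {     # optional fields start at column 12
        if ($i ~ /^BS:Z:/)      { bs = substr($i, 6); has_bs = 1 }
        else if ($i ~ /^BQ:Z:/) { bq = substr($i, 6); has_bq = 1 }
      }
      if (has_bs && has_bq) print "@" $1 "\n" bs "\n+\n" bq
    }'
}
```

It can sit in a pipe, for example: samtools view in.bam | tags_to_fastq > out.fastq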

score 1 · Answer 2 · 2018-09-01

An awk solution:

samtools view test.bam | awk '{for (i=12; i<=NF; ++i) { if ($i ~ "^BS:|^BQ:"){ split($i, tc, ":"); td[tc[1]] = tc[3]; } }; print "@"$1"\n"td["BS"]"\n+\n"td["BQ"] }'

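Edit: a more robust way, which takes the value with substr instead of splitting on ":" and clears the array for each read:

samtools view test.bam | awk '{split("", td); for (i=12; i<=NF; ++i) { if ($i ~ "^BS:Z:|^BQ:Z:"){ td[substr($i,1,2)] = substr($i,6); } }; print "@"$1"\n"td["BS"]"\n+\n"td["BQ"] }'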
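The two approaches differ when the tag value itself contains a colon, which is a legal quality character. A small sketch with a made-up tag (the helper names value_by_split and value_by_substr are hypothetical) shows the effect:

```shell
# Take the tag value by splitting on ":" and keeping only the third piece.
value_by_split()  { awk '{ split($0, tc, ":"); print tc[3] }'; }

# Take the tag value as everything after the 5-character "XX:T:" prefix.
value_by_substr() { awk '{ print substr($0, 6) }'; }

# Made-up tag whose quality string contains ":".
tag='BQ:Z:JJ:KK'
printf '%s\n' "$tag" | value_by_split     # prints JJ
printf '%s\n' "$tag" | value_by_substr    # prints JJ:KK
```

The split version silently truncates the value at the first colon inside it, so the sequence and quality lengths would no longer match.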

score 1 · Answer 3 · 2018-09-01

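Each input line is turned into one FASTQ record, and records are separated by a blank line. Drop -v ORS="\n\n" if you don't want the blank line. Note that $16 and $17 assume BS and BQ are always the 16th and 17th fields, as they are in this example.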

output:

$ awk -v OFS="\n" -v ORS="\n\n" '{sub(/^BS:Z:/,"",$16); sub(/^BQ:Z:/,"",$17); print "@"$1,"+",$16,$17}' test.sam

@m54071222/4194368/0_197
+
ATGGCCAATTGCAGAA
JJJKKKKKKKKLLLLL

@m54071222/4194368/0_197
+
ATGGCCAATTGCAGAA
JJJKKKKKKKKLLLLL

Input (for testing, I duplicated the lines):

$ cat test.sam 

m54071222/4194368/0_197 4   *   0   255 *   *   0   0   AAGAGGAAGGGGGAGAGAGAGGAGGAGAGGGGGGAAGAGGTTGGGATGGAAAATAGGTGGTTAGAGGGAGAAAGG !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!   np:i:1  qe:i:197    qs:i:0  rq:f:0  BS:Z:ATGGCCAATTGCAGAA   BQ:Z:JJJKKKKKKKKLLLLL   zm:i:4194368    RG:Z:1f1bf15c   sc:A:L  sz:A:N
m54071222/4194368/0_197 4   *   0   255 *   *   0   0   AAGAGGAAGGGGGAGAGAGAGGAGGAGAGGGGGGAAGAGGTTGGGATGGAAAATAGGTGGTTAGAGGGAGAAAGG !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!   np:i:1  qe:i:197    qs:i:0  rq:f:0  BS:Z:ATGGCCAATTGCAGAA   BQ:Z:JJJKKKKKKKKLLLLL   zm:i:4194368    RG:Z:1f1bf15c   sc:A:L  sz:A:N

However, I suggest moving the sequence to the 2nd line and + to the 3rd line so that the output is in (almost) standard FASTQ format. This lets you run FASTQ tools such as seqkit stats on the output. For example:

$ awk -v OFS="\n" -v ORS="\n\n" '{sub(/^BS:Z:/,"",$16); sub(/^BQ:Z:/,"",$17); print "@"$1,$16,"+",$17}' test.sam | seqkit stats

file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
-     FASTQ   DNA          2       32       16       16       16

$ awk -v OFS="\n" -v ORS="\n\n" '{sub(/^BS:Z:/,"",$16); sub(/^BQ:Z:/,"",$17); print "@"$1,$16,"+",$17}' test.sam

@m54071222/4194368/0_197
ATGGCCAATTGCAGAA
+
JJJKKKKKKKKLLLLL

@m54071222/4194368/0_197
ATGGCCAATTGCAGAA
+
JJJKKKKKKKKLLLLL
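Because optional fields may appear in any order, fixed positions like $16 and $17 can break on other files. For the more general question in the title, here is a sketch that looks tags up by key and prints them as a tab-separated table. The function name sam_tags and the "." placeholder for a missing tag are arbitrary choices made for this example, and tab-separated SAM text on stdin is assumed.

```shell
# Hypothetical helper: sam_tags TAG [TAG...]
# Reads SAM text on stdin; prints the read name followed by the requested
# tag values, tab-separated. A missing tag is printed as ".".
sam_tags() {
  awk -F'\t' -v OFS='\t' -v keys="$*" '
    BEGIN { n = split(keys, want, " ") }
    /^@/ { next }                          # skip SAM header lines, if any
    {
      split("", val)                       # clear values left from the previous read
      for (i = 12; i <= NF; ++i)           # optional fields look like XX:T:value
        val[substr($i, 1, 2)] = substr($i, 6)
      line = $1
      for (k = 1; k <= n; ++k)
        line = line OFS ((want[k] in val) ? val[want[k]] : ".")
      print line
    }'
}
```

Example use in a pipe: samtools view in.bam | sam_tags BS BQ > tags.tsv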