Question

extracting data from single end aligned bam file - chromosome, strand, start and stop from cigar

0

Entering edit mode

11 months ago

Agamemnon ▴ 80

Hello,

I have aligned reads against a reference genome, I have now extracted the unique reads in a bam file, I wanted to ask if there is a program that allows me to extract the name of the read, the start and stop position, as well as the strandedness and the chromosome against which the read aligns?

thanks

bam • 946 views

ADD COMMENT • link updated 11 months ago by Istvan Albert 100k • written 11 months ago by Agamemnon ▴ 80

0

Entering edit mode

Please read the SAM specifications. SAM/BAM is essentially a tab-delimited format, and the documentation will tell you in which columns you find the information. Convert BAM to SAM with samtools view and use any Unix tool (awk, sed, cut) to then select the relevant columns. Can be done via a pipe efficiently.

ADD REPLY • link 11 months ago by ATpoint 81k

0

Entering edit mode

the problem is the stop coordinate, it is not all that easy to get that information

it is one of those endemic bioinformatics things where a commonly needed information like start/stop of an alignment ought to be present ... but it isn't there

BTW I have asked the ChatGPT and lo behold it incorrectly claims that the following works:

Here is an example command that will extract the read name, start and stop position, strandedness, and chromosome for each read in a BAM file called "aligned_reads.bam":
samtools view -h aligned_reads.bam | \
     awk '$1 !~ /^@/{print $1 "\t" $3 "\t" $4 "\t" $4+length($10)-1 "\t" ($2 & 16 ? "-" : "+")}' 
This command will output a tab-separated file with the following columns:

Read name Chromosome Start position (1-based) Stop position (1-based, inclusive of the last base in the read) Strandedness (+ or -)

but the answer is not correct,

the stop coordinate can only be computed by parsing the CIGAR string and moving the position for every M and D operator but not on S, H or I operators

ADD REPLY • link 11 months ago by Istvan Albert 100k

score 3 · Accepted Answer · 2023-05-09

3

Entering edit mode

11 months ago

Istvan Albert 100k

What you could do is use bedtools bamtobed to format the BAM file as a BED file with start/stop coordinates.

the problem still remains that the read name is not present

you might be able to use the UNIX paste command to join the first column of the SAM file over the BED file,

ADD COMMENT • link 11 months ago by Istvan Albert 100k

1

Entering edit mode

the problem still remains that the read name is not present

I can see the read name in column 4 of a file I just converted using bamtobed.

chrIV   462211  462361  M12345:768:000000000-JHFDF:1:2103:22666:7676 1:N:0:ACCGATTA+GCCGTGGC/1  45      +
chrIV   462211  462361  M12345:768:000000000-JHFDF:1:1109:21710:12608 1:N:0:ACCGATTA+GCCGTGGG/1 45      -
chrIV   462212  462362  M12345:768:000000000-JHFDF:1:1110:3933:9109 1:N:0:ACCGATTA+GCCGTGGC/1   45      +
chrIV   462212  462362  M12345:768:000000000-JHFDF:1:1113:17079:20757 1:N:0:ACCGATTA+GCCGTGGC/1 45      +
chrIV   462212  462362  M12345:768:000000000-JHFDF:1:2107:13009:28070 1:N:0:ACCGATTA+GCCGTGGC/2 45      +
chrIV   462215  462365  M12345:768:000000000-JHFDF:1:2101:9787:11405 1:N:0:ACCGATTA+GCCGTGGC/1  45      -
chrIV   462216  462366  M12345:768:000000000-JHFDF:1:1106:20844:11556 1:N:0:ACCGATTA+GCCGTGGC/1 45      +

So perhaps following would work.

$ bedtools bamtobed -i test_sorted.bam | awk -F "\t" '{OFS="\t"}{print $4,$1,$2,$3,$6}' 
M12345:768:000000000-JHFDF:1:2112:13619:12220 1:N:0:ACCGATTA+GCCGTGGC/2 chrII   321102  321252  +
M12345:768:000000000-JHFDF:1:2110:13390:8727 1:N:0:ACCGATTA+GCCGTGGC/2  chrII   435977  436127  +
M12345:768:000000000-JHFDF:1:2114:21240:13267 1:N:0:ACCGATTA+GCCGTGGC/2 chrIV   132378  132528  -

ADD REPLY • link 11 months ago by GenoMax 141k

0

Entering edit mode

Thanks that seems to help, I may have to join the output of here with the bam file to also get the sequence information.

ADD REPLY • link 11 months ago by Agamemnon ▴ 80

0

Entering edit mode

oh, sounds like I remembered it incorrectly,

sounds good, then it is a full solution

ADD REPLY • link 11 months ago by Istvan Albert 100k

score 2 · Accepted Answer · 2023-05-09

In my final solution, I used the original bam file, and a sam file to generate the file I needed.

my code:

bedtools bamtobed -i input.bam| 
awk '{str = ($6 == "+") ? "pos" : "neg"; print $4, $2, $3, $1, str}' OFS='\t' |
awk 'FNR==NR{sam[$1]=$10; next} {print $0, sam[$1]}' input.sam - |
awk '{print $1, $4, $5, $6, $2, $3}' |
awk 'BEGIN{FS=OFS="\t"} {gsub(/[:;-]/,"\t",$1); print }' |
awk '{gsub("pos", "+1"); gsub("neg", "-1"); print}' > output.file