Modify chromosome column in bam file using pysam fetch
1
0
Entering edit mode
2.0 years ago
Linda ▴ 60

I'm trying to modify the header and chromosome to include the file prefix and output this to a new bam file. I was going to do this with sed but would rather do it with pysam if it is possible.

A line of my bam file is as follows:

GWNJ-0901:658:GW2006263225th:6:1101:12824:2610  99      NC_011993.1_Escherichia_coli_LF82_complete_genome_length_4773108        2056740 42      27M     =       2056869 150     CGGCTGCACGGGCGAAGTTTCCGCCGC     FJ-AJAJJJ<AAFJJJ<J<JFJ<-A7A    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:27 YS:i:-3 YT:Z:CP


Where I want to access the 3rd column, chromosome colum I think it is called, and concat a prefix to it:

prefix_NC_011993.1_Escherichia_coli_LF82_complete_genome_length_4773108


I'm having trouble writing out this amended column to an output file using fetch:

                for read in input_bam.fetch(reference=species):

with pysam.AlignmentFile(full_output_path, "w",template=input_bam) as outf:
a = pysam.AlignedSegment()
outf.write(a)

SyntaxError: cannot assign to function call


How can I write this amended chromosome column to an output file using pysam? Many thanks!

pysam • 1.4k views
0
Entering edit mode

Linda : This is a specific question so don't use Forum tag. Forum posts are generally open-ended discussions that may have more than one point of view.

0
Entering edit mode

My bad. This is my first question. Hopefully it's updated now.

0
Entering edit mode
1
Entering edit mode
2.0 years ago
David Parry ▴ 120

The reason you receive that SyntaxError is because you are using a.get_reference_name(read.reference_id) = prefixed_chrom when you really want to be using a.reference_name = prefixed_chrom. Also, get_reference_name method is a method of the AlignmentFile or AlignmentHeader objects, not the AlignedSegment.

Another issue with your code is that you are opening your output file everytime you parse a read from your input file, so you will end up continuosly overwriting your output file resulting in a final BAM with only one read.

Finally, if you want to use pysam for this you will have to define the name of all your new contigs in the header of your output file before you create any alignments. Otherwise pysam will refuse to write a read with an unknown contig name.

The code below should approximate what you want to do:

# edit the sequence names for your output header
prefix = 'prefix'
seq['SN'] = prefix + '_' + seq['SN']
# create output BAM with newly defined header
prefixed_chrom = prefix + '_' + read.reference_name
a.reference_name = prefixed_chrom
outf.write(a)

1
Entering edit mode

It works thank you so much! I was silly to think it would be more efficient than a sed solution but at least I know it's possible with pysam now. Your dictionary header code should be included in pysam's documentation for writing to files.

0
Entering edit mode

Great - glad it works! While it needs more code than sed hopefully it should be faster and more CPU efficient than decompressing, piping to sed and recompressing large BAMs. Also, using pysam means you're less likely to inadvertently bork your output files, so hopefully worth the extra code.

Yeah, the lack of documentation does make working with pysam a bit of an ordeal sometimes, but I guess noone likes writing docs.