Question

Modify chromosome column in bam file using pysam fetch

0

Entering edit mode

3.1 years ago

Linda ▴ 80

I'm trying to modify the header and chromosome to include the file prefix and output this to a new bam file. I was going to do this with sed but would rather do it with pysam if it is possible.

A line of my bam file is as follows:

GWNJ-0901:658:GW2006263225th:6:1101:12824:2610  99      NC_011993.1_Escherichia_coli_LF82_complete_genome_length_4773108        2056740 42      27M     =       2056869 150     CGGCTGCACGGGCGAAGTTTCCGCCGC     FJ-AJAJJJ<AAFJJJ<J<JFJ<-A7A    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:27 YS:i:-3 YT:Z:CP

Where I want to access the 3rd column, chromosome colum I think it is called, and concat a prefix to it:

prefix_NC_011993.1_Escherichia_coli_LF82_complete_genome_length_4773108

I'm having trouble writing out this amended column to an output file using fetch:

                for read in input_bam.fetch(reference=species):
                    print(input_bam.get_reference_name(read.reference_id)) # Returns chromosome column
                    prefixed_chrom=prefix + '_' +input_bam.get_reference_name(read.reference_id)

                    with pysam.AlignmentFile(full_output_path, "w",template=input_bam) as outf:
                        a = pysam.AlignedSegment()
                        a.query_name = read.query_name
                        a.query_sequence = read.query_sequence
                        a.get_reference_name(read.reference_id) = prefixed_chrom
                        a.flag = read.flag
                        a.reference_id = read.reference_id
                        a.reference_start = read.reference_start
                        a.mapping_quality = read.mapping_quality
                        a.cigar = read.cigar
                        a.next_reference_id = read.next_reference_id
                        a.next_reference_start= read.next_reference_start
                        a.template_length=read.template_length
                        a.query_qualities = read.query_qualities
                        a.tags = read.tags
                        outf.write(a)


SyntaxError: cannot assign to function call

How can I write this amended chromosome column to an output file using pysam? Many thanks!

pysam • 2.1k views

ADD COMMENT • link updated 3.1 years ago by David Parry ▴ 130 • written 3.1 years ago by Linda ▴ 80

0

Entering edit mode

Linda : This is a specific question so don't use Forum tag. Forum posts are generally open-ended discussions that may have more than one point of view.

ADD REPLY • link 3.1 years ago by GenoMax 141k

0

Entering edit mode

My bad. This is my first question. Hopefully it's updated now.

ADD REPLY • link 3.1 years ago by Linda ▴ 80

0

Entering edit mode

see also Bam File: Change Chromosome Notation

ADD REPLY • link 3.1 years ago by Pierre Lindenbaum 161k

score 1 · Accepted Answer · 2021-03-25

The reason you receive that SyntaxError is because you are using a.get_reference_name(read.reference_id) = prefixed_chrom when you really want to be using a.reference_name = prefixed_chrom. Also, get_reference_name method is a method of the AlignmentFile or AlignmentHeader objects, not the AlignedSegment.

Another issue with your code is that you are opening your output file everytime you parse a read from your input file, so you will end up continuosly overwriting your output file resulting in a final BAM with only one read.

Finally, if you want to use pysam for this you will have to define the name of all your new contigs in the header of your output file before you create any alignments. Otherwise pysam will refuse to write a read with an unknown contig name.

The code below should approximate what you want to do:

# edit the sequence names for your output header
prefix = 'prefix'
new_head = input_bam.header.to_dict()
for seq in new_head['SQ']:
    seq['SN'] = prefix + '_' + seq['SN']
# create output BAM with newly defined header
with pysam.AlignmentFile(full_output_path, "w", header=new_head) as outf:
    for read in input_bam.fetch():
        prefixed_chrom = prefix + '_' + read.reference_name
        a = pysam.AlignedSegment(outf.header)
        a.query_name = read.query_name
        a.query_sequence = read.query_sequence
        a.reference_name = prefixed_chrom
        a.flag = read.flag
        a.reference_start = read.reference_start
        a.mapping_quality = read.mapping_quality
        a.cigar = read.cigar
        a.next_reference_id = read.next_reference_id
        a.next_reference_start = read.next_reference_start
        a.template_length = read.template_length
        a.query_qualities = read.query_qualities
        a.tags = read.tags
        outf.write(a)