Question

splitting a SAM file

1

Entering edit mode

5.0 years ago

Ric ▴ 430

I wrote the below script which takes a SAM file as input and creates for each contig a SAM and FASTA file:

#!/usr/bin/env python3

import click
import os

@click.command()
@click.option('--sam', help="SAM files", required=True)
@click.option('--output', help="Output directory", required=True)
def retrieve_contig_reads(sam, output):

    with open(sam) as align:
        for line in align:
            try:
                parts = line.rstrip().split('\t')
                illumina_id = parts[0]
                Illumina_seq= parts[9]
                contig_name = parts[2]

                with open(os.path.join(output, contig_name + ".fasta"), 'a') as fasta:
                    fasta.write(">" + illumina_id + '\n')
                    fasta.write(Illumina_seq + '\n')

                with open(os.path.join(output, contig_name + ".sam"), 'a') as sam_line:
                    sam_line.write(line + '\n')


            except IndexError:
                continue

if __name__ == '__main__':

    retrieve_contig_reads()

Unfortunately, it ran for more than 2 weeks and has still not completed. Is there a faster way to retrieve the alignment and the reads for each contig?

Thank you in advance,

alignment • 2.3k views

ADD COMMENT • link updated 5.0 years ago by WouterDeCoster 47k • written 5.0 years ago by Ric ▴ 430

1

Entering edit mode

your code seems relatively* efficient, no idea why it would run 2 weeks. The only thing might be the file open overhead for writing each line. You can certainly avoid that by sorting the bam/sam file before hand (samtools sort). This will sort all reads by alignment position and is a de facto standard required by many downstream programs anyway. After that, you can essentially open a new file once you see a new contig_name. Also look into the pysam module, this might be helpful, too.

*Edit: putting a bit more emphasis on relative, keeping Pierre's comment below in mind.

ADD REPLY • link 5.0 years ago by Carambakaracho ★ 3.2k

1

Entering edit mode

i'm not a python guy, but unless I'm wrong you're opening a new file (append mode) for every SAM record ? Opening a file costs a lot of I/O.

ADD REPLY • link 5.0 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

5.0 years ago

Carambakaracho ★ 3.2k

Exactly, as stated above with open(file) as f: reopens a file for each line, and removing this will increase performance, but I'd be surprised if this will increase the running time dramatically, like 2 hours instead of 2 weeks. There's something wrong in addition, but my first guess was some sort of I/O bottleneck, too

ADD COMMENT • link 5.0 years ago by Carambakaracho ★ 3.2k

score 6 · Accepted Answer · 2019-04-24

6

Entering edit mode

5.0 years ago

WouterDeCoster 47k

I wouldn't use python for this...

samtools index alignment.bam
samtools idxstats alignment.bam | cut -f 1 | grep -v '*' | parallel -j 8 --bar 'samtools view -o {}.bam alignment.bam {} && samtools fasta {}.bam | gzip > {}.fasta.gz'

Create an index of a your bam file
using idxstats, cut and grep -v get all contigs
use gnu parallel to parallelize processes, adjust -j accordingly to your system
for each contig: use samtools view to get the contig in a bam file, and samtools fasta to get the reads in fasta format

ADD COMMENT • link 5.0 years ago by WouterDeCoster 47k

0

Entering edit mode

I ran this commands:

samtools sort output.gfa1.onlymapped.sam -o output.gfa1.onlymapped.sorted.bam
samtools index output.gfa1.onlymapped.sorted.bam
samtools idxstats output.gfa1.onlymapped.sorted.bam | cut -f 1 | grep -v '*' | parallel -j 8 --bar 'samtools view -o {}.sam output.gfa1.onlymapped.sorted.sam {} && samtools fasta {}.sam > {}.fasta'

Unfortunately, all sam files are empty and no FASTA files have been generated. Please find below the log file:

[bam_sort_core] merging from 370 files and 1 in-memory blocks...
^M^M#   0 sec Backbone_8                                                            ^M0^M0% 0:4947=0s Backbone_8                                                                                                                 [main_samview] random alignment retrieval only works for indexed BAM or CRAM files.

...

99% 4946:1=0s Backbone_4947                                                     [main_samview] random alignment retrieval only works for indexed BAM or CRAM files.
100% 4947:0=0s Backbone_4947

What did I miss?

Thank you in advance,

ADD REPLY • link 5.0 years ago by Ric ▴ 430

0

Entering edit mode

First you have to run parallel with --dryrun and see if the comments which are printed are what you would expect.

Something that you could try (but is not what you originally asked for) is to write to bam in the parallel jobs, and also adjust the second part then which creates the fasta file. You may have to index the bam file first, so you could add that too:

samtools idxstats output.gfa1.onlymapped.sorted.bam | cut -f 1 | grep -v '*' | parallel -j 8 --bar 'samtools view -o {}.bam output.gfa1.onlymapped.sorted.sam {} && samtools index {}.bam && samtools fasta {}.bam > {}.fasta'

ADD REPLY • link 5.0 years ago by WouterDeCoster 47k

0

Entering edit mode

I used your command but the files are still empty. The dryrun function created e.g. this command samtools view -o Backbone_1.bam output.gfa1.onlymapped.sorted.sam Backbone_1 && samtools index Backbone_1.bam && samtools fasta Backbone_1.bam > Backbone_1.fasta. What did I miss?

ADD REPLY • link 5.0 years ago by Ric ▴ 430

0

Entering edit mode

for trouble shooting splite the above command in its elements, for example:

samtools view -o Backbone_1.bam output.gfa1.onlymapped.sorted.sam Backbone_1

I'm pretty sure the Backbone_1 at the end is unnecessary

ADD REPLY • link 5.0 years ago by Carambakaracho ★ 3.2k

0

Entering edit mode

Backbone_1 is a contig name in this case, so it is necessary to split the alignment per chromosome.

ADD REPLY • link 4.9 years ago by WouterDeCoster 47k

0

Entering edit mode

the input in that samtools view is a sam file output.gfa1.onlymapped.sorted.sam. The error you received is random alignment retrieval only works for indexed BAM or CRAM files.. That looks clear to me.

ADD REPLY • link 5.0 years ago by WouterDeCoster 47k