Question: splitting a SAM file
1
gravatar for Ric
4 weeks ago by
Ric220
Australia
Ric220 wrote:

I wrote the below script which takes a SAM file as input and creates for each contig a SAM and FASTA file:

#!/usr/bin/env python3

import click
import os

@click.command()
@click.option('--sam', help="SAM files", required=True)
@click.option('--output', help="Output directory", required=True)
def retrieve_contig_reads(sam, output):

    with open(sam) as align:
        for line in align:
            try:
                parts = line.rstrip().split('\t')
                illumina_id = parts[0]
                Illumina_seq= parts[9]
                contig_name = parts[2]

                with open(os.path.join(output, contig_name + ".fasta"), 'a') as fasta:
                    fasta.write(">" + illumina_id + '\n')
                    fasta.write(Illumina_seq + '\n')

                with open(os.path.join(output, contig_name + ".sam"), 'a') as sam_line:
                    sam_line.write(line + '\n')


            except IndexError:
                continue

if __name__ == '__main__':

    retrieve_contig_reads()

Unfortunately, it ran for more than 2 weeks and has still not completed. Is there a faster way to retrieve the alignment and the reads for each contig?

Thank you in advance,

alignment • 185 views
ADD COMMENTlink modified 4 weeks ago by WouterDeCoster39k • written 4 weeks ago by Ric220
1

your code seems relatively* efficient, no idea why it would run 2 weeks. The only thing might be the file open overhead for writing each line. You can certainly avoid that by sorting the bam/sam file before hand (samtools sort). This will sort all reads by alignment position and is a de facto standard required by many downstream programs anyway. After that, you can essentially open a new file once you see a new contig_name. Also look into the pysam module, this might be helpful, too.

*Edit: putting a bit more emphasis on relative, keeping Pierre's comment below in mind.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Carambakaracho1.1k
1

i'm not a python guy, but unless I'm wrong you're opening a new file (append mode) for every SAM record ? Opening a file costs a lot of I/O.

ADD REPLYlink written 4 weeks ago by Pierre Lindenbaum120k
6
gravatar for WouterDeCoster
4 weeks ago by
Belgium
WouterDeCoster39k wrote:

I wouldn't use python for this...

samtools index alignment.bam
samtools idxstats alignment.bam | cut -f 1 | grep -v '*' | parallel -j 8 --bar 'samtools view -o {}.bam alignment.bam {} && samtools fasta {}.bam | gzip > {}.fasta.gz'
  1. Create an index of a your bam file
  2. using idxstats, cut and grep -v get all contigs
  3. use gnu parallel to parallelize processes, adjust -j accordingly to your system
  4. for each contig: use samtools view to get the contig in a bam file, and samtools fasta to get the reads in fasta format
ADD COMMENTlink written 4 weeks ago by WouterDeCoster39k

I ran this commands:

samtools sort output.gfa1.onlymapped.sam -o output.gfa1.onlymapped.sorted.bam
samtools index output.gfa1.onlymapped.sorted.bam
samtools idxstats output.gfa1.onlymapped.sorted.bam | cut -f 1 | grep -v '*' | parallel -j 8 --bar 'samtools view -o {}.sam output.gfa1.onlymapped.sorted.sam {} && samtools fasta {}.sam > {}.fasta'

Unfortunately, all sam files are empty and no FASTA files have been generated. Please find below the log file:

[bam_sort_core] merging from 370 files and 1 in-memory blocks...
^M^M#   0 sec Backbone_8                                                            ^M0^M0% 0:4947=0s Backbone_8                                                                                                                 [main_samview] random alignment retrieval only works for indexed BAM or CRAM files.

...

99% 4946:1=0s Backbone_4947                                                     [main_samview] random alignment retrieval only works for indexed BAM or CRAM files.
100% 4947:0=0s Backbone_4947

What did I miss?

Thank you in advance,

ADD REPLYlink written 25 days ago by Ric220

First you have to run parallel with --dryrun and see if the comments which are printed are what you would expect.

Something that you could try (but is not what you originally asked for) is to write to bam in the parallel jobs, and also adjust the second part then which creates the fasta file. You may have to index the bam file first, so you could add that too:

samtools idxstats output.gfa1.onlymapped.sorted.bam | cut -f 1 | grep -v '*' | parallel -j 8 --bar 'samtools view -o {}.bam output.gfa1.onlymapped.sorted.sam {} && samtools index {}.bam && samtools fasta {}.bam > {}.fasta'
ADD REPLYlink modified 25 days ago • written 25 days ago by WouterDeCoster39k

I used your command but the files are still empty. The dryrun function created e.g. this command samtools view -o Backbone_1.bam output.gfa1.onlymapped.sorted.sam Backbone_1 && samtools index Backbone_1.bam && samtools fasta Backbone_1.bam > Backbone_1.fasta. What did I miss?

ADD REPLYlink written 25 days ago by Ric220

for trouble shooting splite the above command in its elements, for example:

samtools view -o Backbone_1.bam output.gfa1.onlymapped.sorted.sam Backbone_1

I'm pretty sure the Backbone_1 at the end is unnecessary

ADD REPLYlink written 25 days ago by Carambakaracho1.1k

Backbone_1 is a contig name in this case, so it is necessary to split the alignment per chromosome.

ADD REPLYlink written 16 days ago by WouterDeCoster39k

the input in that samtools view is a sam file output.gfa1.onlymapped.sorted.sam. The error you received is random alignment retrieval only works for indexed BAM or CRAM files.. That looks clear to me.

ADD REPLYlink written 25 days ago by WouterDeCoster39k
0
gravatar for Carambakaracho
4 weeks ago by
Carambakaracho1.1k
Switzerland/Basel
Carambakaracho1.1k wrote:

Exactly, as stated above with open(file) as f: reopens a file for each line, and removing this will increase performance, but I'd be surprised if this will increase the running time dramatically, like 2 hours instead of 2 weeks. There's something wrong in addition, but my first guess was some sort of I/O bottleneck, too

ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by Carambakaracho1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1035 users visited in the last hour