Python Check Correct Bam Or Correct Fastq Format
Entering edit mode
11.0 years ago
dfernan ▴ 760


I'd like to write a python function that checks if a file is in the correct bam format, or in the correct fastq format.

I'd probably like to write two functions:

check_bam(bam_in) check_fastq(fastq_in)

that'd exit the script is bam_in or fastq_in are not in the correct format.

Any suggestions regarding these checks?

fastq bam python • 2.9k views
Entering edit mode
11.0 years ago
Sangwoo Kim ▴ 440

Basically, you can use pysam library to retrieve all the entries from your input. From the example, you would do

import pysam
samfile = pysam.Samfile( "ex1.bam", "rb" )
for alignedread in samfile.fetch('chr1', 100, 120): /* you may iterated for whole genome
     *do what you want to do with an object 'pysam.Alignedread'*

The class pysam.Alignedread has many pre-defined methods. All you need is there. For example, you can check read name using 'Alignedread.rname' or you can check other properties using the flag-driven methods ('is_proper_pair', 'is_qcfail' and so on)

'Correct' somehow sounds a little ambiguous to me, because different tools require different levels of strictness. For example, read-group must be defined for GATK tools. I think you can define your own strictness by combining some check points.

Entering edit mode
11.0 years ago

I don't think it would be that easy to write a small python function that checks if the bam file is correct or corrupt. It would be easy to write a small function if you want to check for sth specific but it would be tough if you want to check everything in bam file. For 'fastq' it would be relatively easy. People tend to use Picard to check if the bam format is correct or not. Pysam is a python library that helps you to manipulate SAM/BAM format. You can take its help to write your own functions.

Entering edit mode
11.0 years ago
lexnederbragt ★ 1.3k

(Bio)python fastq and sam/bam parsers should do at least some of the checks your interested in (e.g. whether the length of the quality info is the same as the sequence length). You could make some files that have the types of corruption you are after and just read them in with (bio)python to see that exceptions they throw.


Login before adding your answer.

Traffic: 2161 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6