I am a beginner in bio and python. Following is my assignment A python program that take following command line inputs:
a)a filename expecting aligned sequencing data in form of a bam file)
b)a genomic coordinate expecting a SNP coordinate in the form of chr:position
c)a minimum quality score as an int value use pysam and investigate the following properties and present them as the output to the command line
d)How many reads in the input bam file a) overlap the genomic coordinate specified as the input value b)?
e)Which nucleotides (A,C,G,T,N) are supported by those reads and how many times each? Provide two values per base. The total number of reads that support the base and have a base quality of atleast c)?
f)What is the strand support of each base(what proportion of reads maps to the forward strand of the reference)? Again total and with the minimum base quality.
My code is below, I wanted to know if i am going right and how shall i proceed further. Thank you.
import pysam import sys import os #Checking command line arguments if len(sys.argv) < 4: print ("Files not provided") sys.exit(0) #command line arguments sequence bam_file= sys.argv chromosomecoordinate=sys.argv qualityscore=int(sys.argv) try: #bam file is 1 base, pysam is 0 base snipposition=int(chromosomecoordinate.split(':'))-1 #taking the first part now chromosomenumber=chromosomecoordinate.split(':') print chromosomenumber except IndexError: print"Error" bamfile=pysam.AlignmentFile(bam_file,'rb') #intialising a list and five count variables base= for aRead in bamfile.fetch(chromosomenumber,int(snipposition),int(snipposition+1)): #print aRead start_pos=aRead.reference_start print start_pos a_count=0 c_count=0 g_count=0 t_count=0 n_count=0 base=aRead.query_sequence #print base for i in base: if i=="A": a_count+=1 elif i=="C": c_count+=1 elif i=="G": g_count+=1 elif i=="T": t_count+=1 elif i=="N": n_count+=1 print(a_count,c_count,g_count,t_count,n_count)