Let's imagine I'm developing a Python tool that can predict clinical phenotype for the gene YFG (Your Favorite Gene) which is located in chr1:1000-2000. It does so by first creating a VCF file from an aligned BAM file using
bcftools (more specifically via the Python package
pysam -- i.e.
pysam.bcftools) and then somehow mapping variant records to a known phenotype.
For variant calling the tool currently requires two input files from a user: a BAM file and a reference FASTA file. However, I want to update the tool so that it already includes a "mini" FASTA file that only contains the sequence of YFG which is 1 kb long. This means users are now required to provide their BAM only. I can't just put an entire reference FASTA file to the tool because it is usually huge (~3 GB) so the tool won't be portable anymore (portability matters to me a lot). Including mini FASTA will also improve the reproducibility of said tool because currently different users have slightly different FASTA files (e.g. different contig names such as
That being said, my question is: Is this feasible? I keep thinking there got to be a simple way to achieve my goal, but I can't think of a good way to actually pull this off.
For instance, if I were to simply use mini FASTA as described above,
bcftools will return most positions in chr1:1000-2000 as variants because every position has been left shifted by 1000. I know this because I have already tried it :(
I still think above approach could work if, for example, there is a way to tell
bcftools that the sequence in mini FASTA starts at chr1:1000. But no luck thus far. Any advice and recommendation would be highly appreciated.
Another possible workaround might be to re-align sequence reads from BAM to mini FASTA and then perform variant calling with mini FASTA. I actually think this will work quite well, but the major problem is that there is no Python package that can perform efficient and accurate read alignment (e.g.
bwa-mem). Of course, I can have users install a read aligner separately, but if I can I really want to package everything within the tool.
Thanks for reading this lengthy post!