I hope it's ok to ask this here... could anyone help a total beginner to either use LiftoverVCF in docker to remap VCF files, or set up one of the open source, user friendly VCF filtering tools?
I want to analyse my own personal WGS as I have severe health problems that may be genetic, but can't afford a full clinical service. I have some knowledge of genetics from the medical side, but am completely new to bioinformatics - and new even just to linux or containers.
I've been trying to learn as I go but keep getting stuck. So far I've tried:
Mendel MD - I'd really like to use this but my data is GRCh38, and at the moment Mendel MD only works with GRCh37. My files are too big for the NCBI remapper, even if I cut them down to one chromosome per file, so I've been trying to use LiftoverVCF instead but can't work out what I'm doing wrong.
VCF-miner - docker setup seemed to go ok but I get stuck at the final step and can't access it through a browser.
Seqr - sounds good but so far haven't been able to set up minikube or install kubectl.
BrowseVCF and VCF.Filter - set up ok but don't work well with my files and/or so user-unfriendly that effectively unusable.
I have Windows 10 with Ubuntu on the Linux Subsystem for Windows, with docker able to run through either Windows or Ubuntu.
Please - could anyone talk me through one of these?
Let me start out with saying that an online Q&A forum is not suited for solving personal health questions and for sound conclusions you really need a professional to look into your data. We can however try to put you on the right track, but that's entirely your responsibility and I want you to acknowledge that before we continue.
To me it is unclear what exactly you aim to achieve, since you mention BrowseVCF, VCFfilter and liftover, which do not serve the same purpose.
Hi thanks for replying. That's ok, I understand about not asking personal health questions here.
Liftover would be so that I can turn my GRCh38 data into GRCh37 in order to use Mendel MD as my filtering tool, since that does not yet work with GRCh38. I know the developer is working on the update but don't know when it will be ready, and he suggested remapping to get started in the meantime.
Or I could use a different filtering tool - if I can get one working.
My main requirements are that it can filter very large VEP/SnpEff (or similar) annotated files by gene symbol and impact predictions, and that it is user friendly.
I haven't used it myself, but perhaps gemini is a solution http://gemini.readthedocs.io/en/latest/
Thank you, I will have a look.
I'm sorry, I just read this in the manual:
So seems this one is not suitable either.
I assume you haven't done the alignment yourself? If so, you could convert the reads back to fastq and repeat the alignment to the correct reference/genome build. That is always more correct than using liftover for transforming variants to another genome build.
Ah, thanks for checking that out, saved me a lot of time. I didn't do the alignment, only have the files and would probably have to pay to get them redone. If I can't get anywhere fairly soon I shall ask anyway though - it's so frustrating to have the data and not be able to analyse it.
I'm not sure how powerful your computer is, but alignment isn't too difficult.
But let's take a step back: you have an annotated vcf and want to filter on a set of gene names?
That's right, also on predicted impact. My computer is a 2 year old Thinkpad, Intel i7-5600U CPU @ 2.60GHz, 12 GB RAM. Could that cope with alignment do you think?
I'd first like to see if we can do some "manual" filtering. That would be quicker. How many genes are in your list?
By chance experience with unix tools such as grep or programming in python? No worries if you don't.
I'll follow up here in a couple of hours, past midnight in my time zone.
Sorry for delayed reply - I'd reached the posting limit as a new member so it wouldn't let me post again last night.
3 or 4 genes are main priority, then another 10 or so, then quite a few more "possibles".
I'd like to practise anything new with a particular gene where I already know some of the variants but which definitely isn't the cause of my health problems. That would give me confidence in what I'm doing before trying the others, as I will find those quite emotionally difficult because of the implications both if I do find what I'm looking for and if I don't.
If I can either liftover or realign then I think that'd be best. I've been trying out Mendel MD with another file and it seems to be just what I need - if I only had the matching alignment.
Thanks for all your help, will check in later this afternoon.
Hi, don't know if you're around today... do you think I might be able to realign the VCF myself? I've been trying LiftoverVCF again, this time in Ubuntu on the Windows Subsystem for Linux, but just keep getting errors and don't know enough linux or java to know what I'm doing wrong.
What you aim to do is not (re)alignment, but liftover. As I mentioned before it will work, it will be correct for the majority of the positions, but still a realignment of your reads would be the most accurate.
Since you insist on converting to a reference build which you can use with Mendel MD this might be your best strategy.
I'd suggest being a lot more specific with regard to which commands you used and which errors you obtained. It would probably be the best to open a new thread for this so we can help you.
I would also favor a re-alignment to whatever reference genome that is required by the other programs that you wanted to use. The company who did the sequencing for you should be able to make the raw data available to you, but you should request that fairly soon as they may have some policy of deleting large files after some time. It's your DNA, you paid for the service, and therefore the data should be yours. Request the FASTQ files and/or the BAM file.
Most of us could work with the GRCh38-aligned data but it would require some expertise in bioinformatics in order to annotate the variants and infer functional impact. Also, the interpretation of variants in a clinical context is just not something trivial. Beginners get excited when they see missense variants 'all over the place', coupled with 100s of nonsense mutations and frameshift indels. The majority of these, however, may have no adverse effect on phenotype and be virtually 'functionless'. So, you have to be aware of that, too. Some companies in California have been sued for providing misleading clinical information, in this regard.
Your computer is certainly powerful enough to conduct a re-alignment to the human genome. If you got the raw FASTQ files, then we could possibly walk you through it.
If you still want to go the 'liftover' route, then CrossMap may be a better choice.
I actually already have the BAM file so it'd be great if you could show me how to re-align from that.
Re interpretation, my plan was to first see if I have any of the known pathogenic mutations in the genes most strongly implicated by the clinical presentation. There is a fairly strong suspicion of a particular type of mendelian disorder, so that would be enough for my consultant to give a definite diagnosis. If I don't have those then I would next look to see if there are any novel predicted-high-impact variants in those genes, which my consultant's geneticist colleague could analyse in more depth, or any of the known disease-associated variants in certain other genes, where I do not have the classic condition but where atypical forms have been reported to have a phenotype like mine.
The stage after that would be to browse through the full set of high impact variants, to see if there are any in genes that I hadn't initially thought of but where the clinical presentation would be consistent with a functionally relevant mutation. At that point I would obviously have to consult a clinical geneticist if I can afford it.
Anyway - I'd be very grateful if you could talk me through aligning to GRCh37 from the BAM file.
To convert the BAMs back to FASTQ, you can use bamtofastq. Once you get the FASTQs created, can I recommend that you open a new question about how one can conduct a simple NGS data analysis experiment? That way, others will be brought into the conversation.
Thank you, that's really helpful. I'm just working out how to use bamtofastq then will open a new question as you suggest. I will get there!
Hello again, one more quick question before I open the new post: I've run bamtofastq with just the default command - would it be worth running it again with any of the options, e.g. -fq2? My sequence is from chromium 10x linked read. Thanks again for your help.
I see. So, you have the potential to do whole genome haplotype-phasing - that's a luxury. I would recommend asking 10x for advice. Ideally, for these reads, you should actually be using their proprietary pipeline called LongRanger, but it is very compute intensive and will not run on your computer.
We were not originally aware that you have 10x data.
Ah, sorry, I didn't realise that would cause difficulties. The GRCh38 VCFs I've already got include the LongRanger files and also a few others from a different pipeline (possibly GATK?), but I think he did do some things slightly differently from usual to work better with the chromium data.
Yes, it does give much greater scope and I'm looking forward to exploring things in more depth over time. Even in the short term it will be a big help, as the conditions I'm looking at are mostly autosomal recessive with significant numbers of structural variants, and a couple where pseudogenes or repetitive regions can be an issue with standard short read sequencing.
Anyway, I'll see what I can find out about aligning without LongRanger, then start a new question.
Hi, thank you and sorry for slow reply - I'm not well and can't always use the computer.
I said either liftover or realign because I thought you had said it might be possible to take it back to fastq and align to GRCh37 myself, if my computer could handle it. I may well have misunderstood that though.
Or I could try a different filtering tool, e.g. VCF-Miner or Seqr, if I can find someone willing to talk me through the setup.
Meantime I've given up on the docker install of LiftoverVCF and installed it in Ubuntu 18.04 on the Linux Subsystem for Windows instead - but can't tell if I'm doing it right or not so am just posting on the GATK forum to see if anyone can help. I understand that lifting over is less accurate than realigning, but I don't know what else to do.
Do you think I should keep trying liftover or would one of the other options be better - and if so could you talk me through it?
Since you already have an annotated vcf file and are only interested in a couple of genes, filtering shouldn't be too hard. I would do some "manual filtering" to get quickly to your result.
If I understood correctly your vcf has been annotated using VEP, or SNPeff, or similar?
In that case I would just the unix tool grep for the genes you are looking for. Since that's only a few then the numbers of variants would definitely be manageable. Example:
or with a list of genes in
Note that you would lose the header here, which you could solve using:
We hate to recommend it because it's not a proper tool for data analysis and bioinformatics, but this looks like a situation in which you can use excel afterwards.
Lol it had occurred to me that if I really couldn't get anywhere I might have to resort to excel, but it'd be a very long, tedious, error-prone process. I haven't explained very well, but the initial couple of genes are only the first step. Even if I do have clear evidence of the most strongly indicated disorder, it still wouldn't explain my full condition. It is quite likely that I will have to look at other genes too, which would be infinitely easier with a good filtering tool. The big draw of Mendel MD is that it's specifically designed for clinicians rather than research scientists, who know their patient but have no experience of bioinformatics.