Querying Gemini database by using genomic coordinates from a bed file.
3.0 years ago
eDNAuRNA ▴ 20

Hi everyone,

I have a bed file (tab separated columns) with hundreds of genomic coordinates as follows.

chr1    88833393    88834022    EXr19   1   +
chr1    22531002    22531628    EXr20   1   +
chr1    10355070    10355696    EXr21   1   +

I am trying to query a gemini database by using a genomic region based query as follows.

gemini query --header --show-samples --region 1:88833393-88834022 -q "select * from variants" gemini.db >> output.tsv

Is there a way to automatically generate one query for each genomic coordinate in the bed file? Any quick help would be appreciated.

Thanks

Gemini Database Query bed • 646 views
3.0 years ago

Hello,

you could use GNU parallel for this:

$ parallel --dry-run --colsep "\t" 'gemini query --header --show-samples --region {1}:{2}-{3} -q "select * from variants" gemini.db' :::: regions.bed >> output.tsv

Remove --dry-run once you're happy with the commands it prints.

fin swimmer


Hi Fin,

Thanks a bunch. This worked like a charm. I need one quick modification: the query shouldn't include the "chr" prefix from the bed file. The code you shared keeps "chr" in the generated region, and the query won't work that way. Can you please suggest how to drop "chr"? Right now the following query is being generated.

gemini query --header --show-samples --region chr1:88833393-88834022 -q "select * from variants" gemini.db >> output.tsv

Secondly, can you please explain how the code you suggested actually works? If you don't have time, please point me to a tutorial. Thirdly, what does --dry-run do, and what will happen if I remove it?

Thanks again, I am very close to solving a problem I have been facing for two months.

Cheers,


Hello,

A good introduction to parallel is here on Biostars :)

What my code does is run the command between the single quotation marks once for each line in regions.bed. With --colsep "\t" we also tell parallel that each line holds multiple arguments separated by a tab. That lets us use the placeholders {n} in the command, where {1} is the first column, {2} the second, and so on.

With --dry-run we tell parallel not to execute the commands but to print them instead. This is a good way to check that all the input parameters are parsed correctly. To actually execute the commands, remove the option.
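To make the idea of "one command per line, with columns filled in" more concrete, here is a rough plain-shell sketch of what the parallel call does, minus the parallelism. The function name `run_per_region` and the `DRY_RUN` variable are made up for this illustration; they are not part of parallel or gemini. It assumes a tab-separated bed file and runs the regions one after another.

```shell
# Hypothetical helper: emulate "one command per BED line" in plain shell.
# DRY_RUN=1 (the default here) only prints each command, like parallel --dry-run;
# DRY_RUN=0 would actually run gemini, sequentially rather than in parallel.
run_per_region() {
    bed_file=$1
    # Split each line on tabs: columns 1-3 play the role of {1}, {2}, {3}.
    while IFS="$(printf '\t')" read -r chrom start end _rest; do
        region="${chrom}:${start}-${end}"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            printf '%s\n' "gemini query --header --show-samples --region ${region} -q \"select * from variants\" gemini.db"
        else
            # </dev/null keeps the command from consuming the loop's input.
            gemini query --header --show-samples --region "${region}" \
                -q "select * from variants" gemini.db < /dev/null
        fi
    done < "$bed_file"
}

# Small example input in the same layout as the question's bed file.
bed=$(mktemp)
printf 'chr1\t88833393\t88834022\tEXr19\t1\t+\n' > "$bed"
run_per_region "$bed"
```

With the default dry-run setting this only prints the command it would run, so nothing touches gemini.db until `DRY_RUN=0` is set.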

To get rid of the chr we can use sed and pipe the result to parallel:

$ sed 's/^chr//' regions.bed | parallel --dry-run --colsep "\t" 'gemini query --header --show-samples --region {1}:{2}-{3} -q "select * from variants" gemini.db' >> output.tsv
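As an alternative sketch, the prefix removal and the region formatting can be done in one awk step, which makes it easy to inspect the region strings before handing them to any command. The function name `bed_to_regions` is hypothetical, and the example assumes a tab-separated bed file with the chromosome, start, and end in the first three columns.

```shell
# Hypothetical helper: turn BED lines into "chrom:start-end" strings,
# removing a leading "chr" from the chromosome column only.
bed_to_regions() {
    awk -F '\t' '{ sub(/^chr/, "", $1); print $1 ":" $2 "-" $3 }' "$1"
}

# Small example input in the same layout as the question's bed file.
bed=$(mktemp)
printf 'chr1\t88833393\t88834022\tEXr19\t1\t+\nchr1\t22531002\t22531628\tEXr20\t1\t+\n' > "$bed"
bed_to_regions "$bed"
```

Each printed line is a complete region string, so it could be piped to parallel and used through the single placeholder {} after --region instead of {1}:{2}-{3}.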

fin swimmer


Hi Fin,

You are amazing. It's working perfectly :)

Thanks a bunch. Have a great weekend.

Best,
