I'm trying to do a SQL request on the BigQuery Google service to search for family names present in my sample DRR000836, and more precisly, on the cyanobacteria phylum part but I'm not sure how to do it... Here are the 2 SQL requests that I would like to merge to have what I really want:
SELECT m.acc, tax.tax_id, tax.name AS order_name, tax.self_count, tax.total_count, tax.total_count * m.avgspotlen AS pb, tax.total_count * m.avgspotlen / (m.mbases * 1000000) * 100 AS percentage FROM `nih-sra-datastore.sra.metadata` AS m JOIN `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` AS tax ON m.acc = tax.acc WHERE m.acc = tax.acc AND m.acc='DRR000836' AND m.librarysource = 'METAGENOMIC' AND (tax.rank ='order' OR tax.ilevel = 1) AND assay_type = 'WGS' ORDER BY pb DESC;
acc,tax_id,order_name,self_count,total_count,pb,percentage
DRR000836,131567,cellular organisms,162,257755,133774845, 20.331 DRR000836,356,Rhizobiales,8409,134661,69889059, 10.621 DRR000836,80840,Burkholderiales,509,71175,36939825, 5.614 DRR000836,2158,Methanobacteriales,9,4735,2457465, 0.373 DRR000836,72274,Pseudomonadales,5,4413,2290347, 0.348
Here, I have the families' name but I want to add a condition that the families has to come from the cyanobacteria phylum (tax_id=1117)
This command gives every phyla present in the sample:
SELECT class_tax.name FROM `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` AS class_tax , `nih-sra-datastore.sra.metadata` AS m WHERE class_tax.acc = m.acc AND m.acc='DRR000836' AND class_tax.rank = 'phylum'
I tried (but failed) to merge the 2 conditions but it only split my lines into multiple others (I will have multiple lines with cellular organisms where only the phylum change for example).
SELECT m.acc, tax.tax_id, tax.name AS order_name, class_tax.name AS phylum_name, tax.self_count, tax.total_count, tax.total_count * m.avgspotlen AS pb, tax.total_count * m.avgspotlen / (m.mbases * 1000000) * 100 AS percentage FROM `nih-sra-datastore.sra.metadata` AS m JOIN `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` AS tax ON m.acc = tax.acc LEFT JOIN `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` AS class_tax ON class_tax.acc = tax.acc AND class_tax.rank = 'phylum' WHERE m.acc='DRR000836' AND m.librarysource='METAGENOMIC' AND (tax.rank='order' OR tax.ilevel=1) AND assay_type='WGS' ORDER BY pb DESC;
870 DRR000836 131567 cellular organisms Chytridiomycota 162 257755 133774845 20.330523556231
949 DRR000836 131567 cellular organisms Cyanobacteria 162 257755 133774845 20.3305235562
At the end, I want only lines Nostocales, Synechococcales, (maybe Oscillatoriophycideae). Can somebody help me?
Thank you for your answer but unfortunately I tried it and it makes the same kind of ouptut as mine, with multtiple new (false) lines (splitting my initial 79 lines into 2054). I just have a new column with the phylum name but no logic between the phylum and the order, like in the first command (for example, primates are present in the cyanobacteria) Here is the modified command :
will give:
To put it more simply with another example, I try to display every possible order for every phyla founded in my run but I don't find the logic way to ask it. When I do this kind of command, I have the same problem that the other commands, with only a duplication of my lines, without any logic link between my phyla and order.
Line phylum order_name
1 Acidobacteria Enterobacterales
79 Actinobacteria Enterobacterales