SQL request from NCBI metadata and stat_analysis tables
1
0
Entering edit mode
4 months ago
marie.harmel ▴ 10

I'm trying to do a SQL request on the BigQuery Google service to search for family names present in my sample DRR000836, and more precisly, on the cyanobacteria phylum part but I'm not sure how to do it... Here are the 2 SQL requests that I would like to merge to have what I really want:

SELECT m.acc, tax.tax_id, tax.name AS order_name, tax.self_count, tax.total_count, tax.total_count * m.avgspotlen AS pb, tax.total_count * m.avgspotlen / (m.mbases * 1000000) * 100 AS percentage FROM `nih-sra-datastore.sra.metadata` AS m JOIN `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` AS tax ON m.acc = tax.acc WHERE m.acc = tax.acc AND m.acc='DRR000836' AND m.librarysource = 'METAGENOMIC' AND (tax.rank ='order' OR tax.ilevel = 1) AND assay_type = 'WGS' ORDER BY pb DESC;
  • acc,tax_id,order_name,self_count,total_count,pb,percentage

    DRR000836,131567,cellular organisms,162,257755,133774845, 20.331 DRR000836,356,Rhizobiales,8409,134661,69889059, 10.621 DRR000836,80840,Burkholderiales,509,71175,36939825, 5.614 DRR000836,2158,Methanobacteriales,9,4735,2457465, 0.373 DRR000836,72274,Pseudomonadales,5,4413,2290347, 0.348

Here, I have the families' name but I want to add a condition that the families has to come from the cyanobacteria phylum (tax_id=1117)

This command gives every phyla present in the sample:

   SELECT class_tax.name FROM `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` AS class_tax , `nih-sra-datastore.sra.metadata` AS m WHERE class_tax.acc = m.acc AND  m.acc='DRR000836' AND class_tax.rank = 'phylum' 

I tried (but failed) to merge the 2 conditions but it only split my lines into multiple others (I will have multiple lines with cellular organisms where only the phylum change for example).

SELECT m.acc, tax.tax_id, tax.name AS order_name, class_tax.name AS phylum_name, tax.self_count, tax.total_count, tax.total_count * m.avgspotlen AS pb, tax.total_count * m.avgspotlen / (m.mbases * 1000000) * 100 AS percentage FROM `nih-sra-datastore.sra.metadata` AS m JOIN `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` AS tax ON m.acc = tax.acc LEFT JOIN `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` AS class_tax ON class_tax.acc = tax.acc AND class_tax.rank = 'phylum' WHERE m.acc='DRR000836' AND m.librarysource='METAGENOMIC' AND (tax.rank='order' OR tax.ilevel=1) AND assay_type='WGS' ORDER BY pb DESC;

870 DRR000836 131567 cellular organisms Chytridiomycota 162 257755 133774845 20.330523556231

949 DRR000836 131567 cellular organisms Cyanobacteria 162 257755 133774845 20.3305235562

At the end, I want only lines Nostocales, Synechococcales, (maybe Oscillatoriophycideae). Can somebody help me?

SQL NCBI BigQuery • 470 views
ADD COMMENT
0
Entering edit mode
4 months ago
zorbax ▴ 610

I think what you are looking for is the LEFT JOIN approach, where you can select a specific phylum. Have you tried this (next you should organize your code vertically for a better display) :

SELECT
    m.acc,
    tax.tax_id,
    tax.self_count,
    tax.total_count,
    tax.total_count * m.avgspotlen AS pb,
    tax.total_count * m.avgspotlen / (m.mbases * 1000000) * 100 AS percentage,
    class_tax.name AS phylum_name
FROM
    `nih-sra-datastore.sra.metadata` AS m
JOIN
    `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` AS tax
    ON m.acc = tax.acc
LEFT JOIN
    (SELECT name, acc FROM `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` WHERE rank = 'phylum') AS 
class_tax
    ON m.acc = class_tax.acc
WHERE
    m.acc = 'DRR000836'
    AND m.librarysource = 'METAGENOMIC'
    AND (tax.rank = 'order' OR tax.ilevel = 1)
    AND assay_type = 'WGS'
ORDER BY
    pb DESC;
ADD COMMENT
0
Entering edit mode

Thank you for your answer but unfortunately I tried it and it makes the same kind of ouptut as mine, with multtiple new (false) lines (splitting my initial 79 lines into 2054). I just have a new column with the phylum name but no logic between the phylum and the order, like in the first command (for example, primates are present in the cyanobacteria) Here is the modified command :

SELECT
        m.acc,
        tax.tax_id,
        tax.self_count,
        tax.total_count,
        tax.total_count * m.avgspotlen AS pb,
        tax.total_count * m.avgspotlen / (m.mbases * 1000000) * 100 AS percentage,
        class_tax.name AS phylum_name,
        tax.name AS order_name  
    FROM
        `nih-sra-datastore.sra.metadata` AS m
    JOIN
        `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` AS tax
        ON m.acc = tax.acc
    LEFT JOIN
        (SELECT name, acc FROM `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` WHERE rank = 'phylum') AS class_tax
        ON m.acc = class_tax.acc
    WHERE
        m.acc = 'DRR000836'
        AND m.librarysource = 'METAGENOMIC'
        AND (tax.rank = 'order' OR tax.ilevel = 1)
        AND assay_type = 'WGS'
        AND class_tax.name ="Cyanobacteria"

ORDER BY
    pb DESC;

will give:

Line acc tax_id self_count total_count pb percentage phylum_name order_name
16 DRR000836 9443 0 429 222651 0.033837537993920974 Cyanobacteria Primates
17 DRR000836 204455 0 325 168675 0.02563449848024316 Cyanobacteria Rhodobacterales
ADD REPLY
0
Entering edit mode

To put it more simply with another example, I try to display every possible order for every phyla founded in my run but I don't find the logic way to ask it. When I do this kind of command, I have the same problem that the other commands, with only a duplication of my lines, without any logic link between my phyla and order.

SELECT phylum_tax.name AS phylum, order_tax.name AS order_name
FROM `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` AS phylum_tax
LEFT JOIN `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` AS order_tax
ON phylum_tax.acc = order_tax.acc AND order_tax.rank = 'order'
WHERE phylum_tax.acc = "DRR000836" AND phylum_tax.rank = 'phylum';

Line phylum order_name

1 Acidobacteria Enterobacterales

79 Actinobacteria Enterobacterales

ADD REPLY

Login before adding your answer.

Traffic: 1937 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6