In this section I am talking through updating the SQL query that I used to get the data for this report. Currently I am only getting a number of columns in SQL, and then creating calculated columns in Power BI using this data. My aim is to improve upon this and collect this same data in SQL and not rely on calculated columns. This will reduce the memory load on the report as it will no longer need to calculate a number of columns.

Currently I am using calculated columns to find the: Month_Submitted, Month_Assigned, Year_Submitted, Year_Assigned, Gene, Class and Processing_Time.

Changing some of these will be an easy SQL query, others will require more complex subqueries.


This is what my current SQL query looks like:

select submission_entry.subid AS Subid,
               submission_entry.submitted AS Submitted,
               submission_entry.assigned AS Assigned,
               submission_entry.status AS Status,
               submission_entry.source AS Source,
               allele_entry.allele_name AS Allele,
               submittor_entry.fullname As Submitter,
               submittor_address.country AS Country 
        from submission_entry
        left join allele_entry on submission_entry.alleleid = allele_entry.alleleid
        left join submittor_entry on submission_entry.submittor_ID = submittor_entry.submittor_ID
        left join submittor_address on submission_entry.submittor_ID = submittor_address.submittor_ID
        where submission_entry.hostdb = 'HLA';

The first set of calculate columns I’d like to update are the Month/Year Submitted/Assigned columns. This can be done by extracting the relevant information from the date submitted and date assigned fields. This will look like the following:

EXTRACT(MONTH FROM submission_entry.submitted) AS Month_Submitted,
EXTRACT(YEAR FROM submission_entry.submitted) AS Year_Submitted,
EXTRACT(MONTH FROM submission_entry.assigned) AS Month_Assigned,
EXTRACT(YEAR FROM submission_entry.assigned) AS Year_Assigned,

Next I want to find the Gene in each submission. This can be done by stripping data away from the allele field. In HLA, the allele name is made up of the gene and a unique identifier.

A*01:01:01:01 (allele) is from the gene A

DRB1*01:01:01:01 (allele) is from the gene DRB1

This pattern is followed by all HLA, KIR and other genes we have in the database. So to find the gene I will need to strip all characters after and including the *. As the gene can be 1, 3 or 4 letters, we will need to use a combination of SUBSTR and INSTR. SUBSTR is used to find the starting character we want and INSTR to find the end character.

SUBSTR(allele_entry.allele_name, 1)

Will find the position of the first character, which is what we want.

INSTR(allele_entry.allele_name, '*')

Will find the location of the *. However, I want the position of the character just before the *:

INSTR(allele_entry.allele_name, '*') - 1)

When combined, it will look like this:

SUBSTR(allele_entry.allele_name, 1, INSTR(allele_entry.allele_name, '*') - 1) AS Gene

The next query I want to write is to find the class of each gene. In HLA there are 3 classes: class I, class II and other, this has a basis on their biology. There are several genes in each character: