Walkthrough III: Number of Genes/Alleles

In the database, after submissions are received and put through the pipeline, they are assigned an allele name. The allele name is based off of their Gene and how their DNA sequence is different to the reference sequence of that gene.

I wanted to look at a breakdown of these genes in the database. Who submitted them, when they were submitted, how the numbers of one gene compared to another gene, what is the break-down of class I and class II genes?

This gives me a few potential graphs to make and play with, which I will work on now.

Updated SQL:

To create the graphs that I want to make, I will need to get an updated dataset that includes the allele name that each submission is assigned. I also included the status of the submission, as some have not been processed as they have issues with them, but I think it would be good to include them.

SELECT se.subid||CHR(9)||se.submitted||CHR(9)||se.source||CHR(9)||ae.allele_name||CHR(9)||se.status AS concatenated_columns
FROM submission_entry se
LEFT JOIN allele_entry ae ON se.alleleid = ae.alleleid
WHERE se.hostdb = 'HLA';

For this I used a LEFT JOIN instead of an INNER JOIN, as I thought that submissions that haven’t been processed wouldn’t have an assigned name, and the LEFT JOIN would pick these submissions up anyways. However, it turns out that submissions that aren’t named are given the name “Unassigned”, so it wasn’t necessary anyways. I had to use a join as well as the submission_entry table does not include the allele name, this is in the table allele_entry table. The produced table looks like this:

Untitled

After this I added in new columns to find the month and year of the submitted date. In addition, I would need to find out the gene of each allele (the gene is the letters before the asterisk). To do this I created a new measure “Gene”:

Gene = IFERROR(LEFT(Sheet1[Allele], FIND("*", Sheet1[Allele])-1), "")

This DAX formula will search the Allele column and find all of the characters in allele before the character “” appears. I have an IFERROR as the formula was not working for entries with “Unassigned” as there was no “”, so instead would return a blank entry.

Making a graph

I used these new columns and made a pie chart, using the gene as the legend and year as the values. I added detail labels as well to go outside each of the slices, as it makes it easier to read and a Year slicer as well. This resulted in a graph looking like this:

Untitled

Overall, this is is quite ugly and difficult to read. The default colour choices are very varied and jarring, a bit too much on the eyes. At this point I decided to look up some ideas on colour theory for different graphs. One online article I found pointed out that although these are different sets of data (different genes), they do not need to be different colours to convey information, so you need to think carefully about what colours you use and what information you wish to present.

With this in mind I split the genes into their different biological classifications (Class I, Class II and Other) and assigned them colours based off of this. This results in a more organised visual, and still conveys useful biological information to the user.

Untitled