Data to Insight: UK National Census 2011

3rd March 2014

A Brief Description of The Dataset

I will be using the UK National Census 2011 data that has been made publicly available by the Office for National Statistics (ONS). Although the data is relatively old, it was only released on 23rd January 2014. The fact (or statistical) data has been provided by ONS in CSV format but the lookup/reference data is supplied in a PDF document. The dataset is a 1% representative sample of all the Census 2011 results. ONS refer to it as a ‘Teaching file’ and it contains over 500,000 rows of data.

You can freely download the dataset and learn more about it here.

Applying Data Transformations Using Power Query

I will explain the first set of steps in detail for the Power Query related activities. All other activities and findings will be summarized with bullet points.

The first thing that I needed to do was to load the data into Excel, so I turned to Power Query. I chose the From File > From CSV data loading method found under the Power Query tab.

…then I chose the Census CSV file.

Power Query extracted all of the rows from the CSV file successfully. The data in row 1 isn’t really part of the dataset and it needed to be removed.

Enter Power Query’s Remove Top Rows transformation which allows us to remove unwanted rows starting from the top of the dataset. In this case, I only needed to remove a single row so I entered ‘1’ in the Number of rows field before proceeding.
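Behind the scenes, Power Query records each of these transformations as an expression in ‘M’, its formula language. As a rough sketch only, the first couple of steps of this query might look something like the following in the Advanced Editor (the file path and step names are illustrative, not the actual ones used):

let
    // Load the raw CSV file (the path shown here is a placeholder)
    Source = Csv.Document(File.Contents("C:\Data\2011 Census Teaching File.csv")),
    // Remove Top Rows: skip the single unwanted row at the top of the dataset
    RemovedFirstRows = Table.Skip(Source, 1)
in
    RemovedFirstRows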

The unwanted row was successfully removed.

The next problem was that the header row for the dataset was being treated like a data row. To fix this I used the Use First Row As Headers transformation and, like magic, each column was given the correct heading.
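In ‘M’ terms, this step is a Table.PromoteHeaders call. A minimal sketch, assuming the previous step was named RemovedFirstRows as above:

=Table.PromoteHeaders(RemovedFirstRows)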

At this point the dataset was nearly good enough to load into Power Pivot but before doing so, I needed to make sure that the correct data types were set for each column. I did this by selecting each column and then picking the correct data type in the Data Type field. It just so happened that all of the columns except for ‘Region’ and ‘Residence Type’ contained numeric data so I set Number as the data type for these. Power Query had already selected the Text data type for the ‘Region’ and ‘Residence Type’ columns which was spot on.

Those of you who have worked with Power Query before or who have been paying close attention to the images will have noticed that the Applied Steps list has been growing with each different data transformation that I applied. You can see in the image above that there is a new step in the Applied Steps list called ‘ChangedType’. What’s interesting to note is that regardless of whether I changed the column data type one column at a time or highlighted several columns to change in one go, it only recorded a single step for this transformation. One of many things that I really like about Power Query is how the Applied Steps list can be used to check how the data looked before and after a transformation step was applied. It also allows you to amend the transformation in the event that a mistake was made. At this stage the dataset had been transformed into a structure that I was happy with and the table had been given a more meaningful name than ‘Query1’; it was renamed to ‘UKNationalCensus2011’.
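For those who are curious, the ‘ChangedType’ step is a single Table.TransformColumnTypes call in ‘M’, no matter how many columns were selected when the data types were changed. Here is a hedged sketch showing just two of the numeric columns (the full column list is omitted and the previous step name is assumed):

=Table.TransformColumnTypes(FirstRowAsHeader, {{"Age", type number}, {"Health", type number}})

Using Int64.Type in place of type number should also work if you want whole numbers from the outset, which ties in with the Power Pivot point made further below.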

Notice in the image above that I unchecked the Load to worksheet and checked the Load to Data Model option under the Load Settings area. This was because I wanted Power Query to load the data straight into the Power Pivot model without also loading it into an Excel worksheet. The only thing left to do at this stage was to click Apply & Close to start loading the data into Power Pivot.

Power Query took under a minute to load the data into Power Pivot.

Let’s take a quick look at the Power Pivot window.

As you can see, the data is all there and the headers are correct. But what’s this…? The fields that were given the Number data type in Power Query have been loaded into Power Pivot with a Decimal Number data type? I did sort this out later on but this is a good time to point out that if a column will only ever contain whole numbers (i.e. integers) and you are loading them via Power Query, be sure to set them to the Whole Number data type after the table is first loaded into Power Pivot. This will contribute towards a smaller Power Pivot Model size and, consequently, a smaller workbook file size. It will also contribute towards a better performing model if the columns are used for aggregation calculations and/or to link tables together.

So far I’ve loaded the factual dataset (i.e. fact table) but what about the reference (i.e. lookup) tables? These were held in a PDF document so I had to take a slightly different approach; I simply did a copy and paste of the content from the PDF document into a blank worksheet in the Excel workbook. I’m sure that there are other more clever ways of getting the data into a Power Query accessible format, but this was sufficient in this scenario.

This is what the single column, mixed dataset looked like after being pasted into Excel…

Having placed the data into an Excel worksheet, I went straight for Power Query’s From Table data loading method. I de-selected the My table has headers checkbox before continuing with the load because the column contained mixed data.

The single column, mixed dataset was successfully loaded into Power Query. I already knew at this stage that it would be a lot more effort to get this into the right shape…but maybe not as much as you would initially think.
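Under the covers, the From Table load is a single Excel.CurrentWorkbook call in ‘M’. A minimal sketch, assuming Excel gave the pasted range the default table name of ‘Table1’:

=Excel.CurrentWorkbook(){[Name="Table1"]}[Content]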

Power Query has a very handy transformation called Keep Range which allowed me to get a sub-set of the rows in the dataset. At this point in time, I was only interested in the first reference table called ‘Region’. Take another look at the image above and you will notice that the values for ‘Region’ start at row 6 and end at row 15. So, for the First row field I entered ‘6’, and for the Number of rows field I entered ‘10’ as there were 10 rows of data in total, including row 6.
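In ‘M’, Keep Range corresponds to Table.Range. A minimal sketch, assuming the load step above was named Source; note that the offset is zero-based, so a First row of ‘6’ becomes an offset of 5:

=Table.Range(Source, 5, 10)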

The Power Query magic gave us back 10 rows as expected but there were still a few things that had to be done to this data before it was ‘Power Pivot worthy’.

Firstly, I needed to get rid of the ‘Region 10 ‘ text that preceded the valid data in row 1 so I turned to Power Query’s Replace Values transformation. In the Value To Find field I typed ‘Region 10 ‘, which I could have alternatively copied and pasted, and I left the Replace With field blank and clicked ‘OK’.

This effectively removed the unwanted text from row 1. Technically, the transformation is actually being applied to each row but, as you can see, none of the other rows contain the unwanted text.
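The equivalent ‘M’ expression is a Table.ReplaceValue call. A hedged sketch, assuming the previous step was named KeptRange and the column still had its default name of ‘Column1’ (note the trailing space in the value being searched for):

=Table.ReplaceValue(KeptRange, "Region 10 ", "", Replacer.ReplaceText, {"Column1"})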

After removing the unwanted text, the next issue was that both sets of data, the region’s unique identifier (i.e. ID) and name, were stored in a single column. I needed a way to split this data into two separate columns.

That’s where the Split Column transformation came in handy. You will notice in the image above that in each row, the ID and name values are separated by a single blank space. For the Select or enter delimiter field of the Split Column transformation, I selected Space, and for the Split option I selected At the left-most delimiter. This told Power Query to look for the first blank space in each row and use this to split the column.
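In ‘M’, this becomes a Table.SplitColumn call that uses a splitter function. A rough sketch, again assuming default step and column names; Splitter.SplitTextByEachDelimiter with a single space delimiter splits at the first (i.e. left-most) space only:

=Table.SplitColumn(ReplacedValue, "Column1", Splitter.SplitTextByEachDelimiter({" "}), {"Column1.1", "Column1.2"})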

Great! So at this point I had two columns, just as required, and the table looked a lot more like the reference table that it was meant to be. That said, I wasn’t too fond of the default column names so I used the Rename transformation on each column to give them meaningful names.
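The rename step is a single Table.RenameColumns call in ‘M’. The new column names below are purely illustrative and not necessarily the ones I used:

=Table.RenameColumns(SplitColumnByDelimiter, {{"Column1.1", "Region Code"}, {"Column1.2", "Region Name"}})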

After all of that, the Region table was finally ‘good enough’ to load into Power Pivot and you can see that I’ve renamed the table from ‘Table1’ to ‘Region’ and told Power Query to only Load to Data Model. I then happily clicked Apply & Close.

Within seconds, the 10 rows were successfully loaded into Power Pivot.

Let’s take a quick look at the new Region table in the Power Pivot window’s Data View…

…And in the Power Pivot window’s Diagram View

Creating and Refining The Power Pivot Data Model

I followed similar steps to the ones already described to load the other 16 reference tables into the Power Pivot Model. A number of things were done to create a user-friendly Power Pivot Model.

Power Query

  • Renamed the columns and tables (again) for consistency and to make the model more user-friendly.
  • Enhanced the ‘Approximate Social Grade’ reference table by adding the social grade descriptions (I grabbed the descriptions from Wikipedia). This was achieved by using the Insert Custom Column transformation and writing an ‘if…else’ expression in ‘M’, the Power Query formula language, that returned the corresponding description for each Social Grade.

Here’s the ‘M’ expression:

=Table.AddColumn(RenamedColumns, "Custom", each if [Approximated Social Grade] = "AB" then "Upper Middle Class and Middle Class"
else if [Approximated Social Grade] = "C1" then "Lower Middle Class"
else if [Approximated Social Grade] = "C2" then "Skilled Working Class"
else if [Approximated Social Grade] = "DE" then "Working Class and Non-Working Class"
else "No code required")

Power Pivot

  • Created the table relationships: This was quite straightforward because there was a single table with factual data and all of the other (reference) tables were used as look-ups; it was a simple star schema.
  • Set the data type to Whole Number for columns that were set to the Number data type in Power Query.
  • Hid all ID columns from the client tools; ID columns are not meaningful to the user and should always be hidden.
  • Removed the Person ID column from the fact table. With the column included, the Excel workbook was nearly 10 MB in size (which was still 3x smaller than the original CSV file) in comparison to a tiny 2 MB without it. This column was not useful from a reporting perspective, nor was it required for any table relationships, so it could safely be removed without causing any issues. It’s important to note that I removed this column using Power Query’s Remove Columns transformation to prevent it from being processed any further than Power Query (see the sketch after this list).
  • Set the Sort By Column property for all columns with values that should be displayed in a specific order.
  • Created a hierarchy in the ‘Hours Worked Per Week’ table with a Category -> Sub Category type structure (see the diagram below).
  • Defined three measures (i.e. Calculated Fields) against the Census fact table: ‘People’, ‘All People’ (this is a hidden measure), and ‘% of All People’ (derived from the ‘People’ and ‘All People’ measures). The DAX formulas are:

People := COUNTROWS('UK National Census 2011')
All People := CALCULATE([People], ALL('UK National Census 2011'))
% of All People := DIVIDE([People], [All People])
  • Set the Format property for each of the measures to ensure they are displayed in the correct format.
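As for the Person ID column removal mentioned in the list above, the ‘M’ step would look something like this minimal sketch (the preceding step name and exact column header are assumptions):

=Table.RemoveColumns(ChangedType, {"Person ID"})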

Getting Insight from Data Visualisations in Power View

Two dashboards were created in Power View. Since pictures are worth a thousand words, I’m not going to describe what was done but I have provided images. As the dataset is a 1% representative sample of all the Census 2011 results, I have used percentages in the Power View dashboards as opposed to actuals. This will allow us to potentially draw somewhat useful conclusions from this relatively small dataset.

The ‘Gender Analysis’ dashboard in the above image tells us that, out of all the Census 2011 participants that were in very good health, worked 49 or more hours a week, were in a professional occupation, and lived in London:

  • Most males and females were between the ages of 25 and 34, followed by ages 35 to 44.
  • There were more single men than women.
  • There were roughly twice as many married men as married women.
  • The most common industries for people that believed they were in very good health were the following: Real estate activities; Professional, scientific and technical activities; Administrative and supportive services activities; Education.

The ‘Health Analysis’ dashboard in the above image tells us that, out of all the Census 2011 participants that were aged 25 to 34, Self-employed, worked 31 to 48 hours a week, and lived in the South West UK:

  • The majority were in skilled trades occupations.
  • Skilled trades occupations had roughly twice the number of people who believed they were in very good health in comparison to those who believed they were in good health.
  • An individual’s social grade did not have a significant impact on whether they believed they were in very good health.
  • In comparison to the other marital statuses, there may be a correlation between being separated and the belief that an individual is only in fair health.

Can you think of any other conclusions that can be drawn from these visualisations? Imagine how much more insight you can get just from changing the filters…

Do you want to see first-hand what other questions the data can answer? Or do you simply want to get your head around what has been done? You can freely download and share the workbook. We only ask that you credit us and the Office for National Statistics, and honour the Open Government Licence (OGL).

Download the workbook from the Microsoft TechNet Gallery.

You will need the version of Excel 2013 that comes with Office Professional Plus 2013, Office 365 ProPlus, or the stand-alone version in order to successfully open this workbook. You can download an evaluation version of Office 365 ProPlus here. You will also need to install Power Query, and enable the Power Pivot and Power View add-ins. This workbook was created on, and tested with, the 64-bit version of Excel on a laptop with a 1.4 GHz Core i3 (dual-core) processor and 4 GB of RAM. Therefore, we recommend that you open this workbook on a PC with an equivalent or better specification.

Follow us on Twitter

Follow @nimblelearn to stay up to date with our latest blog posts, Business Intelligence-related news, and thoughts.

It’s Arrived! Power BI Has Landed!

11th February 2014

After months of playing with the Power BI Preview, we are excited that Microsoft has announced its general availability.

Read all about it here.

Power BI General Availability