Deriving a Score to Show Relative Socio-Economic Advantage and Disadvantage of a Geographic Area | by Jin Cui | Dec, 2023


There exists publicly accessible data which describes the socio-economic characteristics of a geographic location. In Australia, where I reside, the Government through the Australian Bureau of Statistics (ABS) collects and publishes individual and household data on a regular basis in respect of income, occupation, education, employment and housing at an area level. Some examples of the published data points include:

  • Percentage of people on relatively high / low income
  • Percentage of people classified as managers in their respective occupations
  • Percentage of people with no formal educational attainment
  • Percentage of people unemployed
  • Percentage of dwellings with four or more bedrooms

Whilst these data points appear to focus heavily on individual people, they reflect people’s access to material and social resources, and their ability to participate in society in a particular geographic area, ultimately informing the socio-economic advantage and disadvantage of this area.

Given these data points, is there a way to derive a score which ranks geographic areas from the most to the least advantaged?

The goal of deriving a score could be formulated as a regression problem, where each data point or feature is used to predict a target variable, in this scenario a numerical score. This requires the target variable to be available in some instances for training the predictive model.

However, as we don’t have a target variable to start with, we may need to approach this problem in another way. For instance, under the assumption that each geographic area is different from a socio-economic standpoint, can we aim to understand which data points help explain the most variation, thereby deriving a score based on a numerical combination of these data points?

We can do exactly that using a technique called Principal Component Analysis (PCA), and this article demonstrates how!

The ABS publishes data points indicating the socio-economic characteristics of a geographic area in the “Data Download” section of this webpage, under the “Standardised Variable Proportions data cube”[1]. These data points are published at the Statistical Area Level 1 (SA1) level, which is a digital boundary dividing Australia into areas with a population of approximately 200–800 people. This is a much more granular digital boundary compared to the Postcode (Zipcode) or the State digital boundary.

For the purpose of demonstration in this article, I’ll be deriving a socio-economic score based on 14 out of the 44 published data points provided in Table 1 of the data source above (I’ll explain why I select this subset later on). These are:

  • INC_LOW: Percentage of people living in households with stated annual household equivalised income between $1 and $25,999 AUD
  • INC_HIGH: Percentage of people with stated annual household equivalised income greater than $91,000 AUD
  • UNEMPLOYED_IER: Percentage of people aged 15 years and over who are unemployed
  • HIGHBED: Percentage of occupied private dwellings with four or more bedrooms
  • HIGHMORTGAGE: Percentage of occupied private dwellings paying mortgage greater than $2,800 AUD per month
  • LOWRENT: Percentage of occupied private dwellings paying rent less than $250 AUD per week
  • OWNING: Percentage of occupied private dwellings without a mortgage
  • MORTGAGE: Percentage of occupied private dwellings with a mortgage
  • GROUP: Percentage of occupied private dwellings which are group occupied private dwellings (e.g. apartments or units)
  • LONE: Percentage of occupied private dwellings which are lone person occupied private dwellings
  • OVERCROWD: Percentage of occupied private dwellings requiring one or more extra bedrooms (based on the Canadian National Occupancy Standard)
  • NOCAR: Percentage of occupied private dwellings with no cars
  • ONEPARENT: Percentage of one parent families
  • UNINCORP: Percentage of dwellings with at least one person who is a business owner

In this section, I’ll be stepping through the Python code for deriving a socio-economic score for an SA1 region in Australia using PCA.

I’ll start by loading in the required Python packages and the data.

## Load the required Python packages

### For dataframe operations
import numpy as np
import pandas as pd

### For PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

### For Visualization
import matplotlib.pyplot as plt
import seaborn as sns

### For Validation
from scipy.stats import pearsonr

## Load data

file1 = 'data/standardised_variables_seifa_2021.xlsx'

### Reading from Table 1, from row 5 onwards, for column A to AT
data1 = pd.read_excel(file1, sheet_name = 'Table 1', header = 5,
                      usecols = 'A:AT')

## Remove rows with missing values (113 out of 60k rows)

data1_dropna = data1.dropna()

An important cleaning step before performing PCA is to standardise each of the 14 data points (features) to a mean of 0 and a standard deviation of 1. This is primarily to ensure the loadings assigned to each feature by PCA (think of them as indicators of how important a feature is) are comparable across features. Otherwise, more emphasis, or a higher loading, may be given to a feature which is actually not significant, or vice versa.
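To make the point about comparable loadings concrete, here is a minimal sketch on made-up data (not the ABS data). It assumes two correlated features that carry the same signal but are measured on very different scales, and compares the PC1 loadings with and without standardisation. The variable names are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, not ABS data):
# two correlated features measured on very different scales.
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
small_scale = base + rng.normal(scale=0.5, size=1000)            # roughly unit scale
large_scale = 1000 * (base + rng.normal(scale=0.5, size=1000))   # same signal, x1000
X = np.column_stack([small_scale, large_scale])

# Without standardisation, PC1 is dominated by the large-scale feature
loadings_raw = PCA(n_components=1).fit(X).components_[0]

# With standardisation, both features receive loadings of equal magnitude
X_std = StandardScaler().fit_transform(X)
loadings_std = PCA(n_components=1).fit(X_std).components_[0]

print("PC1 loadings, raw data:         ", np.round(loadings_raw, 3))
print("PC1 loadings, standardised data:", np.round(loadings_std, 3))
```

On the raw data nearly all of the weight lands on the large-scale feature purely because of its units, which is the distortion standardisation is meant to avoid.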

Note that the ABS data source quoted above already has the features standardised. That said, for an unstandardised data source:

## Standardise data for PCA

### Take all but the first column which is merely a location indicator
data_final = data1_dropna.iloc[:,1:]

### Perform standardisation of data
sc = StandardScaler()
sc.fit(data_final)

### Standardised data
data_final = sc.transform(data_final)

With the standardised data, PCA can be performed in just a few lines of code:

## Perform PCA

pca = PCA()
pca.fit_transform(data_final)

PCA aims to represent the underlying data by Principal Components (PCs). The number of PCs provided in a PCA is equal to the number of standardised features in the data. In this instance, 14 PCs are returned.

Each PC is a linear combination of all the standardised features, differentiated only by its respective loadings on the standardised features. For example, the image below shows the loadings assigned to the first and second PCs (PC1 and PC2) by feature.

Image 1: Code to return first two Principal Components. Image by author.
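The code behind Image 1 is only available as an image, so the following is a hedged sketch of how such a loadings table could be built, not a reproduction of the original. It uses random stand-in data and made-up feature names. The layout of df_plot (features as rows, PC1 and PC2 as columns) is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 500 areas x 4 made-up features
rng = np.random.default_rng(42)
feature_names = ['FEATURE_A', 'FEATURE_B', 'FEATURE_C', 'FEATURE_D']
X = StandardScaler().fit_transform(rng.normal(size=(500, 4)))

pca_demo = PCA().fit(X)

# components_ holds one row per PC and one column per feature,
# so transpose the first two rows to get features as rows
df_plot = pd.DataFrame(pca_demo.components_[:2].T,
                       index=feature_names, columns=['PC1', 'PC2'])
print(df_plot)
```

With the real data, the index would be the 14 feature names listed earlier, and each column would hold that PC's loading for each feature.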

With 14 PCs, the code below provides a visualization of how much variation each PC explains:


## Create visualization for variations explained by each PC

exp_var_pca = pca.explained_variance_ratio_
plt.bar(range(1, len(exp_var_pca) + 1), exp_var_pca, alpha = 0.7,
        label = '% of Variation Explained', color = 'darkseagreen')

plt.ylabel('Explained Variation')
plt.xlabel('Principal Component')
plt.legend(loc = 'best')
plt.show()

As illustrated in the output visualization below, Principal Component 1 (PC1) accounts for the largest proportion of variance in the original dataset, with each following PC explaining less of the variance. To be specific, PC1 explains circa 35% of the variation within the data.

Image 2: Variation explained by PC. Image by author.
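As a complement to the bar chart, the explained variance ratios can also be read as numbers alongside their running total. This is a minimal sketch on synthetic data (six made-up features sharing one common factor), so the percentages it prints are not the ABS figures.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 6 features driven by one shared factor plus noise
rng = np.random.default_rng(1)
common = rng.normal(size=(2000, 1))
X = StandardScaler().fit_transform(common + rng.normal(size=(2000, 6)))

pca_demo = PCA().fit(X)
ratios = pca_demo.explained_variance_ratio_   # sorted from largest to smallest
cumulative = np.cumsum(ratios)                # running total, ends at 1.0

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.1%} explained, {c:.1%} cumulative")
```

The cumulative column shows how much extra variation each additional PC would buy, which is the trade-off against interpretability when deciding how many PCs to keep.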

For the purpose of demonstration in this article, PC1 is chosen as the only PC for deriving the socio-economic score, for the following reasons:

  • PC1 explains a sufficiently large share of the variation within the data on a relative basis.
  • Whilst choosing more PCs potentially allows for (marginally) more variation to be explained, it makes interpretation of the score difficult in the context of socio-economic advantage and disadvantage of a particular geographic area. For example, as shown in the image below, PC1 and PC2 may provide conflicting narratives as to how a particular feature (e.g. ‘INC_LOW’) influences the socio-economic variation of a geographic area.

## Show and compare loadings for PC1 and PC2

### Using df_plot dataframe per Image 1

sns.heatmap(df_plot, annot = False, fmt = ".1f", cmap = 'summer')
plt.show()

Image 3: Different loadings for PC1 and PC2. Image by author.

To obtain a score for each SA1, we simply multiply the standardised proportion of each feature by its PC1 loading and sum the results. This can be achieved by:


## Obtain raw score based on PC1

### Perform sum product of standardised feature and PC1 loading
pca_data_transformed = pca.fit_transform(data_final)

### Reverse the sign of the sum product above to make output more interpretable
pca_data_transformed = -1.0 * pca_data_transformed

### Convert to Pandas dataframe, and join raw score with SA1 column
pca1 = pd.DataFrame(pca_data_transformed[:,0], columns = ['Score_Raw'])
score_SA1 = pd.concat([data1_dropna['SA1_2021'].reset_index(drop = True), pca1],
                      axis = 1)

### Inspect the raw score
score_SA1.head()

Image 4: Raw socio-economic score by SA1. Image by author.

The higher the score, the more advantaged an SA1 is in terms of its access to socio-economic resources.
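The "sum product" description can be checked directly. The sketch below uses random stand-in data rather than the ABS features, and compares the first column returned by fit_transform with a manual dot product of the centred data and the PC1 loadings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 300 areas x 5 made-up standardised features
rng = np.random.default_rng(7)
X = StandardScaler().fit_transform(rng.normal(size=(300, 5)))

pca_demo = PCA()
scores = pca_demo.fit_transform(X)

# PCA centres the data internally, then projects it onto each PC.
# For PC1 that projection is the sum product of features and PC1 loadings.
manual_pc1 = (X - pca_demo.mean_) @ pca_demo.components_[0]

print(np.allclose(manual_pc1, scores[:, 0]))
```

The sign of a principal component is arbitrary, since flipping every loading explains exactly the same variance. That is why multiplying the output by -1.0 is a legitimate way to make higher scores mean more advantaged.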

How do we know the score we derived above was even remotely correct?

For context, the ABS actually publishes a socio-economic score called the Index of Economic Resources (IER), defined on the ABS website as:

“The Index of Economic Resources (IER) focuses on the financial aspects of relative socio-economic advantage and disadvantage, by summarising variables related to income and housing. IER excludes education and occupation variables as they are not direct measures of economic resources. It also excludes assets such as savings or equities which, although relevant, cannot be included as they are not collected in the Census.”

Without disclosing the detailed steps, the ABS stated in their Technical Paper that the IER was derived using the same features (14) and methodology (PCA, PC1 only) as what we performed above. That is, if we did derive the correct scores, they should be comparable against the IER scores published here (“Statistical Area Level 1, Indexes, SEIFA 2021.xlsx”, Table 4).

As the published score is standardised to a mean of 1,000 and a standard deviation of 100, we start the validation by standardising the raw score in the same way:

## Standardise raw scores

score_SA1['IER_recreated'] = (
    (score_SA1['Score_Raw'] / score_SA1['Score_Raw'].std()) * 100 + 1000
)

For comparison, we read in the published IER scores by SA1:

## Read in ABS published IER scores
## similarly to how we read in the standardised proportions of the features

file2 = 'data/Statistical Area Level 1, Indexes, SEIFA 2021.xlsx'

data2 = pd.read_excel(file2, sheet_name = 'Table 4', header = 5,
                      usecols = 'A:C')

data2.rename(columns = {'2021 Statistical Area Level 1 (SA1)': 'SA1_2021', 'Score': 'IER_2021'}, inplace = True)

col_select = ['SA1_2021', 'IER_2021']
data2 = data2[col_select]

ABS_IER_dropna = data2.dropna().reset_index(drop = True)

Validation 1: PC1 Loadings

As shown in the image below, comparing the PC1 loadings derived above against the PC1 loadings published by the ABS suggests that they differ by a constant of -45%. As this is merely a scaling difference, it doesn’t impact the derived scores, which are standardised (to a mean of 1,000 and a standard deviation of 100).

Image 5: Compare PC1 loadings. Image by author.

(You should be able to verify the ‘Derived (A)’ column against the PC1 loadings in Image 1.)
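The claim that a constant scaling of the loadings leaves the standardised score unchanged can be illustrated with a short sketch. The data, the loadings and the helper standardise_score are all made up for illustration. The factor 0.45 is used only as an example constant.

```python
import numpy as np

def standardise_score(raw, mean=1000.0, sd=100.0):
    """Rescale a raw score to the given mean and standard deviation (hypothetical helper)."""
    return (raw - raw.mean()) / raw.std() * sd + mean

# Stand-in standardised features and made-up PC1 loadings (assumptions)
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))
loadings = np.array([0.6, -0.5, 0.4, -0.3])

score_a = standardise_score(X @ loadings)
score_b = standardise_score(X @ (0.45 * loadings))    # same loadings, rescaled
score_c = standardise_score(X @ (-0.45 * loadings))   # rescaled and sign-flipped

# A positive constant has no effect after standardisation
print(np.allclose(score_a, score_b))
# A negative constant mirrors the score around the mean of 1,000
print(np.allclose(score_a - 1000, -(score_c - 1000)))
```

A positive constant cancels out entirely, while a negative one mirrors the scores around 1,000, which is the same effect as the sign reversal applied when the raw score was derived.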

Validation 2: Distribution of Scores

The code below creates a histogram for each score; their shapes look almost identical.

## Check distribution of scores

score_SA1.hist(column = 'IER_recreated', bins = 100, color = 'darkseagreen')
plt.title('Distribution of recreated IER scores')

ABS_IER_dropna.hist(column = 'IER_2021', bins = 100, color = 'lightskyblue')
plt.title('Distribution of ABS IER scores')

plt.show()

Image 6: Distribution of IER scores, recreated vs. published. Image by author.

Validation 3: IER score by SA1

As the ultimate validation, let’s compare the IER scores by SA1:


## Join the two scores by SA1 for comparison
IER_join = pd.merge(ABS_IER_dropna, score_SA1, how = 'left', on = 'SA1_2021')

## Plot scores on x-y axis.
## If scores are identical, it should show a straight line.

plt.scatter('IER_recreated', 'IER_2021', data = IER_join, color = 'darkseagreen')
plt.title('Comparison of recreated and ABS IER scores')
plt.xlabel('Recreated IER score')
plt.ylabel('ABS IER score')

plt.show()

A diagonal straight line, as shown in the output image below, supports the view that the two scores are largely identical.

Image 7: Comparison of scores by SA1. Image by author.

To add to this, the code below shows the two scores have a correlation close to 1:

Image 8: Correlation between the recreated and published scores. Image by author.
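The correlation code is only shown as an image, so here is a hedged sketch of one way to compute it with pearsonr, which was imported earlier for validation. The dataframe IER_demo below is synthetic and merely mimics the two score columns. With the real data, the joined dataframe from the previous step would take its place.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic stand-in for the joined scores (assumption, not ABS data)
rng = np.random.default_rng(11)
published = rng.normal(loc=1000, scale=100, size=500)
IER_demo = pd.DataFrame({
    'IER_2021': published,
    'IER_recreated': published + rng.normal(scale=2, size=500),
})
# Mimic an SA1 left without a recreated score after a left join
IER_demo.loc[0, 'IER_recreated'] = np.nan

# pearsonr does not handle missing values, so drop incomplete pairs first
paired = IER_demo.dropna(subset=['IER_2021', 'IER_recreated'])
corr, p_value = pearsonr(paired['IER_recreated'], paired['IER_2021'])
print(f"Pearson correlation: {corr:.4f}")
```

Dropping incomplete pairs matters here because a left join keeps every published SA1, including any that have no recreated score.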

The demonstration in this article effectively replicates how the ABS calibrates the IER, one of the four socio-economic indexes it publishes, which can be used to rank the socio-economic status of a geographic area.

Taking a step back, what we’ve achieved in essence is a reduction in the dimension of the data from 14 to 1, at the cost of losing some of the information conveyed by the data.

Dimensionality reduction techniques such as PCA are also commonly used to reduce high-dimensional spaces, such as text embeddings, to 2–3 (visualizable) Principal Components.
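As a rough illustration of that last use case, the sketch below projects 64-dimensional vectors down to two Principal Components. The vectors are random stand-ins rather than real text embeddings, so the output only demonstrates the mechanics.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in vectors (assumption): 200 items x 64 dimensions
rng = np.random.default_rng(5)
embeddings = rng.normal(size=(200, 64))

# Keep only the first two PCs so each item can be plotted as an (x, y) point
pca_2d = PCA(n_components=2)
coords = pca_2d.fit_transform(embeddings)

print(coords.shape)
print(f"Variation retained: {pca_2d.explained_variance_ratio_.sum():.1%}")
```

As with the score derived in this article, the share of variation retained indicates how much information is given up in exchange for something that can be plotted or ranked.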
