Data Analytics and Dashboarding
Stat testing, visualisation and dashboarding
Faisal Jina
22 February 2021
1 Dashboard
The dashboard to accompany this report can be found at: https://faisaljina.shinyapps.io/dataanalyticsdashboard/
2 Background
This project will examine the trends of the housing market in Britain. The dataset used will be the most recent iteration of the UK House Price Index data. This dataset is released monthly as a CSV file, available from the gov.uk website. This is read in to R and the structure examined.
3 Exploratory Analysis
## Rows: 131,222
## Columns: 54
## $ Date <chr> "01/01/2004", "01/02/2004", "01/03/2004", "0...
## $ RegionName <chr> "Aberdeenshire", "Aberdeenshire", "Aberdeens...
## $ AreaCode <chr> "S12000034", "S12000034", "S12000034", "S120...
## $ AveragePrice <dbl> 81693.67, 81678.76, 83525.10, 84333.68, 8637...
## $ Index <dbl> 40.86421, 40.85676, 41.78032, 42.18478, 43.2...
## $ IndexSA <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ X1m.Change <dbl> NA, -0.01824784, 2.26048321, 0.96807069, 2.4...
## $ X12m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ AveragePriceSA <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ SalesVolume <int> 388, 326, 453, 571, 502, 525, 652, 512, 497,...
## $ DetachedPrice <dbl> 122490.1, 121280.9, 123395.4, 122334.0, 1244...
## $ DetachedIndex <dbl> 43.61098, 43.18047, 43.93332, 43.55543, 44.3...
## $ Detached1m.Change <dbl> NA, -0.9871659, 1.7435088, -0.8601624, 1.769...
## $ Detached12m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ SemiDetachedPrice <dbl> 70563.16, 70804.42, 72689.07, 74484.23, 7663...
## $ SemiDetachedIndex <dbl> 40.82189, 40.96146, 42.05176, 43.09029, 44.3...
## $ SemiDetached1m.Change <dbl> NA, 0.3419153, 2.6617665, 2.4696403, 2.89121...
## $ SemiDetached12m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ TerracedPrice <dbl> 55319.63, 55720.08, 57362.86, 59193.40, 6120...
## $ TerracedIndex <dbl> 38.30567, 38.58295, 39.72049, 40.98803, 42.3...
## $ Terraced1m.Change <dbl> NA, 0.7238695, 2.9482802, 3.1911554, 3.39463...
## $ Terraced12m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ FlatPrice <dbl> 48016.07, 49030.18, 50349.45, 51736.22, 5323...
## $ FlatIndex <dbl> 42.43355, 43.32975, 44.49564, 45.72118, 47.0...
## $ Flat1m.Change <dbl> NA, 2.1120161, 2.6907230, 2.7543015, 2.88773...
## $ Flat12m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ CashPrice <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ CashIndex <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Cash1m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Cash12m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ CashSalesVolume <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ MortgagePrice <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ MortgageIndex <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Mortgage1m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Mortgage12m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ MortgageSalesVolume <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ FTBPrice <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ FTBIndex <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ FTB1m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ FTB12m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ FOOPrice <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ FOOIndex <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ FOO1m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ FOO12m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NewPrice <dbl> 88436.14, 88606.45, 90296.91, 90319.88, 9198...
## $ NewIndex <dbl> 40.26725, 40.34479, 41.11451, 41.12496, 41.8...
## $ New1m.Change <dbl> NA, 0.19257621, 1.90783778, 0.02543242, 1.84...
## $ New12m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NewSalesVolume <int> 103, 107, 140, 180, 167, 164, 163, 130, 142,...
## $ OldPrice <dbl> 81043.95, 80965.30, 82903.24, 84003.99, 8622...
## $ OldIndex <dbl> 40.88337, 40.84369, 41.82130, 42.37659, 43.4...
## $ Old1m.Change <dbl> NA, -0.0970528, 2.3935490, 1.3277553, 2.6412...
## $ Old12m.Change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ OldSalesVolume <int> 285, 219, 313, 391, 335, 361, 489, 382, 355,...
Fig. 1 - Full Dataset structure
Figure 1 shows the data that is recorded on house prices across the UK. This is evidently a large dataset that will need reducing for the purpose of a pragmatic analysis. Looking at the head and tail of this data, it can be seen that a new data row is available each month for each region, with data running from 1968 to 2020.
4 Aims of Analysis
This report will look at the current trends in house prices of the 3 countries of Great Britain. A 10-year period is selected to allow for establishing an idea of longer-term trends, which can be further filtered as required for short-term trends. Columns of interest selected include Average Price and Region data, as well as the 12-month Percentage Change column. This will help with like-for-like comparison between regions.
Date | RegionName | AreaCode | AveragePrice | X12m.Change | SalesVolume |
---|---|---|---|---|---|
2010-12-01 | Aberdeenshire | S12000034 | 179526.1 | 2.8978449 | 335 |
2011-01-01 | Aberdeenshire | S12000034 | 174067.8 | 0.6663602 | 214 |
2011-02-01 | Aberdeenshire | S12000034 | 174751.1 | 0.8346159 | 189 |
2011-03-01 | Aberdeenshire | S12000034 | 174139.6 | 2.1862236 | 329 |
2011-04-01 | Aberdeenshire | S12000034 | 178523.2 | 3.2983790 | 401 |
2011-05-01 | Aberdeenshire | S12000034 | 181756.7 | 3.3686994 | 363 |
Fig. 2 - First rows of data
Date | RegionName | AreaCode | AveragePrice | X12m.Change | SalesVolume | |
---|---|---|---|---|---|---|
51235 | 2020-06-01 | Yorkshire and The Humber | E12000003 | 168183.9 | 2.765173 | 4654 |
51236 | 2020-07-01 | Yorkshire and The Humber | E12000003 | 169321.7 | 2.625537 | 5011 |
51237 | 2020-08-01 | Yorkshire and The Humber | E12000003 | 172339.2 | 4.102662 | 5146 |
51238 | 2020-09-01 | Yorkshire and The Humber | E12000003 | 173057.1 | 4.600707 | 5594 |
51239 | 2020-10-01 | Yorkshire and The Humber | E12000003 | 177412.5 | 6.729787 | NA |
51240 | 2020-11-01 | Yorkshire and The Humber | E12000003 | 180855.5 | 9.691853 | NA |
Fig. 3 - Last rows of data
Checking the head and tail of the filtered data, it now appears to be uniform with a manageable and relevant structure from which to continue the analysis. As the focus is on the 3 countries of Great Britain, these 3 regions as well as Great Britain are extracted from the the data. This leaves us with a dataset of 480 rows x 4 columns.
## Rows: 480
## Columns: 4
## $ Date <date> 2010-12-01, 2011-01-01, 2011-02-01, 2011-03-01, 2011...
## $ RegionName <fct> England, England, England, England, England, England,...
## $ AveragePrice <dbl> 176035.9, 174442.3, 173810.6, 173045.6, 175490.5, 174...
## $ `%12m.Change` <dbl> 1.090987796, -0.008770055, -0.819974837, -0.983825702...
Fig. 4 - Glimpse of the filtered data
To observe the changes in house prices over time, it is useful to view this data diagrammatically.
Fig. 5 - Regional House Price Graph
Figure 5 gives an idea of what the data looks like in each region of interest. In particular, it shows that the 3 countries largely follow the general trend shown by Great Britain. This is examined in more detail in section 5.
5 Long-term Trend
For long-term modelling, it is useful for businesses to know the general trend of the national housing market to inform pricing forecasts where real estate is involved. Whilst the average national house price appears to be largely linear with respect to time, a simple linear regression between these variables can help to determine if a linear model does indeed represent house prices well over the long-term.
5.1 Linear Regression Hypotheses
Null hypothesis: There is not a linear relationship between the average house price and time
Alternative hypothesis: There is a linear relationship between the average house price and time
##
## Call:
## lm(formula = AveragePrice ~ Date, data = dfGB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10542.4 -3716.3 2.8 4336.7 10125.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.090e+05 7.538e+03 -27.73 <2e-16 ***
## Date 2.465e+01 4.490e-01 54.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5186 on 118 degrees of freedom
## Multiple R-squared: 0.9623, Adjusted R-squared: 0.962
## F-statistic: 3014 on 1 and 118 DF, p-value: < 2.2e-16
Fig. 6 - Simple Linear Model
5.2 Results
The linear regression model and coefficients were all significant at the 95% level (p < 0.05). The null hypothesis is therefore rejected and the alternative hypothesis accepted, and this confirms with a high degree of confidence that a linear relationship exists between the average house price and time. The adjusted R-squared value shows that 96% of the variance in the average house price is captured by this simple linear model based on this data, which is a very high result.
5.3 Outcome
Based on this linear trend, one could reasonably assume persistence of this trend for forecasting purposes. Whilst there is volatility around this trendline, this is a long-term trend with a high R-squared, so there is high degree of confidence in extrapolating this going forward.
A company using forecasting models on house prices would typically make HPI calculations at least every quarter, so it is suggested that this trendline should be recalculated at the same time to look for changes over time, and update forecasting models as appropriate. The linear model is displayed in the graph below (also available in the dashboard).
Fig. 7 - Graph of British House Price Trend
6 Regional Variation
Whilst house prices in Great Britain were linear over the longer term, the regional graph shows that the 3 countries may exhibit slightly different trends in the shorter term, particularly more recently. Boxplots are plotted to help identify any trend deviations in the 12-monthly price changes.
Fig. 8 - Regional Boxplots of Average House Price - 10 Year
Fig. 9 - Regional Boxplots of Average House Price - 2 Year
As evidenced by the long-term regional graph (Fig.5) and 10-year boxplot (Fig. 8), all 3 countries appear to show a similar pattern of average house price change over the long term, with England and Great Britain appearing to be most similar. However, from the boxplots it is evident that as the time period is narrowed to a more recent subset e.g. the past 2 years (Fig. 9), it appears that Wales increasingly stands out as potentially having greater average house price changes than the other regions. (Boxplots also available on dashboard - select a region of the graph to see the corresponding boxplots).
It may be useful to examine Wales and England to determine if the difference observed is significant - if so, these markets may need to be treated differently in financial modelling.
6.1 Short-term Trend
As the housing stock in Wales and England may be different, these regions are treated as independent groups. A 3-year subset of the 12-month changes is taken for these regions - this ensures the period is small enough to be relevant, but large enough to have enough data to draw reasonably reliable conclusions.
6.2 Normality Test
Firstly, a Shapiro-Wilk test is made to see if the differences between these groups is normal, to determine the testing going forward.
Null hypothesis: The distribution is normally distributed.
Alternative hypothesis: The distribution is not normally distributed.
##
## Shapiro-Wilk normality test
##
## data: diff$`%12m.Change`
## W = 0.98124, p-value = 0.7865
Fig. 10 - Test of Normality on ‘Wales minus England’ data
The Shapiro-Wilk p-value >> 0.05, so the null hypothesis is accepted, which indicates that the distribution of the difference between these groups is not significantly different from the normal. We therefore assume normality, and a two sample t-test can be run.
6.3 T-test
The two-sample t-test examines the difference in means of the 12-monthly percentage change of average house prices between Wales and England. This is run as unpaired, as these groups are assumed to be independent, and a Welch test is used as the variance in these groups may be different.
Null hypothesis: There is no difference between the means of the two groups.
Alternative hypothesis: There is a difference between the means of the two groups.
##
## Welch Two Sample t-test
##
## data: Wal3 and Eng3
## t = 4.9388, df = 67.993, p-value = 5.386e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.089308 2.566344
## sample estimates:
## mean of x mean of y
## 4.037337 2.209511
Fig. 11 - Two-sample Unpaired t-test
The t-test p-value << 0.05, indicating that we should reject the null hypothesis and accept the alternative hypothesis - that the difference in means between these groups is significant at the 95% level. The mean of the Wales group was also calculated as being higher than the England group.
6.4 Outcome
The result of the t-test informs us that over the past 3 years, the average 12-month percentage change in Wales has been greater than that in England. Whilst this was not apparent in the Average House Price graph, this may be due to the absolute difference in house price between the regions. This disparity can be resolved by indexing these regions both to an arbitrary value of 100 at a point 4 years ago (4 years chosen as previous tests were on a 3 year sample using a 12-month price change).
6.4.1 Indexed House Prices - Wales vs England
Fig. 12 - Graph of Indexed House Prices - Wales and England
Figure 12 (also available on dashboard) shows that Wales’ average house prices have indeed risen faster (in percentage terms) than those in England over this time period, and the tests show that this difference was significant.
The effect of this finding for businesses is that shorter-term growth in these 2 markets can not each be assumed to be uniform across Britain. The differences between the markets of Wales and England may need to be modelled separately, which in turn will impact forecasts and risk profiles of any assets/liabilities linked to real estate in these regions. Again, this finding should be monitored over time to see if the discrepancy between these regions persists into the future. The skewing of the housing market in this way may present an opportunity for businesses to take advantage of increased demand in Wales with the view of a greater increase in property values over time relative to England.
7 References
- ‘flexdashboard: Easy interactive dashboards for R’. Flexdashboard. Available at: https://rmarkdown.rstudio.com/flexdashboard/. Accessed: 20 Feb 2021.
- ‘ggplot2 Brushing’. JJ Allaire. Available at: https://jjallaire.shinyapps.io/shiny-ggplot2-brushing/. Accessed: 21 Feb 2021.
- ‘ggplot2 Quick Reference: colour (and fill)’. Sape Research Group. Available at: http://sape.inf.usi.ch/quick-reference/ggplot2/colour. Accessed: 18 Feb 2021.
- ‘Markdown Basics’. RStudio. Available at: https://rmarkdown.rstudio.com/authoring_basics.html. Accessed: 22 Feb 2021
- ‘Paired vs Unpaired T-Test: Differences, Assumptions and Hypotheses’. Nicole Gleichmann. 14 Feb 2020. Available at: https://www.technologynetworks.com/informatics/articles/paired-vs-unpaired-t-test-differences-assumptions-and-hypotheses-330826. Accessed: 17 Feb 2021.
- ‘Smoothed conditional means’. ggplot2. Available at: https://ggplot2.tidyverse.org/reference/geom_smooth.html. Accessed: 16 Feb 2021.
- ‘Statistical tools for high-throughput data analysis’. STHDA. Available at: http://www.sthda.com/english/wiki/unpaired-two-samples-t-test-in-r. Accessed: 16 Feb 2021.
- ‘Themes’. ggplot2. Available at: https://ggplot2-book.org/polishing.html. Accessed: 16 Feb 2021.
- ‘UK House Price Index: data downloads November 2020’. Gov.uk. 20 Jan 2021. Available at: https://www.gov.uk/government/statistical-data-sets/uk-house-price-index-datadownloads-november-2020. Accessed: 15 Feb 2021.