Now let's see what we should do if we consider the initial time series as non-stationary. As we mentioned in the first part of the exploration, we have strong evidence of non-stationarity: the ACF decays very slowly and remains well above the significance bounds, which is indicative of a non-stationary series. Moreover, the series plot shows a significant upward trend, which also implies non-stationarity.
The step that naturally comes next is differencing (yes... as you can see we are going for an ARIMA model, and by ARIMA we mean a real ARIMA with at least one degree of differencing (d >= 1), not an "ARIMA" in the sense of "well, an AR model is technically an ARIMA(p,0,0)").
# First (non-seasonal) difference of the daily revenue series
timeseries1['Daily_Rev First Difference'] = timeseries1['price'] - timeseries1['price'].shift(1)
timeseries1.head(10)
| date | price | Daily_Rev First Difference |
|---|---|---|
| 2017-01-06 | 916.38 | NaN |
| 2017-01-07 | 1351.90 | 435.52 |
| 2017-01-08 | 709.58 | -642.32 |
| 2017-01-09 | 673.79 | -35.79 |
| 2017-01-10 | 1434.87 | 761.08 |
| 2017-01-11 | 2776.16 | 1341.29 |
| 2017-01-12 | 2234.58 | -541.58 |
| 2017-01-13 | 2505.58 | 271.00 |
| 2017-01-14 | 1112.69 | -1392.89 |
| 2017-01-15 | 2199.57 | 1086.88 |
import seaborn as sns
sns.set(rc={'figure.figsize':(15, 7)})
timeseries1['Daily_Rev First Difference'].plot(linewidth=1);
plt.ylabel('Total Daily Revenue First Differences')
axes = timeseries1['Daily_Rev First Difference'].plot(marker='.', alpha=0.9, linestyle='None', figsize=(15, 7), subplots=True)
plt.ylabel('Total Daily Revenue First Differences')
# Again testing if data is stationary
adfuller_test(timeseries1['Daily_Rev First Difference'].dropna())
ADF Test Statistic : -6.573512146103036
p-value : 7.82861871958829e-09
#Lags Used : 19
Number of Observations : 580
strong evidence against the null hypothesis (Ho), reject the null hypothesis. Data is stationary
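For context, `adfuller_test` is the small helper defined in the first part of the exploration; a minimal sketch of what such a wrapper around `statsmodels.tsa.stattools.adfuller` might look like (the exact implementation lives in the earlier notebook cells):

```python
from statsmodels.tsa.stattools import adfuller

def adfuller_test(series, signif=0.05):
    """Run the Augmented Dickey-Fuller test and print a readable summary."""
    result = adfuller(series)
    labels = ['ADF Test Statistic', 'p-value', '#Lags Used', 'Number of Observations']
    for value, label in zip(result, labels):
        print(f'{label} : {value}')
    if result[1] <= signif:
        print('strong evidence against the null hypothesis (Ho), reject the null hypothesis. Data is stationary')
    else:
        print('weak evidence against the null hypothesis, data has a unit root and is non-stationary')
```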
Now we have more reasons to believe the outcome of the Dickey-Fuller test! Our series is significantly smoother, as the trend has been eliminated. We can also see signs of heteroskedasticity, but for now we will pretend that we do not see it. Let's see whether the ACF and PACF will make us happy or whether we have to try more transformations! (Well, I can also see a seasonal pattern here, but until we look below let's keep the hope alive.)
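(For the record: if the heteroskedasticity ever stops being ignorable, the usual first remedy is a variance-stabilising transformation before differencing. A minimal sketch, not applied anywhere in this analysis; the `Log_Rev` column names are hypothetical:)

```python
import numpy as np

# Hypothetical variance-stabilising step (NOT used in this notebook):
# take logs of the raw revenue, then difference the logged series.
timeseries1['Log_Rev'] = np.log(timeseries1['price'])
timeseries1['Log_Rev First Difference'] = timeseries1['Log_Rev'].diff()
```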
plot_acf(timeseries1['Daily_Rev First Difference'].dropna())
plot_pacf(timeseries1['Daily_Rev First Difference'].dropna())
In the ACF plot it is clear that the autocorrelation function decays quickly towards zero (as a stationary MA process should), with two, maybe three, significant spikes. It is also clear that we have seasonality, since there are significant values at lags 7, 14, 21, etc., so our seasonal index is s = 7 and we are talking about weekly seasonality. The PACF plot appears roughly sinusoidal/periodic. Taking all of this into account, we proceed by eliminating the seasonality.
# Seasonal difference (lag 7) of the already first-differenced series
timeseries1['Daily_Rev Seasonal First Difference'] = timeseries1['Daily_Rev First Difference'] - timeseries1['Daily_Rev First Difference'].shift(7)
timeseries1.head(20)
| date | price | Daily_Rev First Difference | Daily_Rev Seasonal First Difference |
|---|---|---|---|
| 2017-01-06 | 916.38 | NaN | NaN |
| 2017-01-07 | 1351.90 | 435.52 | NaN |
| 2017-01-08 | 709.58 | -642.32 | NaN |
| 2017-01-09 | 673.79 | -35.79 | NaN |
| 2017-01-10 | 1434.87 | 761.08 | NaN |
| 2017-01-11 | 2776.16 | 1341.29 | NaN |
| 2017-01-12 | 2234.58 | -541.58 | NaN |
| 2017-01-13 | 2505.58 | 271.00 | NaN |
| 2017-01-14 | 1112.69 | -1392.89 | -1828.41 |
| 2017-01-15 | 2199.57 | 1086.88 | 1729.20 |
| 2017-01-16 | 3307.92 | 1108.35 | 1144.14 |
| 2017-01-17 | 3302.53 | -5.39 | -766.47 |
| 2017-01-18 | 4032.36 | 729.83 | -611.46 |
| 2017-01-19 | 3553.95 | -478.41 | 63.17 |
| 2017-01-20 | 4035.83 | 481.88 | 210.88 |
| 2017-01-21 | 2458.60 | -1577.23 | -184.34 |
| 2017-01-22 | 3619.27 | 1160.67 | 73.79 |
| 2017-01-23 | 6581.47 | 2962.20 | 1853.85 |
| 2017-01-24 | 5955.17 | -626.30 | -620.91 |
| 2017-01-25 | 8977.23 | 3022.06 | 2292.23 |
timeseries1['Daily_Rev Seasonal First Difference'].plot(linewidth=1);
plt.ylabel('Total Daily Revenue Seasonal First Differences')
axes = timeseries1['Daily_Rev Seasonal First Difference'].plot(marker='.', alpha=0.9, linestyle='None', figsize=(15, 7), subplots=True)
plt.ylabel('Total Daily Revenue Seasonal First Differences')
# Again testing if data is stationary
adfuller_test(timeseries1['Daily_Rev Seasonal First Difference'].dropna())
ADF Test Statistic : -6.773679163930242
p-value : 2.603152947688844e-09
#Lags Used : 19
Number of Observations : 573
strong evidence against the null hypothesis (Ho), reject the null hypothesis. Data is stationary
plot_acf(timeseries1['Daily_Rev Seasonal First Difference'].dropna())
plot_pacf(timeseries1['Daily_Rev Seasonal First Difference'].dropna())
Now everything is clearer! In the ACF plot it is clear that the autocorrelation function decays quickly towards zero (as a stationary MA process should), with five, maybe six, significant spikes. The seasonality has also been eliminated (so we will keep the seasonal index equal to seven, s = 7). The PACF plot again appears roughly sinusoidal/periodic, with one or two significant spikes. Taking all of this into consideration, a seasonal ARIMA (SARIMA) process should be used to explain the evolution of the total daily revenue. In order to find the best model we will use the following algorithm (a minimal fitting sketch follows the list):
1) Construct a model for the data, starting with SARIMA(2,1,6)x(0,1,0)7.
2) Estimate the parameters and compute the model residuals along with numerical indicators (AIC, BIC, p-values for the Ljung-Box statistic).
3) Check the significance of the added parameters and, if any is not significant, remove it (one at a time).
4) Check the residuals (ACF, PACF, quantile-quantile plots, histogram of the residuals).
5) If the residuals cannot be treated as white noise, check their ACF and PACF for significant lags to add to the model.
6) Once all the parameters are significant and there are no more significant lags, try different combinations by adding parameters to create new models, and compare the models using the numerical indicators from step 2.
7) Stop when the newly created models start getting worse instead of better (for example, if adding P or p parameters improves the residuals up to k parameters and degrades them after that, we stop).
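A minimal sketch of step 1 together with the step 2 and step 4 diagnostics, assuming `statsmodels`' `SARIMAX` is used as the fitting routine (the starting order is the SARIMA(2,1,6)x(0,1,0)7 named above; the differencing is handled by the `order`/`seasonal_order` arguments, so the model is fitted on the original `price` series):

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox

# Step 1: candidate model SARIMA(2,1,6)x(0,1,0)7 on the original daily revenue
model = sm.tsa.statespace.SARIMAX(timeseries1['price'],
                                  order=(2, 1, 6),
                                  seasonal_order=(0, 1, 0, 7),
                                  enforce_stationarity=False,
                                  enforce_invertibility=False)
results = model.fit(disp=False)

# Step 2/3: parameter estimates with p-values, AIC/BIC, Ljung-Box p-values on the residuals
print(results.summary())
print('AIC:', results.aic, 'BIC:', results.bic)
print(acorr_ljungbox(results.resid, lags=[7, 14, 21]))

# Step 4: residual diagnostics (ACF/PACF of the residuals come from plot_acf/plot_pacf as above)
results.plot_diagnostics(figsize=(15, 7))
```

From here the loop is just a matter of rerunning this cell with different `order`/`seasonal_order` values and comparing the indicators from step 2.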