In this paper we show how the basic tools of statistics and probability theory can be applied to a real-world problem. The purpose of the assignment is to predict the price of Bitcoin from the gathered data. The research is based on multiple regression analysis. Several models are developed, and the best one is selected.
In 2008, someone using the pen name Satoshi Nakamoto published a paper titled “Bitcoin: A Peer-to-Peer Electronic Cash System”. The goal of this new peer-to-peer (“P2P”) system was to help solve the problems of double spending and trust present in standard electronic commerce, which relies on financial institutions to act as trusted third parties in virtually all transactions. This paper and the financial system it established led to a series of fiat currencies whose policy is governed not by a government or financial institution but by a publicly available algorithm. Unlike standard fiat currencies such as the dollar and the euro, Bitcoin was developed with no inherent value and is not backed by any government or institution. The question I seek to explore is this: if bitcoins are not attached to any particular economy, what determines their price (exchange rate)? Because the few studies on this topic offer conflicting information, I have chosen to base my model on the supply of and demand for money. The variables used in this model are total BTC in circulation, estimated daily transaction volume in both USD and BTC, and Bitcoin days destroyed. In an attempt to bring global factors into the model, I also included the OPEC price of oil in USD and the price per troy ounce of gold in USD.
Because Bitcoin is a relatively new phenomenon, there is only a limited amount of research into how the price of a bitcoin is formed.
In his study “What Can Be Expected from the Bitcoin?”, Dennis van Wijk suggests that Bitcoin operates as a fiat currency in a finite economy. This implies that the first person holding the fiat currency would strongly favor it, since he can only gain. Subsequent agents, however, do not know their place in the economy, and the last person to trade and receive this money would suffer a loss, since all other agents have already traded at a gain and would be unlikely to take on the risk and additional cost of buying back from the last buyer. He therefore argues that bitcoins are traded under a scenario elaborated by Kovenock and De Vries: as long as the number of agents expands enough that the probability of being the last person to hold the fiat currency is small, so that the reward from trade exceeds the potential loss from accepting the currency, the problem of a finite fiat currency can be overcome.
Van Wijk argues that the only value of a bitcoin stems from the expectations of agents. Because of incomplete information, on their part and on ours, and because of uncertainty, these expectations cannot be measured or quantified directly, so we must turn to macroeconomic and financial indicators and study their impact on the value of a bitcoin to uncover them. His analysis and regressions found that several financial indicators, including the value of the Dow Jones, the price of oil, and the euro-to-dollar exchange rate, have a significant impact on the value of a bitcoin in the long run.
Pavel Ciaian, in his article "The Economics of BitCoin Price Formation", shows that the market share of Bitcoin has recently been growing rapidly, which helps explain the rapid rise in its price. The author examines the influence of several factors on the price of Bitcoin in the short and long term, using VAR estimation to assess that influence.
The analysis confirms that the price of Bitcoin is exposed to market forces. The process of price formation for Bitcoin is similar to that of a currency, and the main factor affecting its value is supply and demand. The author notes that speculation also affects the value of Bitcoin. On the positive side, speculation reduces the specific risks associated with individual market participants and helps maintain the liquidity of Bitcoin. On the negative side, speculation can cause price bubbles as a consequence of rapid price growth.
Martis Buchholz, in his study "Information, Price Volatility, and Demand for Bitcoin", used ARCH/GARCH models to explore Bitcoin prices. The author studied the dynamics of the Bitcoin price and concluded that market participants initially regarded price volatility as a positive development, because it allowed them to make more money quickly. After the peak of the price bubble had passed, however, the market panicked: many participants realized that they could lose their assets to fluctuations in the price of Bitcoin, and the market reacted quickly by driving prices down. Thus, the price of Bitcoin before and after the peak reflects the impact of volatility on demand. The study describes the price situation in the market well and supports the existence of a Bitcoin price bubble.
Multiple regression is widely used in economics: in problems of demand and stock returns, in the study of production cost functions, in macroeconomic calculations, and in a number of other econometric questions. It is currently one of the most common methods in econometrics. The main purpose of multiple regression is to build a model with a large number of factors, determining both the influence of each factor separately and their cumulative impact on the modeled indicator. Building a multiple regression equation begins with the specification of the model, which involves two sets of problems: selecting the factors and choosing the type of regression equation. Which factors are included depends primarily on the researcher's understanding of how the modeled indicator relates to other economic phenomena.
I initially ran a standard linear regression, which produced some significant results; however, there were problems with autocorrelation, heteroscedasticity, and non-normality.
A log-linear model is known to mitigate the impact of heteroscedasticity and departures from normality. To address these issues I took the logs of the independent and dependent variables, and I also included the differences between each day's log value and the previous day's to reduce autocorrelation in the dependent variable. Including the day-to-day differences of the logs in addition to the log of each variable allows us to observe short-run as well as long-run effects. Bitcoin Days Destroyed is a measure that accounts for the amount of time between when a bitcoin is added to a wallet and when it is next spent. I used this measure as an inverse proxy for the velocity of bitcoins: a higher value of Bitcoin Days Destroyed implies that the coins were held for longer and therefore have a lower velocity. In a market at equilibrium there is an implied inverse relationship between the price of bitcoins and both the quantity supplied and the velocity.
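The log and log-difference transformation described above can be sketched as follows. This is a minimal illustration and not the code used for the analysis: the function names and sample prices are hypothetical, the natural logarithm is assumed, and treating non-positive values as missing is an assumption about how days with a zero value were handled.

```python
import math

def safe_log(x):
    """Natural log of x, or None when x is missing or not positive."""
    return math.log(x) if x is not None and x > 0 else None

def log_and_diff(series):
    """Return (logs, log_diffs) for a list of daily values.

    log_diffs[t] = log(x[t]) - log(x[t-1]). The first day has no previous
    value, so its difference is None; a difference is also None whenever
    either of the two logs is missing.
    """
    logs = [safe_log(x) for x in series]
    diffs = [None]
    for t in range(1, len(logs)):
        if logs[t] is None or logs[t - 1] is None:
            diffs.append(None)
        else:
            diffs.append(logs[t] - logs[t - 1])
    return logs, diffs

# Hypothetical daily values, for illustration only (not from the data set).
prices = [100.0, 110.0, 99.0, 0.0, 120.0]
logs, diffs = log_and_diff(prices)
```

With this convention a single zero in a series produces one missing log and two missing differences, because the difference on the zero day and the difference on the following day both depend on the missing value.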
The following variables are used in the analysis:
Total BTC in Circulation – the total number of bitcoins in circulation in the system
BTC-Estimated Trade Volume USD – the total estimated bitcoin trade volume, in US dollars
Estimated Transaction Volume BTC – the estimated bitcoin transaction volume, in BTC
OPEC oil Price USD – oil price, according to OPEC (per barrel, in US dollars)
Gold Price USD – price of gold (per ounce, in US dollars)
Bitcoin Daily MKT Price USD – the price of bitcoin, in US dollars
Bitcoin Days Destroyed – a weighted measure of bitcoin transactions in which bitcoins that have been held longer count for more.
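To make the last variable concrete: Bitcoin Days Destroyed for a day is usually computed as the sum, over all coins spent that day, of the amount spent multiplied by the number of days those coins had been held. The sketch below follows that definition; the function name and the sample transactions are hypothetical and are not taken from the data set.

```python
def bitcoin_days_destroyed(spent_outputs):
    """Sum of (amount in BTC * days held) over the coins spent in one day."""
    return sum(amount * days_held for amount, days_held in spent_outputs)

# Hypothetical spends: (amount in BTC, days since the coins last moved).
spent_today = [(2.0, 10), (0.5, 100), (1.0, 0)]

# 2.0 * 10 + 0.5 * 100 + 1.0 * 0 = 70.0
bdd = bitcoin_days_destroyed(spent_today)
```

In this example the half bitcoin held for 100 days contributes more than the two bitcoins held for 10 days, which is why the measure works as an inverse proxy for velocity.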
Of these variables, I treat “Bitcoin Daily MKT Price USD” as the dependent variable and the others as independent variables.
The data were retrieved from the Bitcoin price data of the Financial, Economics and Society Database (www.quandl.com) and from the official log of bitcoin data (www.blockchain.info). The data set consists of 1304 daily observations; each observation records the values of the listed variables on a particular day, from 22.11.2011 to 17.06.2015.
Data analysis begins with descriptive statistics, which show the basic measures of central tendency and variability for each variable and help in understanding its distribution.
For the initial variables, the descriptive statistics are given in Table #1 and the scatter plots in the Appendix.
The high values of skewness and kurtosis point to problems with the distribution of the data: some of the variables may not be normally distributed. Two common tests of normality are the Kolmogorov-Smirnov and Shapiro-Wilk tests. The results of both tests are given in Table #2 in the Appendix.
The result of the testing is as follows: since all p-values are less than 0.001, the distribution of each variable differs significantly from the normal distribution, even at the 1% level of significance.
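As a rough illustration of the skewness and kurtosis measures discussed above, the sketch below computes simple moment-based versions. The function name and the sample data are hypothetical. Statistical packages often apply small-sample corrections, so their output can differ slightly from these formulas; for a normal distribution both measures are close to zero.

```python
def skewness_and_excess_kurtosis(xs):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    return skew, excess_kurt

# Hypothetical samples, for illustration only.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 1.0, 10.0]

skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_right, kurt_right = skewness_and_excess_kurtosis(right_tailed)
```

The symmetric sample has zero skewness, while the sample with one large value has clearly positive skewness, the same pattern that a few extreme trading days produce in the volume variables.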
Autocorrelation in the residuals violates one of the main assumptions of the method of least squares, namely that the residuals of the regression equation are random. One possible solution is to estimate the model parameters by generalized least squares. When a multiple regression equation is built on time series data, there is also the problem of multicollinearity among the included factors if those factors contain a trend.
The Durbin-Watson statistic for the initial regression is 0.379118, which is quite low: as a rule of thumb, a Durbin-Watson value below 1.0 may signal a problem with autocorrelation.
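For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. It is approximately 2(1 - r), where r is the first-order autocorrelation of the residuals, so a value of 0.379118 corresponds to r of roughly 0.81. The sketch below uses hypothetical residual series, not the residuals of the fitted model.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum (e_t - e_{t-1})^2 / sum e_t^2."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals, for illustration only.
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # neighbours tend to agree
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbours always disagree

dw_persistent = durbin_watson(persistent)    # 4 / 6, well below 2
dw_alternating = durbin_watson(alternating)  # 20 / 6, well above 2
```

Values near 2 indicate no first-order autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.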
The greatest difficulties in applying multiple regression arise in the presence of multicollinearity, when more than two factors are linked by a linear dependence. Multicollinearity may mean that certain factors always act in unison. As a result, the variation in the original data is no longer completely independent, and it is impossible to evaluate the effect of each factor in isolation.
The stronger the multicollinearity, the less reliable the least-squares estimate of how the explained variation is distributed among the individual factors.
Including multicollinear factors in the model is undesirable for the following reasons:
the parameters of the multiple regression become difficult to interpret, and they lose their economic meaning;
the parameter estimates are unreliable, show large standard errors, and change with the number of observations, which makes the model unsuitable for analysis and forecasting.
Moving to reduced-form equations can help eliminate the problem of multicollinearity. For this purpose, the factor in question is replaced in the regression equation by its expression from another equation.
One of the most common ways to detect multicollinearity is to compute the variance inflation factor (VIF). As a rule of thumb, VIF > 10 may signal multicollinearity. In our case all VIFs are lower than 10, so there is no indication of a multicollinearity problem.
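The VIF of a factor equals 1 / (1 - R²), where R² comes from regressing that factor on all the other factors. The sketch below is a minimal pure-Python illustration of this definition under stated assumptions: the helper names and the two sample predictors are hypothetical, and the regression is solved through the normal equations, which is adequate for a small example but is not how a statistical package would normally do it.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(y, columns):
    """R^2 of an OLS regression of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(design[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vifs(columns):
    """VIF for each column: 1 / (1 - R^2 from regressing it on the others)."""
    result = []
    for j, col in enumerate(columns):
        others = columns[:j] + columns[j + 1:]
        result.append(1.0 / (1.0 - r_squared(col, others)))
    return result

# Hypothetical predictors, for illustration only (not from the data set).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]  # strongly, but not perfectly, related to x1
vif_values = vifs([x1, x2])
```

With only two predictors the VIF reduces to 1 / (1 - r²), where r is their correlation. Here r = 29/35, so both VIFs are about 3.19, which is below the rule-of-thumb threshold of 10 despite the visible correlation.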
The resulting model is given in Table #3.
These issues should be addressed, and changing the form of the regression equation can make them less severe, so I performed the regression analysis for a log-log model. Table #4 presents the descriptive statistics for the new variables, and scatter plots of the pairs of variables are given in the Appendix.
The Durbin-Watson statistic is now 0.567256, which is an improvement.
All p-values of the normality tests are still lower than 0.01, so I conclude that the data are not normally distributed.
The VIF for each variable is lower than 10, so again there is no indication of multicollinearity.
The form of the model is given in Table #6 in the Appendix.
However, some factors are insignificant, because their p-values are higher than 0.05.
The first regression was estimated on the initial data without taking logarithms. According to the ANOVA result, the overall regression is significant, that is, the coefficients are jointly significant (F=880.31, p<0.001). Moreover, every coefficient is individually significant (each p-value is less than 0.001). The regression equation explains approximately 77.21% of the variance of the bitcoin price variable (R-squared = 0.772127). However, there are problems with the data: they are not normally distributed (the Kolmogorov-Smirnov and Shapiro-Wilk p-values are all less than 0.05), and there is autocorrelation (Durbin-Watson statistic 0.379118). No multicollinearity is indicated.
The second regression includes the logarithms of the previous factors plus the differences between each day's log value and the previous day's, in order to reduce autocorrelation in the dependent variable. According to the ANOVA result, the overall regression is significant, that is, the coefficients are jointly significant (F=981.0947, p<0.001). However, some variables are not individually significant: the p-values of 5 variables (see the regression output) are higher than 0.05. The regression equation explains approximately 92.73% of the variance of the log of the bitcoin price (R-squared = 0.927301). The data are again not normally distributed (the Kolmogorov-Smirnov and Shapiro-Wilk p-values are all less than 0.05), and autocorrelation is reduced but may still be a problem (the Durbin-Watson statistic is 0.567256, which is still lower than 1). There is also no multicollinearity issue in this case.
Summary and Conclusions
Finally, we have obtained a log-linear model that describes the relationship between the price of bitcoin and other factors such as the prices of oil and gold, the total number of BTC in circulation, and the difference factors. The log-linear model has fewer issues that may bias the result. Its ANOVA result is better, because the F statistic is higher. Its coefficient of determination is higher than that of the initial model, so more of the variance is explained. It also has a higher Durbin-Watson statistic, so autocorrelation is less of a problem than in the initial model.
Buchholz, Martis. "Information, Price Volatility, and Demand for Bitcoin." 2012. Web. 25 June 2015. <http://academic.reed.edu/economics/parker/s12/312/finalproj/Bitcoin.pdf>.
Ciaian, Pavel. "The Economics of BitCoin Price Formation." Web. 25 June 2015.
Stock, James H., and Mark W. Watson. Introduction to Econometrics. 3rd ed. Boston: Addison-Wesley, 2011. Print.
Verbeek, Marno. A Guide to Modern Econometrics. Chichester: Wiley, 2000. Print.
Van Wijk, Dennis. "What Can Be Expected from the Bitcoin?" 18 July 2013. Web. 25 June 2015.
Table #1. Descriptive statistics for the initial variables

Variable                    N     N*  Mean       SE Mean   StDev      Variance
Total BTC in Circulation    1304  0   11481745   50856     1837169    3.37519E+12
BTC-Estimated Trade Volu    1304  0   35639054   1309531   47306535   2.23791E+15
Estimated Transaction Vo    1304  0   208772     9567      345614     1.19449E+11
OPEC oil Price USD          1304  0   97.017     0.515     18.621     346.748
Gold Price USD              1304  0   1433.3     5.41      195.3      38132.0
Bitcoin Daily MKT Price     1304  0   233.94     6.93      250.24     62618.68
BTC days destroyed          1304  0   5742621    262807    9493859    9.01334E+13

Variable                    Minimum   Q1         Median     Q3         Maximum
Total BTC in Circulation    7712350   10145675   11648650   13066863   14287425
BTC-Estimated Trade Volu    262169    2625503    24683670   50322894   578807930
Estimated Transaction Vo    39123     116408     165657     222871     5825066
OPEC oil Price USD          41.500    88.850     104.970    108.590    124.640
Gold Price USD              1142.0    1261.5     1373.5     1614.9     1895.0
Bitcoin Daily MKT Price     2.29      11.96      126.43     388.00     1151.00
BTC days destroyed          0         2243474    3432415    5892877    173297972

Variable                    IQR        Skewness   Kurtosis
Total BTC in Circulation    2921188    -0.34      -0.97
BTC-Estimated Trade Volu    47697391   4.03       28.50
Estimated Transaction Vo    106463     11.07      141.91
OPEC oil Price USD          19.740     -1.22      0.52
Gold Price USD              353.4      0.37       -1.26
Bitcoin Daily MKT Price     376.03     1.00       0.08
BTC days destroyed          3649404    9.94       139.09

Table #4. Descriptive statistics for the log and log-difference variables

Variable                    N     N*  Mean        SE Mean    StDev      Variance
Log Total BTC in Circula    1304  0   16.243      0.00465    0.168      0.0282
Log BTC-Estimated Trade     1304  0   16.384      0.0481     1.737      3.017
Log Estimated Transactio    1304  0   12.007      0.0152     0.549      0.302
Log OPEC oil Price USD      1304  0   4.5520      0.00633    0.2285     0.0522
Log Gold Price USD          1304  0   7.2588      0.00372    0.1344     0.0181
Log Bitcoin Daily MKT Pr    1304  0   4.3768      0.0506     1.8262     3.3350
Log BTC days destroyed      1303  1   15.167      0.0216     0.780      0.609
Diff log Total BTC in MK    1304  0   0.000472    0.000007   0.000237   0.000000
diffBTC log-Tradevol-usd    1304  0   0.0024      0.0103     0.3712     0.1378
diffEstimated log Transa    1304  0   -0.0012     0.0102     0.3672     0.1348
diff log OPEC               1304  0   -0.000122   0.000350   0.012657   0.000160
diff log gold               1304  0   -0.000049   0.000306   0.011063   0.000122
Diff MKT log price usd      1304  0   0.00357     0.00143    0.05149    0.00265
Diff log Bitcoin Days De    1302  2   -424        424        15292      233841993

Variable                    Minimum     Q1          Median     Q3         Maximum
Log Total BTC in Circula    15.859      16.133      16.271     16.386     16.475
Log BTC-Estimated Trade     12.477      14.779      17.023     17.734     20.176
Log Estimated Transactio    10.574      11.665      12.017     12.314     15.578
Log OPEC oil Price USD      3.7257      4.4879      4.6537     4.6876     4.8254
Log Gold Price USD          7.0405      7.1405      7.2251     7.3871     7.5470
Log Bitcoin Daily MKT Pr    0.8544      2.4826      4.8400     5.9623     7.0484
Log BTC days destroyed      12.569      14.624      15.050     15.590     18.971
Diff log Total BTC in MK    0.000000    0.000298    0.000362   0.000705   0.001190
diffBTC log-Tradevol-usd    -1.8718     -0.2260     -0.0028    0.2333     1.9670
diffEstimated log Transa    -1.9049     -0.2315     -0.0043    0.2142     2.0229
diff log OPEC               -0.077387   -0.006269   0.000095   0.006295   0.072274
diff log gold               -0.095962   -0.005194   0.000000   0.005813   0.048387
Diff MKT log price usd      -0.47831    -0.01296    0.00084    0.01788    0.35879
Diff log Bitcoin Days De    -551781     -0          -0         0          4

Variable                    IQR        Skewness   Kurtosis
Log Total BTC in Circula    0.253      -0.58      -0.70
Log BTC-Estimated Trade     2.956      -0.52      -1.02
Log Estimated Transactio    0.649      1.41       7.45
Log OPEC oil Price USD      0.1996     -1.60      1.79
Log Gold Price USD          0.2466     0.25       -1.34
Log Bitcoin Daily MKT Pr    3.4796     -0.39      -1.39
Log BTC days destroyed      0.965      0.85       1.36
Diff log Total BTC in MK    0.000406   0.98       -0.52
diffBTC log-Tradevol-usd    0.4593     0.10       1.92
diffEstimated log Transa    0.4457     0.09       2.16
diff log OPEC               0.012564   -0.02      4.72
diff log gold               0.011008   -0.76      7.34
Diff MKT log price usd      0.03084    -0.83      16.81
Diff log Bitcoin Days De    1          -36.08     1302.00