As The Economist put it, “The world’s most valuable resource is no longer oil but data.” This holds true for the finance industry as well. The control that financial companies wield over their data gives them enormous power, and the abundance and quality of the data they use change the very nature of the competition. According to Bloomberg, the “financial sector is adopting big data analytics to maintain a competitive advantage in the trading environment.” Quantitative and high-frequency trading are now ubiquitous, indispensable tools, and their full value in cryptocurrency trading is being realized. A key aspect that is still often overlooked in quantitative crypto-trading is the quality of the data used to design sophisticated prediction models.
In this era of cryptocurrency trading, those with the most data of the highest quality will surely win. In algorithmic trading applications, accuracy is one of the best indicators of the quality of a data source. It determines the execution prices, the model’s behaviour, and the model’s ability to fit the market efficiently and effectively. In the extreme case, high-frequency traders need order-by-order data to simulate market-making algorithms precisely. To determine accurately what and how much to trade at a low cost, traders want accurate data at the finest available scale and with low latency.
Many algorithmic traders incorporate massive amounts of data into their algorithms to create better pricing models and leverage large volumes of historical data to backtest their trading algorithms. Particularly with recent advances in machine learning, the data-driven approach to modelling is being emphasized more than ever before. Market behaviours are learned from black box models that recognize patterns in big data. This means that the accuracy of the data affects what the model learns and predicts. Thus, the more accurate data you have, the better you can simulate execution quality in algorithms.
Available Sources of High Quality Crypto-Trading Data
There are several companies that provide cryptocurrency market data. Kaiko, CoinAPI, and Coinscious are three well-known crypto data vendors. Most of these companies offer live and historical trade data, order books, and OHLCV (open, high, low, close, volume) data for cryptocurrencies. What has remained unknown until now, however, is the quality of the data these companies provide. The key question, therefore, is: which data vendor has the highest-quality data to give you a competitive edge?
A simple way to assess data quality is to compare the OHLCV data reported by the exchange with OHLCV data derived from each vendor’s data. In the analysis below, hourly OHLCV data for December 2018 is computed for the different data vendors. The error rates were measured over eight well-known exchanges: Binance, Bittrex, Bitfinex, Bitstamp, Bitmex, Huobi Global, Okex, and Coinbase Pro.
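The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not give its exact error formula, so the mean relative error used here is an assumption, and the function names, the `(timestamp, price, amount)` trade layout, and the sample numbers are all hypothetical.

```python
FIELDS = ("open", "high", "low", "close", "volume")


def hour_bucket(ts):
    """Truncate a UNIX timestamp (in seconds) to the start of its UTC hour."""
    return int(ts) - int(ts) % 3600


def derive_hourly_ohlcv(trades):
    """Aggregate (timestamp, price, amount) trades into hourly OHLCV bars."""
    bars = {}
    # Process trades in time order so that open and close are well defined.
    for ts, price, amount in sorted(trades, key=lambda t: t[0]):
        key = hour_bucket(ts)
        bar = bars.get(key)
        if bar is None:
            bars[key] = {"open": price, "high": price, "low": price,
                         "close": price, "volume": amount}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price
            bar["volume"] += amount
    return bars


def mean_relative_errors(reference, derived):
    """Mean of |derived - reference| / |reference| per field.

    Only hours present in both inputs are compared, and reference values
    are assumed to be nonzero (both are simplifying assumptions).
    """
    common = sorted(set(reference) & set(derived))
    if not common:
        raise ValueError("no overlapping hours to compare")
    result = {}
    for field in FIELDS:
        total = sum(abs(derived[h][field] - reference[h][field])
                    / abs(reference[h][field]) for h in common)
        result[field] = total / len(common)
    return result


# Hypothetical vendor trades and a hypothetical exchange-reported hourly bar.
example_trades = [(0, 100.0, 1.0), (1800, 110.0, 2.0), (3599, 105.0, 0.5)]
example_reference = {0: {"open": 100.0, "high": 110.0, "low": 100.0,
                         "close": 104.0, "volume": 3.5}}
example_errors = mean_relative_errors(
    example_reference, derive_hourly_ohlcv(example_trades))
```

Hours that a vendor is missing entirely are ignored by this sketch; a stricter comparison might count them as errors as well.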
Figure 1. OHLC error rates for Bitcoin (BTC), Ethereum (ETH), and Ripple (XRP). Note: given that our budget limited us to purchasing just one dataset between Kaiko and CoinAPI, we chose the more expensive one, Kaiko’s data.
Figure 2. OHLC error rates for four alternative coins (ADA, XLM, TRX, ZRX)
Coinscious data proves to be the most accurate among these data vendors for the top 3 coins (BTC, ETH, and XRP). On average, Coinscious data is 38% better than Kaiko’s data, where the relative errors on open, high, low, and close are 39%, 41%, 31%, and 37%, respectively (see Figure 1). Similar results are seen for four alternative coins (ADA, XLM, TRX, ZRX). Surprisingly, although Kaiko’s data is accurate for high and low prices, its open and close prices diverge considerably from those of Coinscious and CoinAPI.
Error In Trading Volume
In Figure 3 and Figure 4, volume error rates over time reveal the dates on which the higher error rates occur. Spikes in volume error rates occur in two scenarios: in the first, the volume and the volume error rate spike simultaneously; in the second, the volume error rate spikes but the volume does not. The former can be attributed to increased latency on exchanges as traffic increases, whereas the latter can be attributed to internal server issues.
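To make the two scenarios concrete, here is a minimal sketch that labels each day according to whether the volume, the volume error, or both spike. The spike rule (a value more than three times the series median) and the label names are assumptions chosen for illustration; the article does not say how spikes were identified.

```python
from statistics import median


def spike_flags(series, factor=3.0):
    """Flag values that exceed `factor` times the series median."""
    baseline = median(series)
    return [value > factor * baseline for value in series]


def classify_error_spikes(volumes, errors, factor=3.0):
    """Label each period by which of the two scenarios it matches."""
    labels = []
    for vol_spike, err_spike in zip(spike_flags(volumes, factor),
                                    spike_flags(errors, factor)):
        if err_spike and vol_spike:
            labels.append("traffic_latency")   # scenario 1: both spike
        elif err_spike:
            labels.append("internal_issue")    # scenario 2: error spikes alone
        else:
            labels.append("normal")
    return labels


# Hypothetical daily volumes and volume errors.
daily_volumes = [10, 11, 9, 50, 10, 10]
daily_errors = [1, 1, 1, 8, 1, 9]
daily_labels = classify_error_spikes(daily_volumes, daily_errors)
```

A median baseline is used because it is not pulled upward by the spikes themselves, which a mean would be.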
Figure 3. Absolute error between exchange volumes and data vendors’ volumes in December 2018 (lower is better). The errors were measured for BTC/USD, ETH/USD, and XRP/USD on the top 7 exchanges: Binance, HuobiPro, Bitfinex, Bitmex, OKEx, Bitstamp, and Coinbase.
Coinscious’ error rates remain relatively low compared to other vendors’ error rates. Overall, it is clear that Coinscious data has the lowest error rates with respect to volume data.
Figure 4. Absolute error between exchange volumes and data vendors’ volumes in December 2018 (lower is better). The errors were measured for the following alternative coins: ADA/USD at Bittrex, and XLM/USD, TRX/USD, and ZRX/USD at Bitfinex.
The volume quality for alternative coins (i.e., altcoins) was also considered. Eight altcoins were randomly selected from different exchanges, including NEO, TRON, XLM, EOS, LTC, ZRX, and ADA. From the figure above, CoinAPI does not perform well on volumes with respect to these altcoins.
Reason For Data Discrepancies Between Vendors
Now you must be wondering: if the exchanges provide public APIs, why would you need to purchase data? Firstly, public APIs provide only a limited history, so unless a trader has stored historical price data, they need to gather it from a third-party source. Secondly, even though exchanges provide public APIs, aggregating and preprocessing all possible cryptocurrency pairs across different exchanges is cumbersome, and arguably the most tedious step in developing a trading system. This is especially so because the interval at which data can be received gets coarser as the number of data requests grows. It is for these reasons that the aforementioned data vendors exist.
More importantly, why do discrepancies in accuracy exist across different data vendors? There are several possible reasons. One is downtime of exchange APIs. Another is that all cryptocurrency exchanges impose API rate limits, so covering the thousands of combinations of exchanges and trading pairs requires a large number of data collection clients and complicated infrastructure.
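A back-of-the-envelope calculation shows why rate limits force this kind of infrastructure. The sketch below assumes that each refresh of a trading pair costs one request, that requests are spread evenly, and that the rate limit applies per client; the function names and all numbers are hypothetical and do not describe any particular exchange.

```python
import math


def polling_interval_seconds(num_pairs, requests_per_minute, num_clients=1):
    """Seconds between two refreshes of the same pair under the rate limit."""
    requests_per_second = requests_per_minute * num_clients / 60.0
    return num_pairs / requests_per_second


def clients_needed(num_pairs, requests_per_minute, target_interval_seconds):
    """Smallest number of clients that reaches the target refresh interval."""
    required_per_minute = num_pairs * 60.0 / target_interval_seconds
    return math.ceil(required_per_minute / requests_per_minute)


# Hypothetical exchange: 300 pairs, 600 requests per minute per client.
single_client_interval = polling_interval_seconds(300, 600)
clients_for_5s = clients_needed(300, 600, 5)
```

Under these assumed numbers, a single client can refresh each pair only every 30 seconds, and six clients are needed to bring that down to 5 seconds. The interval grows linearly with the number of pairs, which is the coarsening effect described earlier.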
While many companies are collecting vast amounts of data across different exchanges and coins, the quality of that data can be obscured by its quantity. In an era of data-driven finance, success and risk depend heavily on data quality and the data operations environment. Obtaining the right trading tools and hiring talented traders certainly helps, but neither tools nor people can guarantee success if the data is flawed. The cryptocurrency market would benefit from more data quality analysis, so that traders can understand the granularity of the available datasets and know where to obtain them.
- Open, high, low, close, volume (OHLCV) prices.
- Given that our budget limits us to purchase just one dataset between Kaiko and CoinAPI, we chose the more expensive one: Kaiko’s data.
- Top 7 exchanges include: Binance, HuobiPro, Bitfinex, Bitmex, OKEx, Bitstamp, and Coinbase.
“The world’s most valuable resource is no longer oil, but data”. The Economist, 6 May 2017, https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data
“3 ways big data is changing financial trading”. Bloomberg, 5 July 2017, https://www.bloomberg.com/professional/blog/3-ways-big-data-changing-financial-trading/