Querying Bitcoin data using RDF and AllegroGraph
Introduction
Bitcoin operates by maintaining the data about all value transfers in a single database, whose integrity and consistency is ensured by a set of cryptographic protocols, the most important of which is called PoW (Proof-of-Work). This data forms a graph with two general types of relations:
- block membership relations introduce ordering into the transaction data by building a timestamped sequence (chain) of groups (blocks) of transactions, and are merely used to verify the integrity and consistency of the database according to the network consensus rules;
- transaction chaining relations provide monetary properties for Bitcoin system by creating links between chunks of value, tracking their movement from the moment of generation during mining to point in time when they are used.
By definition, Bitcoin chain data is an immutable set of public records that represent its complete transaction history, which makes it an important global resource, while its transparent graph structure makes it an easy subject of various analysis methods. These methods are actively researched from two opposite points of view: as means of undermining privacy of the transaction data (see Bitfury’s Crystal) and as means of protecting against that (see CoinJoin and Samourai’s Whirlpool technologies).
One of the simplest means of graph analysis is graph querying with specialized querying languages like SPARQL. Moreover, there exists a set of powerful standard tools for representing and exploring graph data with such languages: RDF - Resource Description Framework.
In order to be able to use SPARQL for loading and querying Bitcoin database, we need to build an RDF model which defines the properties of Bitcoin entities as well as relations between them. To make our approach as general as possible, we will model Bitcoin data directly from its native representation. This model can then be used as a guideline for converting the Bitcoin data into a set of triples suitable for storing in a graph database. We will be using AllegroGraph for the purpose of storing the triples and as a querying engine.
Block data model
In all Turtle examples we assume the following standard namespaces are available:
RDF(http://www.w3.org/1999/02/22-rdf-syntax-ns#);RDFS(http://www.w3.org/2000/01/rdf-schema#);XSD(http://www.w3.org/2001/XMLSchema#);OWL(http://www.w3.org/2002/07/owl#).
These resources together form a powerful standard language for describing RDF models and in particular provide all necessary definitions to build the Bitcoin model.
Bitcoin block is a container for transactions, whose main consensus property is a hash of the previous block (some properties omitted for simplicy):
<Block>
rdf:type rdfs:Class;
<hash>
rdf:type rdfs:Property;
rdfs:comment "Hex-encoded double-hash of the block header that is used in PoW to chain blocks.";
rdfs:domain <Block>;
rdfs:range xsd:string.
<height>
rdf:type rdfs:Property;
rdfs:comment "Height of the block is an index of the block in the chain, starting from 0.";
rdfs:domain <Block>;
rdfs:range xsd:integer.
<transaction>
rdf:type rdfs:Property;
rdfs:comment "Transaction-Block membership relation.";
rdfs:domain <Block>;
rdfs:range <Transaction>.
<prevBlock>
rdf:type rdfs:Property;
rdfs:comment "Introduce ordering into set of Bitcoin blocks.";
rdfs:domain <Block>;
rdfs:range <Block>.
<nextBlock>
rdf:type rdfs:Property;
rdfs:comment "Helper inverse of prevBlock property.";
rdfs:domain <Block>;
rdfs:range <Block>;
owl:inverseOf <prevBlock>.
Value in Bitcoin is represented by a set of entities called transaction
outputs - pairs (<amount> <lock script>) where <lock script> is a certain
condition expressed in a simple stack language called Bitcoin script:
<TxOutput>
rdf:type rdfs:Class;
<amount>
rdf:type rdfs:Property;
rdfs:comment "Amount of atomic units locked in this UTXO.";
rdfs:domain <TxOutput>;
rdfs:range xsd:integer.
<lockScript>
rdf:type rdfs:Property;
rdfs:comment "Script that describes an unlocking condition of the given output.";
rdfs:domain <TxOutput>;
rdfs:range <Script>.
Value from an output can be transfered by providing an entity called
transaction input - a triple (<transaction id> <output index> <unlock
script>) such that <lock script> and <unlock script> when combined, must
evaluate to 1, forming a proof of ownership over the given value (some
properties omitted for brevity):
<TxInput>
rdf:type rdfs:Class;
<outputHash>
rdf:type rdfs:Property;
rdfs:comment "Hash of the transaction whose output is spent by this input.";
rdfs:domain <TxInput>;
rdfs:range xsd:string.
<outputIndex>
rdf:type rdfs:Property;
rdfs:comment "Index of the transcation output which is spent by this input.";
rdfs:domain <TxInput>;
rdfs:range xsd:integer.
<unlockScript>
rdf:type rdfs:Property;
rdfs:comment "Script that is appended to locking script in order to unlock the UTXO.";
rdfs:domain <TxInput>;
rdfs:range <Script>.
All the value in the Bitcoin system is contained in a UTXO (Unspent Transaction Output) set - a set of transaction outputs, for which no inputs exist within the system so far.
Finally, Bitcoin transaction is just a special record that represents a destruction of a subset of existing UTXOs and creation of new UTXOs under the condition that the sum of the amounts in the destroyed UTXOs is larger than or equal to the sum of created ones (some properties omitted for brevity):
<Transaction>
rdf:type rdfs:Class;
<txid>
rdf:type rdfs:Property;
rdfs:comment "Hash of the transaction which is used to uniquely identify it.";
rdfs:domain <Transaction>;
rdfs:range xsd:string.
<input>
rdf:type rdfs:Property;
rdfs:comment "Input-Transaction membership relation.";
rdfs:domain <Transaction>;
rdfs:range <Input>.
<output>
rdf:type rdfs:Property;
rdfs:comment "Output-Transaction membership relation.";
rdfs:domain <Transaction>;
rdfs:range <Output>.
The following Turtle example demonstrates how this RDF model can be used to represent complete chain database (block given is a genesis block - the first block in the mainnet Bitcoin chain; script strings omitted for brevity):
@prefix : <https://raw.githubusercontent.com/franzinc/agraph-examples/master/data/bitcoin/model.ttl#>
@prefix btc: <bitcoin://>
btc:blk0
:height 0;
:hash "000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f";
:time 1231006505;
:version 1;
:transaction btc:4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b.
btc:4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b
:lockTime 0;
:input [:unlockScript "...".];
:output [:amount 5000000000; :lockScript "...".].
Loading Bitcoin data
In order to be able to run queries, we first need to prepare a repository with
chain data. We will be using a simple Python tool for loading the chain data
into the AllegroGraph instance, which can be found in the
data/bitcoin directory of the
agraph-examples repository on GitHub along with the model
we described above.
The following examples assume AllegroGraph triple store and assume it is already installed and running on the target machine. The following AG instance settings settings are assumed as well:
- host:
localhost(default); - port:
10035(default); - username:
aguser; - password:
agpassword.
We need to make sure we have an access to a running bitcoind instance with RPC
port open. We assume following bitcoind settings:
- host:
localhost(default); - port:
8332(default); - username:
btcuser; - password:
btcpassword.
First, we will install the tool by cloning the
agraph-examples repository, setting up a virtual
environment and installing the dependencies:
git clone https://github.com/franzinc/agraph-examples
cd agraph-examples/data/bitcoin
python3 -m venv .
source ./bin/activate
pip3 install -r requirements.txt
Now we are good to go. The following command starts the process of loading
bitcoin chain data into a clean AG repository named bitcoin using 4 loader
processes:
./convert.py \
--source=http://btcuser:btcpassword@localhost:8332 \
--destination=http://aguser:agpassword@localhost:10035 \
--name=bitcoin \
--workers=4 \
--clear
Note that loading the whole Bitcoin chain takes quite a lot of time and space. Using the following setup:
-
machine running the loader - 2 x 4-core Intel(R) Xeon(R) L5420 @ 2.50GHz, 32 Gb of memory,
-
machine running the AllegroGraph instance - 2 x 6-core AMD Opteron(tm) 2439 @ 2.80GHz SE, 64 Gb of memory,
we were able to load 80% of all chain data in approximately 14 days. The disk
space required for the 80% chain database, which contains 11.5 billion triples,
is 1.1 Tb, around 4 times more than for a raw binary Bitcoin database maintained
by the bitcoind node, which at the moment of writing takes around 270
Gb. Given these numbers, for experimenting purposes, it might make sense to load
only a subset of blocks we are interested in by using the
--start-height/--end-height parameters for the convert tool:
./convert.py \
--source=http://btcuser:btcpassword@localhost:8332 \
--destination=http://aguser:agpassword@localhost:10035 \
--name=bitcoin \
--workers=4 \
--clear \
--start-height=570000 \
--end-height=580000
Example queries
Following are the examples of using SPARQL to extract different information about block data:
- number of known blocks:
PREFIX : <https://raw.githubusercontent.com/franzinc/agraph-examples/master/data/bitcoin/model.ttl#> SELECT (COUNT(*) AS ?count) WHERE { ?b a btcm:Block. } - total number of transactions:
PREFIX : <https://raw.githubusercontent.com/franzinc/agraph-examples/master/data/bitcoin/model.ttl#> SELECT (COUNT(*) AS ?count) WHERE { ?tx a btcm:Transaction. } - transaction in block 570001:
PREFIX : <https://raw.githubusercontent.com/franzinc/agraph-examples/master/data/bitcoin/model.ttl#> SELECT ?txid WHERE { ?b a :Block. ?b :height "570001"^^xsd:int. ?b :transaction ?tx. ?tx :txid ?txid. } - transactions sending more than 1000 BTC:
PREFIX : <https://raw.githubusercontent.com/franzinc/agraph-examples/master/data/bitcoin/model.ttl#> SELECT ?tx WHERE { ?b a :Block. ?b :transaction ?tx. ?tx :output ?out. ?out :amount ?amt. } GROUP BY ?tx HAVING (SUM(?amt) > 100000000000) - transactions sending BTC to Pirate Bay’s address:
PREFIX : <https://raw.githubusercontent.com/franzinc/agraph-examples/master/data/bitcoin/model.ttl#> SELECT ?tx WHERE { ?tx :output ?out. ?out :lockScript ?s. FILTER REGEX (?s, "<tpb address>"). }