113

GitHub - medvedev1088/ethereum-etl: Python scripts for ETL (extract, transform a...

 5 years ago
source link: https://github.com/medvedev1088/ethereum-etl
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

README.md

Ethereum ETL

Join the chat at https://gitter.im/ethereum-eth Build Status

Export blocks and transactions (Schema, Reference):

> python export_blocks_and_transactions.py --start-block 0 --end-block 500000 \
--provider-uri https://mainnet.infura.io --blocks-output blocks.csv --transactions-output transactions.csv

Export ERC20 and ERC721 transfers (Schema, Reference):

> python export_token_transfers.py --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output token_transfers.csv

Export receipts and logs (Schema, Reference):

> python export_receipts_and_logs.py --transaction-hashes transaction_hashes.txt \
--provider-uri https://mainnet.infura.io --receipts-output receipts.csv --logs-output logs.csv

Export ERC20 and ERC721 token details (Schema, Reference):

> python export_tokens.py --token-addresses token_addresses.csv \
--provider-uri https://mainnet.infura.io --output tokens.csv

LIMITATIONS

Table of Contents

Schema

blocks.csv

Column Type number bigint hash hex_string parent_hash hex_string nonce hex_string sha3_uncles hex_string logs_bloom hex_string transactions_root hex_string state_root hex_string receipts_root hex_string miner address difficulty numeric total_difficulty numeric size bigint extra_data hex_string gas_limit bigint gas_used bigint timestamp bigint transaction_count bigint

transactions.csv

Column Type hash hex_string nonce bigint block_hash hex_string block_number bigint transaction_index bigint from_address address to_address address value numeric gas bigint gas_price bigint input hex_string

token_transfers.csv

Column Type token_address address from_address address to_address address value numeric transaction_hash hex_string log_index bigint block_number bigint

receipts.csv

Column Type transaction_hash hex_string transaction_index bigint block_hash hex_string block_number bigint cumulative_gas_used bigint gas_used bigint contract_address address root hex_string status bigint

logs.csv

Column Type log_index bigint transaction_hash hex_string transaction_index bigint block_hash hex_string block_number bigint address address data hex_string topics string

contracts.csv

Column Type address address bytecode hex_string function_sighashes string is_erc20 boolean is_erc721 boolean

tokens.csv

Column Type address address symbol string name string decimals bigint total_supply numeric

You can find column descriptions in https://github.com/medvedev1088/ethereum-etl-airflow

Note: for the address type all hex characters are lower-cased. boolean type can have 2 values: True or False.

LIMITATIONS

  • contracts.csv and tokens.csv files don’t include contracts created by message calls (a.k.a. internal transactions). We are working on adding support for those.
  • In case the contract is a proxy, which forwards all calls to a delegate, interface detection doesn’t work, which means is_erc20 and is_erc721 will always be false for proxy contracts.
  • The metadata methods (symbol, name, decimals, total_supply) for ERC20 are optional, so around 10% of the contracts are missing this data. Also some contracts (EOS) implement these methods but with wrong return type, so the metadata columns are missing in this case as well.
  • token_transfers.value, tokens.decimals and tokens.total_supply have type STRING in BigQuery tables, because numeric types there can't handle 32-byte integers. You should use cast(value as FLOAT64) (possible loss of precision) or safe_cast(value as NUMERIC) (possible overflow) to convert to numbers.

Exporting the Blockchain

  1. Install python 3.5 or 3.6 https://www.python.org/downloads/

  2. You can use Infura if you don't need ERC20 transfers (Infura doesn't support eth_getFilterLogs JSON RPC method). For that use -p https://mainnet.infura.io option for the commands below. If you need ERC20 transfers or want to export the data ~40 times faster, you will need to set up a local Ethereum node:

  3. Install geth https://github.com/ethereum/go-ethereum/wiki/Installing-Geth

  4. Start geth. Make sure it downloaded the blocks that you need by executing eth.syncing in the JS console. You can export blocks below currentBlock, there is no need to wait until the full sync as the state is not needed (unless you also need contracts bytecode and token details). You can export blocks below currentBlock, there is no need to wait until the full sync as the state is not needed.

  5. Clone Ethereum ETL and install the dependencies:

    > git clone https://github.com/medvedev1088/ethereum-etl.git
    > cd ethereum-etl
    > pip install -r requirements.txt
  6. Export all:

    > ./export_all.sh -h
    Usage: ./export_all.sh -s <start_block> -e <end_block> -b <batch_size> -p <provider_uri> [-o <output_dir>]
    > ./export_all.sh -s 0 -e 5499999 -b 100000 -p file://$HOME/Library/Ethereum/geth.ipc -o output

    The result will be in the output subdirectory, partitioned in Hive style:

    output/blocks/start_block=00000000/end_block=00099999/blocks_00000000_00099999.csv
    output/blocks/start_block=00100000/end_block=00199999/blocks_00100000_00199999.csv
    ...
    output/transactions/start_block=00000000/end_block=00099999/transactions_00000000_00099999.csv
    ...
    output/token_transfers/start_block=00000000/end_block=00099999/token_transfers_00000000_00099999.csv
    ...

Should work with geth and parity, on Linux, Mac, Windows. If you use Parity you should disable warp mode with --no-warp option because warp mode does not place all of the block or receipt data into the database https://wiki.parity.io/Getting-Synced Tested with Python 3.6, geth 1.8.7, Ubuntu 16.04.4

If you see weird behavior, e.g. wrong number of rows in the CSV files or corrupted files, check this issue: https://github.com/medvedev1088/ethereum-etl/issues/28

Export in 2 Hours

You can use AWS Auto Scaling and Data Pipeline to reduce the exporting time to a few hours. Read this article for details https://medium.com/@medvedev1088/how-to-export-the-entire-ethereum-blockchain-to-csv-in-2-hours-for-10-69fef511e9a2

Running in Windows

Additional steps:

  1. Install Visual C++ Build Tools https://landinghub.visualstudio.com/visual-cpp-build-tools

  2. Install Git Bash with Git for Windows https://git-scm.com/download/win

  3. Run in Git Bash:

    >  ./export_all.sh -s 0 -e 999999 -b 100000 -p 'file:\\\\.\pipe\geth.ipc' -o output

Command Reference

All the commands accept -h parameter for help, e.g.:

> python export_blocks_and_transactions.py -h

usage: export_blocks_and_transactions.py [-h] [-s START_BLOCK] -e END_BLOCK
                                         [-b BATCH_SIZE] --provider-uri PROVIDER_URI
                                         [-w MAX_WORKERS]
                                         [--blocks-output BLOCKS_OUTPUT]
                                         [--transactions-output TRANSACTIONS_OUTPUT]

Export blocks and transactions.

For the --output parameters the supported types are csv and json. The format type is inferred from the output file name.

export_blocks_and_transactions.py
> python export_blocks_and_transactions.py --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --blocks-output blocks.csv --transactions-output transactions.csv

Omit --blocks-output or --transactions-output options if you want to export only transactions/blocks.

You can tune --batch-size, --max-workers for performance.

export_token_transfers.py

The API used in this command is not supported by Infura, so you will need a local node. If you want to use Infura for exporting ERC20 transfers refer to extract_token_transfers.py

> python export_token_transfers.py --start-block 0 --end-block 500000 \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --batch-size 100 --output token_transfers.csv

Include --tokens <token1> <token2> to filter only certain tokens, e.g.

> python export_token_transfers.py --start-block 0 --end-block 500000 --provider-uri file://$HOME/Library/Ethereum/geth.ipc \
--output token_transfers.csv --tokens 0x86fa049857e0209aa7d9e616f7eb3b3b78ecfdb0 0x06012c8cf97bead5deae237070f9587f8e7a266d

You can tune --batch-size, --max-workers for performance.

export_receipts_and_logs.py

First extract transaction hashes from transactions.csv (Exported with export_blocks_and_transactions.py):

> python extract_csv_column.py --input transactions.csv --column transaction_hash --output transaction_hashes.txt

Then export receipts and logs:

> python export_receipts_and_logs.py --transaction-hashes transaction_hashes.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --receipts-output receipts.csv --logs-output logs.csv

Omit --receipts-output or --logs-output options if you want to export only logs/receipts.

You can tune --batch-size, --max-workers for performance.

Upvote this feature request https://github.com/paritytech/parity/issues/9075, it will make receipts and logs export much faster.

extract_token_transfers.py

First export receipt logs with export_receipts_and_logs.py.

Then extract transfers from the logs.csv file:

> python extract_token_transfers.py --logs logs.csv --output token_transfers.csv

You can tune --batch-size, --max-workers for performance.

export_contracts.py

First extract contract addresses from receipts.csv (Exported with export_receipts_and_logs.py):

> python extract_csv_column.py --input receipts.csv --column contract_address --output contract_addresses.txt

Then export contracts:

> python export_contracts.py --contract-addresses contract_addresses.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output contracts.csv

You can tune --batch-size, --max-workers for performance.

export_tokens.py

First extract token addresses from contracts.json (Exported with export_contracts.py):

> python filter_items.py -i contracts.json -p "item['is_erc20'] or item['is_erc721']" | \
python extract_field.py -f address -o token_addresses.txt

Then export ERC20 / ERC721 tokens:

> python export_tokens.py --token-addresses token_addresses.txt \
--provider-uri file://$HOME/Library/Ethereum/geth.ipc --output tokens.csv

You can tune --max-workers for performance.

get_block_range_for_date.py
> python get_block_range_for_date.py --provider-uri=https://mainnet.infura.io --date 2018-01-01
4832686,4838611
get_keccak_hash.py
> python get_keccak_hash.py -i "transfer(address,uint256)"
0xa9059cbb2ab09eb219583f4a59a5d0623ade346d962bcd4e46b11da047c9049b

Running Tests

> export ETHEREUM_ETL_RUN_SLOW_TESTS=True
> pytest -vv

Querying in Amazon Athena

  • Upload the files to S3:
> cd output
> aws s3 sync . s3://<your_bucket>/ethereumetl/export --region ap-southeast-1
CREATE DATABASE ethereumetl;

Tables for Parquet Files

Read this article on how to convert CSVs to Parquet https://medium.com/@medvedev1088/converting-ethereum-etl-files-to-parquet-399e048ddd30

Note that DECIMAL type is limited to 38 digits in Hive https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-decimal so values greater than 38 decimals will be null.

Querying in Google BigQuery

Refer to https://github.com/medvedev1088/ethereum-etl-airflow for the instructions.

Public Dataset

You can query the data that's updated daily in the public BigQuery dataset https://medium.com/@medvedev1088/ethereum-blockchain-on-google-bigquery-283fb300f579


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK