Classification Engine
Once raw blockchain activity has been ingested and written into each user’s Tax Graph, the next challenge is to translate that low-level data into events that accountants and tax authorities actually recognise. The Classification Engine (CE) is the subsystem that performs this translation. It assigns every Atomic Tax Unit (ATU) a precise economic meaning (trade, staking reward, LP deposit, DAO distribution, NFT sale, and so on) together with a machine-readable confidence score. What follows is an in-depth, technical look at how the CE works.
1. High-Level Architecture
┌──────────────┐     TxEvents     ┌──────────────────────┐
│   Listener   │ ───────────────► │    Canonicaliser     │
└──────────────┘                  └──────────┬───────────┘
                                             │ Normalised traces
                                             ▼
                                  ┌──────────────────────┐
                                  │   Feature Builder    │
                                  └──────────┬───────────┘
                                             │ Dense feature vectors
                                             ▼
                                  ┌──────────────────────┐
                                  │  Heuristic Ruleset   │
                                  └──────────┬───────────┘
                                             ▼
                                  ┌──────────────────────┐
                                  │    ML Classifier     │
                                  └──────────┬───────────┘
                                             ▼
                                  ┌──────────────────────┐
                                  │  Conflict Resolver   │
                                  └──────────┬───────────┘
                                             │ ATU + tags + score
                                             ▼
                                  ┌──────────────────────┐
                                  │ Jurisdiction Mapper  │
                                  └──────────────────────┘
Canonicaliser normalises traces coming from heterogeneous chains.
Feature Builder extracts ~250 deterministic features (op-codes, calldata hashes, balance deltas, oracle prices, gas patterns, etc.).
Heuristic Ruleset applies fast, hand-crafted signatures for well-known protocols.
ML Classifier handles everything not caught by heuristics, using gradient-boosted trees and a small transformer for calldata embeddings.
Conflict Resolver merges results, chooses the dominant tag(s), and outputs confidence scores.
Jurisdiction Mapper converts generic tags into jurisdiction-specific tax categories (e.g., “UK: S.104 Pool” vs. “DE: private Veräußerungsgeschäft”).
2. Canonicalisation Layer
Different chains expose traces in different formats (EVM call-trees, Solana inner-instructions, Cosmos ABCI events). The Canonicaliser converts them into TaxChain IL (Intermediate Language):
CALL 0xUniswapV2Router method=swapExactTokensForETH
TRANSFER USDC 1500 → 0xRouter
TRANSFER ETH 0.71 ← 0xWETH
EMIT PairSync(...)
A hash of the IL representation becomes the classification fingerprint, ensuring idempotent processing and deterministic re-runs.
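The fingerprinting step can be sketched as follows. This is a minimal illustration, not the engine's actual code: the text does not say which hash function or serialisation is used, so SHA-256 over whitespace-normalised, newline-joined IL lines is an assumption, and `il_fingerprint` is a hypothetical name.

```python
import hashlib


def il_fingerprint(il_lines):
    """Return a hex digest that identifies a canonicalised trace.

    Assumptions (not from the text): runs of whitespace are collapsed,
    lines are joined with a newline, and SHA-256 is the hash function.
    """
    canonical = "\n".join(" ".join(line.split()) for line in il_lines)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


trace = [
    "TRANSFER USDC 1500 → 0xRouter",
    "TRANSFER ETH 0.71 ← 0xWETH",
]

# Re-running on the same IL (even with stray whitespace) yields the same
# fingerprint, which is what makes re-processing idempotent.
same_trace = ["TRANSFER  USDC 1500 → 0xRouter ", "TRANSFER ETH 0.71 ← 0xWETH"]
assert il_fingerprint(trace) == il_fingerprint(same_trace)
```

Because the fingerprint depends only on the IL text, a store keyed by it can skip any trace that has already been classified.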
3. Feature Builder
Key engineered features include:
Contract fingerprints – 4-byte method ID, byte-grind hash of first 128 B
Token metadata – ERC-20 symbol/decimals, CoinGecko categories
Flow metrics – In/out token ratio, net USD exposure change
Temporal context – Blocktime, slot, DEX TWAP position
Relational signals – Prior interactions with address X in last N txs
Gas patterns – First 16 op-codes, intrinsic gas / calldata size
These 250 features occupy <800 bytes per ATU after Run-Length + ZSTD compression.
4. Heuristic Rule Layer
Fast path for deterministic matches (≈ 70 % of traffic):
H-A1 – methodId == 0xe4e2b673 (Uniswap V3 exactInput) → Swap / Trade
H-B7 – Event Transfer where from == 0x0…0 → Token Mint / Airdrop
H-D2 – to == 0xLido AND calldata selector == submit(address) → Staking Deposit
H-E6 – Solana inner-instruction Withdraw on raydium::amm → LP Withdraw
Rules are encoded in a DSL stored on-chain; DAO governance can hot-patch without redeploying contracts.
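To make the fast path concrete, here is a minimal sketch of first-match rule evaluation in Python. It does not reproduce the on-chain DSL, which the text does not describe; the `Rule` class, the ATU dictionary layout, and the tag identifiers `SWAP` and `TOKEN_MINT_OR_AIRDROP` are all hypothetical. Only the rule IDs and the method ID come from the rules listed above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

ZERO_ADDRESS = "0x" + "0" * 40


@dataclass(frozen=True)
class Rule:
    rule_id: str
    tag: str
    matches: Callable[[dict], bool]


def _has_mint_transfer(atu: dict) -> bool:
    # A Transfer event whose sender is the zero address indicates a mint.
    return any(
        ev.get("name") == "Transfer" and ev.get("from") == ZERO_ADDRESS
        for ev in atu.get("events", [])
    )


RULES = [
    Rule("H-A1", "SWAP", lambda atu: atu.get("method_id") == "0xe4e2b673"),
    Rule("H-B7", "TOKEN_MINT_OR_AIRDROP", _has_mint_transfer),
]


def apply_heuristics(atu: dict) -> Optional[Tuple[str, str]]:
    """Return (rule_id, tag) for the first matching rule.

    None means no rule fired, so the ATU would fall through to the
    ML classifier.
    """
    for rule in RULES:
        if rule.matches(atu):
            return rule.rule_id, rule.tag
    return None
```

For example, `apply_heuristics({"method_id": "0xe4e2b673"})` returns `("H-A1", "SWAP")`, while an ATU that matches nothing returns `None`.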
5. Machine-Learning Classifier
Any ATU not labelled by heuristics flows into an XGBoost ensemble (200 trees, max-depth 9) augmented by a Mini-Transformer (2 layers, 8 heads) that embeds up to 256 bytes of calldata. Training data:
3.7 million manually labelled ATUs curated by 27 crypto-savvy accountants.
Hard negatives generated via protocol-specific fuzzing (e.g., Uniswap router calls with swapped parameters).
Monthly active-learning loop: low-confidence (<0.6) classifications are surfaced to domain experts; approved corrections feed the next training epoch.
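The selection step of the active-learning loop in the last item above might look like the sketch below. The 0.6 threshold comes from the text; the tuple layout and the least-confident-first ordering are assumptions made for illustration.

```python
REVIEW_THRESHOLD = 0.6  # from the text: classifications below 0.6 are surfaced


def select_for_review(predictions, threshold=REVIEW_THRESHOLD):
    """Pick low-confidence predictions for expert labelling.

    predictions: iterable of (atu_id, tag, confidence) tuples.
    Returns the entries below the threshold, least confident first
    (the ordering is an assumption, not stated in the text).
    """
    low = [p for p in predictions if p[2] < threshold]
    return sorted(low, key=lambda p: p[2])


batch = [
    ("atu-1", "SWAP", 0.97),
    ("atu-2", "LP_DEPOSIT", 0.55),
    ("atu-3", "AIRDROP_INCOME", 0.31),
]
queue = select_for_review(batch)  # atu-3 first, then atu-2
```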
Performance (5-fold CV on hold-out exchanges):
Macro-F1 (all tags): 0.972
AUC (binary taxable): 0.991
Inference latency: 1.8 ms
6. Conflict Resolution & Confidence Scoring
Situations with overlapping tags (e.g., LP Deposit vs. Transfer) are disambiguated by:
Precedence graph – Domain experts assign partial orderings (Deposit > Transfer if paired token deltas ≈ zero).
Bayesian averaging – Combines heuristic certainty (1.0) with ML probability.
Consistency checks – Verifies conservation of value; rejects tags that break accounting equality.
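Two of these steps can be sketched in a few lines of Python. The text does not give the averaging formula or the tolerance used for the conservation check, so the weighted average, the `heuristic_weight` default, and the 1 % relative tolerance below are all assumptions chosen for illustration.

```python
def combine_confidence(ml_prob, heuristic_hit, heuristic_weight=0.5):
    """Blend heuristic certainty (1.0 when a rule fired) with the ML probability.

    Assumption: a plain weighted average stands in for the Bayesian
    averaging step, whose exact form the text does not specify.
    """
    if not 0.0 <= ml_prob <= 1.0:
        raise ValueError("ml_prob must be in [0, 1]")
    if not heuristic_hit:
        return ml_prob
    return heuristic_weight * 1.0 + (1.0 - heuristic_weight) * ml_prob


def conserves_value(usd_in, usd_out, fees_usd, rel_tolerance=0.01):
    """Consistency check: value in should equal value out plus fees.

    The relative tolerance is a hypothetical parameter that absorbs
    price-oracle noise.
    """
    gap = abs(usd_in - (usd_out + fees_usd))
    return gap <= rel_tolerance * max(abs(usd_in), 1e-9)
```

A tag whose implied flows fail `conserves_value` would be rejected before the confidence score is reported.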
Final output:
ATU_ID: 0xf3…
tag_primary: LP_WITHDRAW
tag_secondary: FEE
confidence: 0.94
taxable_hint: YES
Events with confidence <0.5 are flagged for manual review or user annotation inside the TaxChain dashboard.
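The output record and the review threshold can be modelled as a small data class. The field names mirror the sample output above and the 0.5 threshold comes from the text; the class itself and the `needs_review` property are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MANUAL_REVIEW_THRESHOLD = 0.5  # from the text: confidence < 0.5 is flagged


@dataclass(frozen=True)
class Classification:
    atu_id: str
    tag_primary: str
    tag_secondary: Optional[str]
    confidence: float
    taxable_hint: bool

    @property
    def needs_review(self) -> bool:
        # Flag for manual review or user annotation in the dashboard.
        return self.confidence < MANUAL_REVIEW_THRESHOLD


# The sample record from the text; the ATU ID is truncated there too.
example = Classification("0xf3…", "LP_WITHDRAW", "FEE", 0.94, True)
```

Here `example.needs_review` is `False`, since 0.94 is above the threshold.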
7. Jurisdiction Mapper
Generic tags (LP_WITHDRAW, NFT_SALE, AIRDROP_INCOME) are mapped to country-specific buckets via the Jurisdictional Rulebook Layer (JRL):
LP_WITHDRAW | Private Veräußerung → cap-gain | Non-taxable until fiat conversion | BIC Art. 150-VH quinquies
AIRDROP_INCOME | Sonstige Einkünfte (§22) | Income taxed on receipt | Bénéfices Non Commerciaux
NFT_SALE | Cap-gain (1-year rule) | Wealth-tax & income split | Flat-tax 30 %
The mapper returns both the local category code and the applicable cost-basis method (FIFO, LIFO, average).
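A lookup of this kind could be sketched as a dictionary keyed by tag and jurisdiction. This is an illustration only: it assumes the German-language entries above belong to DE (Section 1 uses "DE: private Veräußerungsgeschäft" as an example), and the cost-basis values are placeholders, since the text does not state which method applies where. `RULEBOOK` and `map_tag` are hypothetical names.

```python
from typing import Optional, Tuple

# (generic tag, jurisdiction) -> (local category, cost-basis method).
# Categories are taken from the table above; the "FIFO" entries are
# illustrative placeholders, not a statement of the actual rulebook.
RULEBOOK = {
    ("LP_WITHDRAW", "DE"): ("Private Veräußerung", "FIFO"),
    ("AIRDROP_INCOME", "DE"): ("Sonstige Einkünfte (§22)", "FIFO"),
}


def map_tag(tag: str, jurisdiction: str) -> Optional[Tuple[str, str]]:
    """Return (local_category, cost_basis_method), or None if unmapped.

    A None result would need a fallback, for example manual review.
    """
    return RULEBOOK.get((tag, jurisdiction))
```

Keeping the rulebook as data rather than code means a jurisdiction can be added or amended without touching the classifier itself.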
8. Extensibility & Plugin System
Protocol Plug-ins – Any developer can deploy a WASM module that exposes fn classify(trace) -> TagSet. Accepted modules earn 20 % of protocol fees attributable to their tag matches.
Version Pinning – Each classification output references the plug-in version hash; downgrades are impossible, ensuring reproducible audits.
Sandboxing – Plug-ins run in WASI containers with 50 ms CPU-time and 1 MiB mem caps; deterministic syscalls only.
9. Determinism & Auditability
Every classification decision is hashed (Keccak256(tag || features || pluginHash)) and appended to the Tax Graph.
A Merkle-proof API lets auditors verify that a given ATU was classified exactly the same way at any historical time, even after rule-pack upgrades.
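The hashing and proof check can be sketched with the Python standard library. Two caveats apply: `hashlib.sha3_256` is used as a stand-in because the standard library has no Keccak-256 (the two differ in padding and produce different digests), and the tree layout (pairwise hashing, duplicating the last node on odd levels) is an assumption, since the text does not describe how the Merkle tree is built. All function names are hypothetical.

```python
import hashlib


def _h(data: bytes) -> bytes:
    # Stand-in only: SHA3-256 is NOT the Keccak-256 named in the text.
    return hashlib.sha3_256(data).digest()


def decision_hash(tag: str, features: bytes, plugin_hash: bytes) -> bytes:
    # tag || features || pluginHash, concatenated as in the text.
    return _h(tag.encode("utf-8") + features + plugin_hash)


def merkle_proof(leaves, index):
    """Return (proof, root) for leaves[index]; leaves must be non-empty.

    Each proof step is (sibling_hash, sibling_is_left).
    """
    proof, level = [], list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd levels
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof, level[0]


def verify_proof(leaf: bytes, proof, root: bytes) -> bool:
    """Recompute the path from leaf to root and compare."""
    node = leaf
    for sibling, sibling_is_left in proof:
        node = _h(sibling + node) if sibling_is_left else _h(node + sibling)
    return node == root


leaves = [
    decision_hash(tag, b"features-%d" % i, b"plugin-v1")
    for i, tag in enumerate(["SWAP", "LP_WITHDRAW", "NFT_SALE"])
]
proof, root = merkle_proof(leaves, 2)
assert verify_proof(leaves[2], proof, root)
```

An auditor holding only the root, the leaf, and the proof can confirm that a decision was recorded without access to the rest of the Tax Graph.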
10. Roadmap Enhancements
ZK-Tag Proofs – Succinct zk-SNARK circuits will allow private proof of “these ATUs are ‘capital gains’ under DE law” without revealing the underlying transaction IDs.
Intent Overlay – Integration with ERC-7521 will classify off-chain intents before they settle on-chain, enabling pre-trade tax simulations.
Natural-Language Explanations – LLM-powered module (offline, deterministic prompts) to produce human-readable rationales for each tag, aiding advisor transparency.
The Classification Engine transforms raw blockchain noise into structured, jurisdiction-aware tax intelligence with cryptographic auditability and low-millisecond inference latency. Its hybrid design (rule-based certainty, machine-learning flexibility, and on-chain verifiability) ensures TaxChain remains accurate, transparent, and future-proof as Web3 evolves.