Classification Engine

Once raw blockchain activity has been ingested and written into each user’s Tax Graph, the next challenge is to translate that low-level data into events that accountants and tax authorities actually recognise. The Classification Engine (CE) is the subsystem that performs this translation. It assigns every Atomic Tax Unit (ATU) a precise economic meaning (trade, staking reward, LP deposit, DAO distribution, NFT sale, and so on) together with a machine-readable confidence score. What follows is an in-depth, technical look at how the CE works.


1. High-Level Architecture

┌──────────────┐     TxEvents     ┌──────────────────┐
│ Listener     │ ───────────────► │ Canonicaliser    │
└──────────────┘                  └────────┬─────────┘
                                           │  Normalised traces
                                           ▼
                                ┌──────────────────────┐
                                │ Feature Builder      │
                                └──────────┬───────────┘
                                           │  Dense feature vectors
                                           ▼
                                ┌──────────────────────┐
                                │ Heuristic Ruleset    │
                                └──────────┬───────────┘
                                           │
                                           ▼
                                ┌──────────────────────┐
                                │ ML Classifier        │
                                └──────────┬───────────┘
                                           │
                                           ▼
                                ┌──────────────────────┐
                                │ Conflict Resolver    │
                                └──────────┬───────────┘
                                           │  ATU + tags + score
                                           ▼
                                ┌──────────────────────┐
                                │ Jurisdiction Mapper  │
                                └──────────────────────┘

  • Canonicaliser normalises traces coming from heterogeneous chains.

  • Feature Builder extracts ~250 deterministic features (op-codes, calldata hashes, balance deltas, oracle prices, gas patterns, etc.).

  • Heuristic Ruleset applies fast, hand-crafted signatures for well-known protocols.

  • ML Classifier handles everything not caught by heuristics, using gradient-boosted trees and a small transformer for calldata embeddings.

  • Conflict Resolver merges results, chooses the dominant tag(s), and outputs confidence scores.

  • Jurisdiction Mapper converts generic tags into jurisdiction-specific tax categories (e.g., “UK: S.104 Pool” vs. “DE: private Veräußerungsgeschäft”).


2. Canonicalisation Layer

Different chains expose traces in different formats (EVM call-trees, Solana inner-instructions, Cosmos ABCI events). The Canonicaliser converts them into TaxChain IL (Intermediate Language):

CALL      0xUniswapV2Router       method=swapExactTokensForETH
TRANSFER  USDC 1500  →  0xRouter
TRANSFER  ETH  0.71  ←  0xWETH
EMIT      PairSync(...)

A hash of the IL representation becomes the classification fingerprint, ensuring idempotent processing and deterministic re-runs.
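The fingerprinting step can be illustrated with a minimal sketch. It assumes the IL is available as text lines and that whitespace is not significant; the hash function (SHA-256) and the helper names `canonical_il` and `fingerprint` are hypothetical, since this section does not specify them.

```python
import hashlib

def canonical_il(lines):
    """Collapse whitespace runs and drop blank lines (assumed normalisation)."""
    return "\n".join(" ".join(line.split()) for line in lines if line.strip())

def fingerprint(lines):
    """Hex digest of the canonical IL text. SHA-256 is an assumption."""
    return hashlib.sha256(canonical_il(lines).encode("utf-8")).hexdigest()

trace = [
    "TRANSFER  USDC 1500  →  0xRouter",
    "TRANSFER  ETH  0.71  ←  0xWETH",
]
# Same instructions with different spacing and a stray blank line.
reformatted = ["TRANSFER USDC 1500 → 0xRouter", "", "TRANSFER ETH 0.71 ← 0xWETH"]
changed = ["TRANSFER USDC 1501 → 0xRouter", "TRANSFER ETH 0.71 ← 0xWETH"]

same = fingerprint(trace) == fingerprint(reformatted)   # layout does not matter
differs = fingerprint(trace) != fingerprint(changed)    # content does
```

Because the digest depends only on the canonical text, re-running the pipeline on the same trace yields the same fingerprint, which is the property that makes processing idempotent.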


3. Feature Builder

Key engineered features include:

Feature Group          Examples (per ATU)
---------------------  ------------------------------------------------
Contract fingerprints  4-byte method ID, byte-grind hash of first 128 B
Token metadata         ERC-20 symbol/decimals, Coingecko categories
Flow metrics           In/out token ratio, net USD exposure change
Temporal context       Blocktime, slot, DEX TWAP position
Relational signals     Prior interactions with address X in last N txs
Gas patterns           First 16 op-codes, intrinsic gas / calldata size

Together, the roughly 250 features occupy less than 800 bytes per ATU after run-length encoding and ZSTD compression.
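As a rough illustration of two of the feature groups above, the sketch below derives a contract fingerprint (the 4-byte method ID) and two flow metrics. The `Transfer` record, the function names, and the choice to compute the in/out ratio on USD values are assumptions made for the example, not the engine's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transfer:
    token: str
    amount_usd: float  # USD value at execution time (price source is assumed)
    direction: str     # "in" or "out", relative to the user's wallet

def method_id(calldata: bytes) -> str:
    """Return the first four bytes of calldata (the EVM function selector)."""
    return "0x" + calldata[:4].hex()

def flow_features(transfers):
    """Compute the USD in/out ratio and the net USD exposure change."""
    inflow = sum(t.amount_usd for t in transfers if t.direction == "in")
    outflow = sum(t.amount_usd for t in transfers if t.direction == "out")
    ratio = inflow / outflow if outflow else None  # undefined without outflow
    return {"in_out_ratio": ratio, "net_usd_change": inflow - outflow}

# Illustrative inputs: selector followed by 32 zero bytes of arguments.
calldata = bytes.fromhex("e4e2b673") + bytes(32)
transfers = [Transfer("USDC", 1500.0, "out"), Transfer("ETH", 1490.0, "in")]
features = {"method_id": method_id(calldata), **flow_features(transfers)}
```

Both functions are pure, so the same ATU always produces the same feature values, in keeping with the deterministic features the section describes.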


4. Heuristic Rule Layer

Fast path for deterministic matches (≈ 70 % of traffic):

Heuristic ID  Match Condition                                    Tag Emitted
------------  -------------------------------------------------  --------------------
H-A1          methodId == 0xe4e2b673 (Uniswap V3 exactInput)     Swap / Trade
H-B7          Event Transfer where from == 0x0…0                 Token Mint / Airdrop
H-D2          to == 0xLido AND calldata[0] == submit(bytes32)    Staking Deposit
H-E6          Solana inner-instruction Withdraw on raydium::amm  LP Withdraw

Rules are encoded in a DSL stored on-chain, so DAO governance can hot-patch them without redeploying contracts.
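The rule DSL itself is not shown in this section, so the following sketch models rules as plain Python predicates instead. The `Rule` type, the ATU dictionary keys, and the first-match-wins policy are all assumptions; only the rule IDs, conditions, and tags for H-A1 and H-B7 come from the table.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

ZERO_ADDRESS = "0x" + "0" * 40

@dataclass(frozen=True)
class Rule:
    rule_id: str
    matches: Callable[[dict], bool]  # predicate over a (hypothetical) ATU dict
    tag: str

RULES = [
    Rule("H-A1", lambda atu: atu.get("method_id") == "0xe4e2b673", "Swap / Trade"),
    Rule(
        "H-B7",
        lambda atu: atu.get("event") == "Transfer" and atu.get("from") == ZERO_ADDRESS,
        "Token Mint / Airdrop",
    ),
]

def apply_rules(atu: dict) -> Optional[Tuple[str, str]]:
    """Return (rule_id, tag) for the first matching rule, or None."""
    for rule in RULES:
        if rule.matches(atu):
            return rule.rule_id, rule.tag
    return None  # unmatched ATUs would fall through to the ML classifier

swap = apply_rules({"method_id": "0xe4e2b673"})
mint = apply_rules({"event": "Transfer", "from": ZERO_ADDRESS})
unknown = apply_rules({"method_id": "0xdeadbeef"})
```

Keeping the rules as data rather than branching code is what would allow a rule pack to be swapped without touching the matching loop.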


5. Machine-Learning Classifier

Any ATU not labelled by heuristics flows into an XGBoost ensemble (200 trees, max-depth 9) augmented by a Mini-Transformer (2 layers, 8 heads) that embeds up to 256 bytes of calldata. Training data:

  • 3.7 million manually labelled ATUs curated by 27 crypto-savvy accountants.

  • Hard negatives generated via protocol-specific fuzzing (e.g., Uniswap router calls with swapped parameters).

  • Monthly active-learning loop: low-confidence (<0.6) classifications are surfaced to domain experts; approved corrections feed the next training epoch.
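The active-learning loop in the last bullet can be sketched as two small steps: select low-confidence predictions for review, then fold approved corrections into the next training set. The 0.6 threshold is from the text; the record layout and function names are hypothetical.

```python
REVIEW_THRESHOLD = 0.6  # from the text: below this, a prediction is surfaced

def review_queue(predictions, threshold=REVIEW_THRESHOLD):
    """Return low-confidence predictions, least confident first."""
    low = [p for p in predictions if p["confidence"] < threshold]
    return sorted(low, key=lambda p: p["confidence"])

def apply_corrections(training_set, corrections):
    """Append only expert-approved corrections; the input list is not mutated."""
    approved = [
        {"atu_id": c["atu_id"], "tag": c["corrected_tag"]}
        for c in corrections
        if c["approved"]
    ]
    return training_set + approved

predictions = [
    {"atu_id": "a1", "tag": "SWAP", "confidence": 0.97},
    {"atu_id": "a2", "tag": "TRANSFER", "confidence": 0.41},
    {"atu_id": "a3", "tag": "LP_DEPOSIT", "confidence": 0.58},
]
queue = review_queue(predictions)
corrections = [
    {"atu_id": "a2", "corrected_tag": "LP_DEPOSIT", "approved": True},
    {"atu_id": "a3", "corrected_tag": "SWAP", "approved": False},
]
next_training_set = apply_corrections([], corrections)
```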

Performance (5-fold CV on hold-out exchanges):

Metric                Value
--------------------  ------
Macro-F1 (all tags)   0.972
AUC (binary taxable)  0.991
Inference latency     1.8 ms


6. Conflict Resolution & Confidence Scoring

Situations with overlapping tags (e.g., LP Deposit vs. Transfer) are disambiguated by:

  1. Precedence graph – Domain experts assign partial orderings (Deposit > Transfer if paired token deltas ≈ zero).

  2. Bayesian averaging – Combines heuristic certainty (1.0) with ML probability.

  3. Consistency checks – Verifies conservation of value; rejects tags that break accounting equality.

Final output:

ATU_ID: 0xf3…
 tag_primary:   LP_WITHDRAW
 tag_secondary: FEE
 confidence:    0.94
 taxable_hint:  YES

Events with confidence <0.5 are flagged for manual review or user annotation inside the TaxChain dashboard.
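To make the scoring step concrete, here is a minimal sketch that blends a heuristic vote with ML class probabilities and applies the 0.5 review threshold. It uses a simple weighted mixture as a stand-in for the Bayesian averaging step, and the 0.7 weight is a hypothetical value chosen for illustration. The precedence graph and the conservation-of-value checks are not modelled.

```python
HEURISTIC_CERTAINTY = 1.0  # from the text
HEURISTIC_WEIGHT = 0.7     # hypothetical mixing weight, not from the source
REVIEW_THRESHOLD = 0.5     # from the text: below this, flag for manual review

def combine(heuristic_tag, ml_probs, weight=HEURISTIC_WEIGHT):
    """Blend a heuristic vote with ML probabilities (weighted mixture)."""
    if heuristic_tag is None:
        return dict(ml_probs)  # no heuristic match: ML probabilities stand alone
    tags = set(ml_probs) | {heuristic_tag}
    return {
        tag: weight * (HEURISTIC_CERTAINTY if tag == heuristic_tag else 0.0)
        + (1 - weight) * ml_probs.get(tag, 0.0)
        for tag in tags
    }

def resolve(atu_id, heuristic_tag, ml_probs):
    """Pick primary and secondary tags and decide whether review is needed."""
    scores = combine(heuristic_tag, ml_probs)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    primary, confidence = ranked[0]
    secondary = ranked[1][0] if len(ranked) > 1 else None
    return {
        "atu_id": atu_id,
        "tag_primary": primary,
        "tag_secondary": secondary,
        "confidence": round(confidence, 2),
        "needs_review": confidence < REVIEW_THRESHOLD,
    }

matched = resolve("0xf3", "LP_WITHDRAW",
                  {"LP_WITHDRAW": 0.80, "FEE": 0.15, "TRANSFER": 0.05})
unmatched = resolve("0x01", None,
                    {"TRANSFER": 0.45, "LP_DEPOSIT": 0.40, "FEE": 0.15})
```

With these illustrative inputs, the first ATU resolves to LP_WITHDRAW with FEE as the secondary tag, while the second, which has no heuristic match and no dominant ML class, falls below the threshold and would be routed to manual review.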


7. Jurisdiction Mapper

Generic tags (LP_WITHDRAW, NFT_SALE, AIRDROP_INCOME) are mapped to country-specific buckets via the Jurisdictional Rulebook Layer (JRL):

Tag             Germany (DE)                    Switzerland (CH)                   France (FR)
--------------  ------------------------------  ---------------------------------  -------------------------
LP_WITHDRAW     Private Veräußerung → cap-gain  Non-taxable until fiat conversion  BIC Art. 150-VH quinquies
AIRDROP_INCOME  Sonstige Einkünfte (§22)        Income taxed on receipt            Bénéfices Non Commerciaux
NFT_SALE        Cap-gain (1-year rule)          Wealth-tax & income split          Flat-tax 30 %

The mapper returns both the local category code and the applicable cost-basis method (FIFO, LIFO, average).
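At its simplest, the mapping can be pictured as a lookup keyed by tag and country. The sketch below encodes only the categories listed in this section; the `RULEBOOK` structure and `map_tag` function are hypothetical, and the cost-basis method is omitted because the section gives no per-country values for it.

```python
# (generic tag, ISO country code) -> local category, as listed in this section.
RULEBOOK = {
    ("LP_WITHDRAW", "DE"): "Private Veräußerung → cap-gain",
    ("LP_WITHDRAW", "CH"): "Non-taxable until fiat conversion",
    ("LP_WITHDRAW", "FR"): "BIC Art. 150-VH quinquies",
    ("AIRDROP_INCOME", "DE"): "Sonstige Einkünfte (§22)",
    ("AIRDROP_INCOME", "CH"): "Income taxed on receipt",
    ("AIRDROP_INCOME", "FR"): "Bénéfices Non Commerciaux",
    ("NFT_SALE", "DE"): "Cap-gain (1-year rule)",
    ("NFT_SALE", "CH"): "Wealth-tax & income split",
    ("NFT_SALE", "FR"): "Flat-tax 30 %",
}

def map_tag(tag: str, country: str) -> str:
    """Return the local category, or raise if no rule covers the pair."""
    key = (tag, country)
    if key not in RULEBOOK:
        # Failing loudly avoids silently applying another country's treatment.
        raise LookupError(f"no jurisdiction rule for {tag} in {country}")
    return RULEBOOK[key]

french_nft = map_tag("NFT_SALE", "FR")
german_airdrop = map_tag("AIRDROP_INCOME", "DE")
```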


8. Extensibility & Plugin System

  • Protocol Plug-ins – Any developer can deploy a WASM module that exposes fn classify(trace) -> TagSet. Accepted modules earn 20 % of protocol fees attributable to their tag matches.

  • Version Pinning – Each classification output references the plug-in version hash; downgrades are impossible, ensuring reproducible audits.

  • Sandboxing – Plug-ins run in WASI containers capped at 50 ms of CPU time and 1 MiB of memory; only deterministic syscalls are permitted.


9. Determinism & Auditability

Every classification decision is hashed (Keccak256(tag || features || pluginHash)) and appended to the Tax Graph. A Merkle-proof API lets auditors verify that a given ATU was classified exactly the same way at any historical time, even after rule-pack upgrades.
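The hashing and proof check can be sketched as follows. Python's standard library has no Keccak-256, so the example substitutes `hashlib.sha3_256` (NIST SHA-3, which produces different digests) purely as a stand-in. The tree layout, with the last node duplicated on odd levels, and all function names are assumptions, since the Merkle-proof API is not specified here.

```python
import hashlib

def h(data: bytes) -> bytes:
    # Stand-in only: SHA3-256 is not the Keccak-256 named in the text.
    return hashlib.sha3_256(data).digest()

def decision_hash(tag: str, features: bytes, plugin_hash: bytes) -> bytes:
    """Hash of tag || features || pluginHash, concatenated as in the text."""
    return h(tag.encode("utf-8") + features + plugin_hash)

def _next_level(level):
    if len(level) % 2:
        level = level + [level[-1]]  # duplicate last node (assumed convention)
    return [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]

def merkle_root(leaves):
    """Root of a non-empty list of leaf hashes."""
    level = list(leaves)
    while len(level) > 1:
        level = _next_level(level)
    return level[0]

def merkle_proof(leaves, index):
    """Sibling hashes from leaf to root, each with a sibling-is-left flag."""
    proof, level = [], list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof

def verify(leaf, proof, root):
    node = leaf
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root

leaves = [decision_hash(t, b"features-blob", bytes(32))
          for t in ("LP_WITHDRAW", "NFT_SALE", "AIRDROP_INCOME")]
root = merkle_root(leaves)
ok = all(verify(leaves[i], merkle_proof(leaves, i), root) for i in range(3))
tampered = verify(decision_hash("SWAP", b"features-blob", bytes(32)),
                  merkle_proof(leaves, 0), root)
```

An auditor holding only the root, one decision hash, and its proof can confirm that the decision was part of the recorded set without access to any other ATU.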


10. Roadmap Enhancements

  1. ZK-Tag Proofs – Succinct zk-SNARK circuits will allow private proof of “these ATUs are ‘capital gains’ under DE law” without revealing the underlying transaction IDs.

  2. Intent Overlay – Integration with ERC-7521 will classify off-chain intents before they settle on-chain, enabling pre-trade tax simulations.

  3. Natural-Language Explanations – An LLM-powered module (offline, deterministic prompts) will produce human-readable rationales for each tag, aiding advisor transparency.


The Classification Engine transforms raw blockchain noise into structured, jurisdiction-aware tax intelligence with cryptographic auditability and millisecond-scale latency (1.8 ms per ML inference). Its hybrid design combines rule-based certainty, machine-learning flexibility, and on-chain verifiability, helping TaxChain remain accurate, transparent, and future-proof as Web3 evolves.
