Simulating an Order Book: XGBoost Events, Gaussian Returns, and Gamma Volumes

If you study algorithmic trading, you quickly hit a tension. Live markets are expensive to replay at full fidelity, and toy models (pure Poisson arrivals, IID noise) look nothing like the bursty, directional microstructure you see in real message feeds. In 2024, our group set out to build a market simulation that stayed close to historical limit order book behavior: the same order-type vocabulary, similar return and size marginals, and a time axis fine enough to reason in milliseconds, which we then resample to a coarser, more stable grid for fitting.

This repository is the public snapshot of that work: Python, pandas time logic, scipy distributions, an XGBoost classifier served from xgboost_model.pkl, and two classes that own the story end to end: MarketSim (ingest, fit, simulate) and SimulationRunner (repeat runs, append logs). The goal was not a production exchange simulator. It was a credible synthetic generator for coursework and research prototypes, with a clear separation between statistical price and volume sampling and ML-guided event timing.

Code: https://github.com/LucaZoss/Orderbook-MarketSimulation_AlgoTrading

Prerequisites

Python 3 with pip and a virtual environment.
Dependencies from requirements.txt (core stack is numpy, pandas, matplotlib, scipy, scikit-learn, joblib; xgboost is implied by the pickled model and training notebook).
CSV inputs under data/: message_data.csv and orderbook_data.csv in the layout expected by MarketSim.
Comfort reading LOBSTER-style event types (numeric Type column) and basic order book columns.

pip install -r requirements.txt

The problem: correlated order types, heavy data, fragile Hawkes shortcuts

Real message streams are autocorrelated. A naive Poisson model for order-type counts is easy to code but assumes independence between events. A multivariate Hawkes process is attractive on paper, yet when the group tried HawkesLib it hit a practical wall: the library is poorly maintained, and it struggled at our data scale, on the order of hundreds of thousands of rows.

The compromise we shipped is supervised learning for the next event type: build a target that shifts event_type forward one row, engineer lags on time, size, price, mid, and direction, then use predict_proba to drive a stream of probabilities. The goal was less about top-1 accuracy and more about capturing internal patterns well enough to seed a simulator.
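The target and lag construction can be sketched on a toy table. This is a minimal illustration, not the repository's training code: it assumes that "shifting forward" means labeling each row with the event type of the row that follows it, and the column name next_event_type and all values are invented for the example.

```python
import pandas as pd

# Toy stand-in for the merged table; values are invented for illustration.
df = pd.DataFrame({
    "Time": [0.10, 0.25, 0.40, 0.90, 1.30, 1.45],
    "Size": [100, 50, 200, 100, 75, 30],
    "event_type": [1, 2, 2, 5, 1, 8],
})

# Target: the event type observed on the NEXT row (shift(-1) pulls it up one row).
# "next_event_type" is a hypothetical column name.
df["next_event_type"] = df["event_type"].shift(-1)

# Features: values from previous rows (shift(lag) pushes them down by `lag` rows).
for lag in range(1, 3):
    df[f"Time_lag{lag}"] = df["Time"].shift(lag)
    df[f"Size_lag{lag}"] = df["Size"].shift(lag)

# The first rows lack lags and the last row lacks a target, so drop them.
train = df.dropna().astype({"next_event_type": int})
print(train[["event_type", "next_event_type", "Size_lag1", "Size_lag2"]])
```

Only rows that have both a full lag history and a known next event survive the dropna, which is why a classifier trained this way never sees the first few or the very last message of the day.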

Stage 1: engineering the merged LOB plus message table

MarketSim.__init__ loads both CSVs, assigns message columns, renames LOB depth columns in repeating ask_price / ask_size / bid_price / bid_size groups, merges on Time, filters message Type 5, derives mid_price and a buy or sell calc_direction, then maps composite keys into an integer event_type.

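# src/market_sim2_xgb.py
# Drop message Type 5 rows before deriving features.
self.merge_df = self.merge_df[self.merge_df['Type'] != 5]
# Mid price from the best ask and best bid.
self.merge_df['mid_price'] = (
    self.merge_df['ask_price_1'] + self.merge_df['bid_price_1']) / 2
# Label the event 'Buy' when its price is above the mid, otherwise 'Sell'.
self.merge_df['calc_direction'] = np.where(
    self.merge_df['mid_price'] < self.merge_df['Price'], 'Buy', 'Sell')
# Map the composite '<Type>_<direction>' keys to integer codes 1-8.
self.merge_df['event_type'] = self.merge_df['event_type'].replace({
    '1_Sell': 1, '1_Buy': 2, '2_Sell': 3, '2_Buy': 4,
    '3_Sell': 5, '3_Buy': 6, '4_Sell': 7, '4_Buy': 8
}).astype(int)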
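To make the composite key step concrete, here is a small self-contained sketch on invented rows. The key format "<Type>_<direction>" is an assumption inferred from the keys of the mapping above; the excerpt does not show how the repository builds it.

```python
import numpy as np
import pandas as pd

# Toy rows; column names follow the excerpt above, values are invented.
df = pd.DataFrame({
    "Type": [1, 4, 5, 3, 1],
    "Price": [100.10, 99.90, 100.00, 100.20, 99.80],
    "ask_price_1": [100.10, 100.10, 100.10, 100.20, 100.20],
    "bid_price_1": [99.90, 99.90, 99.90, 100.00, 99.80],
})

df = df[df["Type"] != 5].copy()   # drop Type 5 rows, as the excerpt does
df["mid_price"] = (df["ask_price_1"] + df["bid_price_1"]) / 2
df["calc_direction"] = np.where(df["mid_price"] < df["Price"], "Buy", "Sell")

# Assumed key format "<Type>_<direction>", inferred from the mapping's keys.
key = df["Type"].astype(str) + "_" + df["calc_direction"]
codes = {"1_Sell": 1, "1_Buy": 2, "2_Sell": 3, "2_Buy": 4,
         "3_Sell": 5, "3_Buy": 6, "4_Sell": 7, "4_Buy": 8}
df["event_type"] = key.map(codes).astype(int)
print(df[["Type", "calc_direction", "event_type"]])
```

The sketch uses map rather than replace so that an unexpected key becomes NaN and the integer cast fails loudly instead of leaving a stray string in the column.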

Stage 2: XGBoost probabilities with lag features

simulate_event_xgb_prob loads the pickle with joblib, label-encodes direction, builds three lags of the key fields, drops NA rows, and standardizes the lag columns. It then seeds the loop from the last observed feature vector and, for each future step, reads predict_proba and rolls the feature vector forward using the argmax of the most recent predicted distribution.

# src/market_sim2_xgb.py
for lag in range(1, 4):
    df_r.loc[:, f'Time_lag{lag}'] = df_r['Time'].shift(lag)
    df_r.loc[:, f'Size_lag{lag}'] = df_r['Size'].shift(lag)
    df_r.loc[:, f'Price_lag{lag}'] = df_r['Price'].shift(lag)
    df_r.loc[:, f'mid_price_lag{lag}'] = df_r['mid_price'].shift(lag)
    df_r.loc[:, f'calc_direction_lag{lag}'] = df_r['calc_direction'].shift(lag)
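The rollout loop can be sketched without the pickled model. Everything here is a simplification under stated assumptions: StubModel is a hypothetical stand-in that only mimics predict_proba, the feature vector holds three lagged event types rather than the repository's Time, Size, Price, mid and direction lags, and the update rule (shift the lags, insert the argmax class as the newest one) is one plausible reading of "rolls the feature vector".

```python
import numpy as np

class StubModel:
    """Hypothetical stand-in for the pickled classifier; only predict_proba is mimicked."""

    def __init__(self, n_classes, seed=0):
        self.n_classes = n_classes
        self.rng = np.random.default_rng(seed)

    def predict_proba(self, X):
        # One row of class probabilities per input row, each row summing to 1.
        return self.rng.dirichlet(np.ones(self.n_classes), size=len(X))

def rollout(model, last_features, num_steps):
    """Feed each prediction back in as the newest lag (assumed update rule)."""
    features = np.asarray(last_features, dtype=float).copy()
    rows = []
    for _ in range(num_steps):
        proba = model.predict_proba(features.reshape(1, -1))[0]
        rows.append(proba)
        # Assumes class index 0 corresponds to event_type 1.
        predicted = int(np.argmax(proba)) + 1
        features = np.roll(features, 1)   # lag1 -> lag2, lag2 -> lag3
        features[0] = predicted           # newest value becomes lag1
    return np.vstack(rows)

probs = rollout(StubModel(n_classes=8), last_features=[2, 1, 5], num_steps=4)
print(probs.shape)
```

Because each step conditions on the previous argmax rather than on observed data, errors compound over long horizons, which is worth keeping in mind when choosing num_steps.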

Stage 3: price returns with Gaussian MLE, volumes with Gamma fits

Price path: resample mid_price to 1-minute OHLC bars separately for the buy and sell sides, compute close-to-close returns, then fit them with scipy.stats.norm.fit and draw synthetic returns with np.random.normal.
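The price path can be sketched for a single side on toy data. The tick values and the calendar date are invented, and the sketch computes the Normal MLE directly with pandas: for a Normal distribution the maximum likelihood estimates are the sample mean and the population (ddof=0) standard deviation, which is what norm.fit returns.

```python
import numpy as np
import pandas as pd

# Toy mid-price ticks; Time is seconds after midnight, values are invented.
ticks = pd.DataFrame({
    "Time": [0.5, 20.0, 59.0, 61.0, 100.0, 125.0, 170.0],
    "mid_price": [100.0, 100.2, 100.1, 100.3, 100.4, 100.2, 100.6],
})

# Resampling needs a datetime-like index; the calendar date is arbitrary.
ticks.index = pd.DatetimeIndex(
    pd.Timestamp("2024-01-02") + pd.to_timedelta(ticks["Time"], unit="s"))

# 1-minute OHLC bars, then close-to-close simple returns.
bars = ticks["mid_price"].resample("1min").ohlc()
close = bars["close"]
returns = (close / close.shift(1) - 1).dropna()

# Normal MLE: sample mean and population (ddof=0) standard deviation.
mu = returns.mean()
sigma = returns.std(ddof=0)

# Seeded generator so the synthetic draws are reproducible.
rng = np.random.default_rng(42)
synthetic_returns = rng.normal(loc=mu, scale=sigma, size=5)
print(mu, sigma)
```

Running the same steps once on buy-side rows and once on sell-side rows would give the two return pools the kernel draws from.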

Volume path: split Size by event_type (limit and market buckets in the code), resample Size to 1-minute OHLC bars, then fit with gamma.fit and sample with gamma.rvs, with guards for empty or ill-conditioned fits.

# src/market_sim2_xgb.py
# Normal MLE: norm.fit returns (mean, standard deviation).
mean_returns_mle, std_returns_mle = norm.fit(df[df_price_col].dropna())
sample_gaussian_returns = np.random.normal(
    loc=mean_returns_mle, scale=std_returns_mle, size=num_samples)

# gamma.fit returns (shape, loc, scale); sample with the same parameters.
c, loc, scale = gamma.fit(data_clean)
sample_gamma_volume = gamma.rvs(c, loc=loc, scale=scale, size=num_samples)
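The guards mentioned for the volume fit are not shown in the excerpt, so the following is only one way they might look. The helper name fit_gamma_or_none and the min_points threshold are hypothetical, and pinning the location with floc=0 is a choice made for this sketch; the excerpt above fits loc freely.

```python
import numpy as np
from scipy.stats import gamma

def fit_gamma_or_none(sizes, min_points=10):
    """Return (shape, loc, scale), or None when the data cannot support a fit."""
    data = np.asarray(sizes, dtype=float)
    data = data[np.isfinite(data) & (data > 0)]   # Gamma support is positive
    if data.size < min_points or np.ptp(data) == 0:
        return None                               # too few points or no spread
    try:
        # floc=0 pins the location at zero: an assumption of this sketch.
        c, loc, scale = gamma.fit(data, floc=0)
    except Exception:
        return None                               # the optimizer failed
    if not np.isfinite([c, loc, scale]).all() or c <= 0 or scale <= 0:
        return None
    return c, loc, scale

rng = np.random.default_rng(0)
params = fit_gamma_or_none(rng.gamma(shape=2.0, scale=50.0, size=500))
if params is not None:
    c, loc, scale = params
    volumes = gamma.rvs(c, loc=loc, scale=scale, size=5, random_state=rng)
    print(params, volumes)
```

Returning None instead of raising lets the caller decide what to do with a bucket that has no usable data, for example skipping that order type for the run.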

Stage 4: the market kernel

simulate_market walks the probability grid row by row. When a column's probability exceeds 0.5, it parses the numeric suffix of the column name as order_type, draws a return from the appropriate buy or sell pool, updates the price, draws a volume from the matching Gamma pool, and appends a row. The implementation also branches into partial and full cancellation paths for other event indices.

# src/market_sim2_xgb.py
for index, row in probabilities_df.iterrows():
    for col in probabilities_df.columns:
        if row[col] > 0.5:
            order_type = int(col.split('_')[-1])
            if order_type == 1:
                return_price = random.choice(sim_returns_sell_norm)
                price = start_price * (1 + return_price)
                volume = random.choice(sim_vol_sell_lim_gam)
                results.append(
                    {'Time': index, 'OrderType': 1, 'Direction': -1,
                     'Price': price, 'Volume': volume})
                start_price = price
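The same threshold logic can be written with a lookup table instead of one branch per order type. This is a sketch, not the repository's kernel: run_kernel, the pools dictionary, the prob_N column names and all pool values are hypothetical, and only order types 1 and 2 are handled, so cancellation paths are skipped.

```python
import random
import pandas as pd

def run_kernel(probabilities_df, pools, start_price, threshold=0.5, seed=0):
    """Emit one synthetic order per column that clears the threshold."""
    rng = random.Random(seed)
    price = start_price
    results = []
    for index, row in probabilities_df.iterrows():
        for col in probabilities_df.columns:
            if row[col] <= threshold:
                continue
            order_type = int(col.split("_")[-1])
            if order_type not in pools:
                continue                  # cancellations are not modelled here
            direction, return_pool, volume_pool = pools[order_type]
            price = price * (1 + rng.choice(return_pool))   # chain the price
            results.append({"Time": index, "OrderType": order_type,
                            "Direction": direction, "Price": price,
                            "Volume": rng.choice(volume_pool)})
    return pd.DataFrame(results)

# Hypothetical pools: order_type -> (direction, return draws, volume draws).
pools = {
    1: (-1, [-0.001, 0.0005], [100, 250]),   # limit sell
    2: (+1, [0.001, -0.0005], [80, 300]),    # limit buy
}
probabilities_df = pd.DataFrame({
    "prob_1": [0.7, 0.4, 0.1, 0.1],
    "prob_2": [0.2, 0.4, 0.8, 0.2],
    "prob_3": [0.1, 0.2, 0.1, 0.7],
})
orders = run_kernel(probabilities_df, pools, start_price=100.0)
print(orders)
```

Since each row of probabilities sums to 1, at most one column can exceed 0.5, and a row where the model is unsure (the second row here) emits no order at all. The threshold therefore controls how sparse the synthetic stream is.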

Orchestration and logging

run_market_simulation wires together preprocessing, simulate_event_xgb_prob, price and volume simulation, and finally simulate_market. SimulationRunner.run_simulations constructs a fresh MarketSim per iteration, appends each run's sim_data to simulation_results.csv, and can return one combined DataFrame when concat=True.
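The repeat-and-append pattern can be sketched with a stub in place of MarketSim. The function simulate_once and its output columns are hypothetical, the log is written to a temporary directory, and returning None when concat is False is an assumption of this sketch.

```python
import os
import tempfile
import pandas as pd

def simulate_once(run_id):
    """Hypothetical stand-in for one MarketSim run; returns a tiny order log."""
    return pd.DataFrame({"run": [run_id, run_id],
                         "Price": [100.0 + run_id, 100.1 + run_id],
                         "Volume": [100, 50]})

def run_simulations(num_runs, log_path, concat=True):
    frames = []
    for run_id in range(num_runs):
        sim_data = simulate_once(run_id)          # fresh simulation per iteration
        # Write the header only when the file does not exist yet.
        write_header = not os.path.exists(log_path)
        sim_data.to_csv(log_path, mode="a", header=write_header, index=False)
        frames.append(sim_data)
    return pd.concat(frames, ignore_index=True) if concat else None

log_path = os.path.join(tempfile.mkdtemp(), "simulation_results.csv")
all_runs = run_simulations(3, log_path)
logged = pd.read_csv(log_path)
print(len(all_runs), len(logged))
```

Tagging each row with a run identifier, as the stub does, makes it possible to separate simulated days again after they have been appended to one file.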

Plain text flow:

message CSV + LOB CSV
↓
merge, mid_price, event_type
↓
XGBoost rolling proba over num_steps
↓
fit Normal returns + Gamma volumes
↓
threshold kernel → synthetic orders
↓
append CSV logs per simulation

Conclusion

This project is a 2024-era algorithmic trading coursework artifact that still reads well as a systems story: merge heterogeneous LOB inputs, respect scale limits when choosing Hawkes versus ML, fit interpretable marginals, and keep a thin orchestration layer so you can rerun synthetic days and log them for downstream strategy experiments.

If you revive it today, the highest-leverage upgrades are parameterized file paths and step counts (no hard-coded num_steps), deterministic seeds surfaced in the API, xgboost pinned in requirements.txt, and unit tests on filter_and_calculate and on simulate_market invariants.

Repo: https://github.com/LucaZoss/Orderbook-MarketSimulation_AlgoTrading