Shipping Llama 3 at a Museum: OCR, JSON, and a Qt Desktop for 10,000 Vinyls

Museum catalogs live or die on consistent metadata. For a large LP collection in ethnomusicology, that metadata often starts on the object itself: tiny type, multiple languages, stickers, and decades of wear. Typing that by hand does not scale. In the field, the pain is blunt: teams reported spending on the order of eight hours to classify roughly four records by hand, while a tool-assisted workflow later allowed on the order of fifty records in a single day. That is not a marginal tweak. It is a different operating model.

This article walks through MegaCat, a Python desktop app I built and that is used by the Ethnographic Museum of Geneva (MEG) to process their LP vinyl collection. MegaCat combines cover OCR (Tesseract), structured inference with Meta Llama 3 8B Instruct via the Hugging Face Inference API, post-processing to normalize outputs, and a PySide6 interface so curators can import batches, edit results, validate, and cross-check Discogs in a browser pane. If you are aiming for AI forward deployed roles, the through line is simple: model quality only matters once the pipeline, UX, and failure modes match how real users work.

By the end, you will understand the end-to-end flow, where the model is constrained, and what I would extend next. The code is public on GitHub: https://github.com/LucaZoss/MegaCat_LLM-App

What “done” looks like for the user: folders of LP scans in, combined OCR text, JSON with general information and per-track rows, cleaned exports, and a UI to correct mistakes before anything is treated as final.

Prerequisites

Python 3 with venv (Windows and macOS paths differ only at activation).
Tesseract installed for pytesseract, plus whatever your OS needs for pdf2image poppler backends.
A Hugging Face API token stored as HUG_MEG_SECRET in a .env file (see the repo README).
Basic familiarity with Qt style desktop apps and with JSON schema style constraints on chat completions.

The problem: messy inputs, strict catalog needs

Ethnomusicology LPs are not uniform products. Covers mix roman and non-Latin scripts, publisher and label lines that are easy to confuse, and track lists that OCR often scrambles. A generic “summarize this text” prompt would produce prose, not rows a collection management system can ingest.

MegaCat treats the job as extract, structure, clean, verify:

scans per LP folder
↓
OCR combined text per LP
↓
Llama 3 with a fixed JSON target shape
↓
cleaner merges curator fields and normalizes files
↓
Qt UI for review, validation, Discogs lookup

Architecture at a glance

Desktop (PySide6)
  Curator → Importer/Batch → Editor → Finalizer
                                    → Discogs WebView

Pipeline (ds_pipeline)
  OCRPipeline → 0_ocr_combined_txt
  → AIClassifier → 1_json_inf_outputs
  → Cleaner → 2_json_cleaned
  → Hugging Face Inference API (cloud)

The desktop shell wires pages together and persists work under a ds_pipeline directory created at startup:

# main.py
self.ds_pipeline_dir = ROOT_DIR / "ds_pipeline"
self.ds_pipeline_dir.mkdir(parents=True, exist_ok=True)
self.load_page = LoadWindow()
self.final_page = Finalizer(self.ds_pipeline_dir)
self.load_page.inference_complete.connect(self.open_editor_w_data)

That snippet is small, but it encodes a product decision: the app owns the data root, creates pipeline folders if missing, and only spins up the heavy editor after inference signals completion, so the user is not dropped into an empty state.

Orchestration: resumable OCR, then model, then clean

The Orchestrator class is the spine. It sorts LP folders, skips OCR when LPxxxx_combined.txt already exists (so reruns are cheap), runs OCR with a worker pool, then classification, then cleaning. Progress callbacks split the bar into rough thirds (OCR, model, clean), which is the kind of UX honesty field users notice.

# ai_inference/orchestrator.py
def orchestrate(self, progress_callback=None, status_callback=None):
    lp_skipped_ocr = self.run_ocr(progress_callback, status_callback)
    lp_list_for_next_steps = self.lp_sort_and_list()
    lp_to_process = lp_skipped_ocr + [
        lp for lp in lp_list_for_next_steps
        if lp not in lp_skipped_ocr
    ]
    self.run_classifier(progress_callback, status_callback, lp_to_process)
    self.run_cleaner(progress_callback, status_callback, lp_to_process)

run_ocr builds ds_pipeline/0_ocr_combined_txt, checks for existing combined files, and only queues missing work. That pattern is what makes the tool tolerable on large collections: you can stop, fix lighting, rescan one cover, and not pay the full OCR bill again for the whole museum.

Where the “AI” is: JSON contracts, not vibes

The interesting code in most assistant demos is the prompt. In production-ish tools, the interesting code is usually prompt plus response contract plus parse and fallback.

MegaCat sends a long user message that spells out fields, ordering rules, sentence case, and language translation rules, and pairs it with a response_format payload so the Hub client steers the model toward valid JSON. It calls meta-llama/Meta-Llama-3-8B-Instruct with chat.completions.create, then extracts JSON with a regex fallback if the model wraps extra text.

# ai_inference/ai_labeler_inf.py
completion = self.client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=message,
    response_format=response_format,
    temperature=0.7,
    max_tokens=1000,
    top_p=0.95,
    seed=42,
)

If parsing fails, the code still writes a minimal fallback JSON with LP_ID and an empty Track Info list so downstream steps and the UI do not crash on a single bad completion. That is the kind of guardrail that matters in a real deployment.

Dependencies and setup

Pinned deps live in requirements.txt (excerpt):

PySide6==6.8.0.2
pytesseract==0.1.6
huggingface-hub==0.26.2
python-dotenv==1.0.1
pandas==2.2.3

Typical local setup:

python -m venv venv
source venv/bin/activate   # or venv\Scripts\activate on Windows
pip install -r requirements.txt
python main.py

Field realities: connectivity and third-party lookup

The main window polls connectivity with requests and paints a red or green indicator. Offline OCR might still be useful in theory, but hosted inference needs a stable path out. The Discogs view is a QWebEngineView pointed at https://www.discogs.com, which keeps humans in familiar territory when the model is unsure.

What I would ship next

The README still lists roadmap items (richer QMessage handling, broader executables, deeper Discogs integration). If I were pitching this in an FDE loop, I would emphasize observability (per-LP latency and cost), golden-file regression tests on OCR snippets, and schema versioning so catalog migrations do not break when fields evolve.

Conclusion

MegaCat is a compact example of applied LLM work tied to a physical collection: OCR as the sensor, Llama 3 as the structured extractor, cleaning and IDs as the integration layer, and Qt as the control surface curators actually use. The measurable outcome (order of magnitude less time per record at MEG) is the headline. The implementation details — skippable OCR, JSON contracts, fallbacks, connectivity UX — are what show you can carry a model across the finish line in a real institution.

Repo: https://github.com/LucaZoss/MegaCat_LLM-App