# Beating a Decade-Old Deception Detection Benchmark
## The Michigan Benchmark: Multimodal Muscle on Real Trial Data
The Michigan team's setup is a masterclass in multimodal ambition. They scraped 121 video clips (61 deceptive, 60 truthful) from public U.S. court trials—defendants and witnesses under oath, with real stakes (freedom, convictions) that lab-elicited "mock crimes" can't touch.
Transcripts were run through LIWC (Linguistic Inquiry and Word Count), pulling 100+ normalized features like cognitive processes (e.g., % hedges like "approximately"), emotional tone, and first-person pronouns—hallmarks of cognitive load in lies per psycholinguistics lore.
Nonverbals? They hand-annotated gestures (e.g., self-adaptors like fidgeting, which spike in stress) across categories like emblems and illustrators, yielding binary flags.
Classifiers (likely SVMs or early RFs—papers are light on details) fused these at feature or decision levels: Verbal alone ~70%, gestures ~65%, multimodal fusion pushing 75-82% accuracy (and AUCs ~0.85-0.90).
No explicit cross-validation mentioned (they used train-test splits, per the 2015 EMNLP/NAACL proceedings), but the real-world stakes make it a beast for generalization testing.
Human eval? ~54% accuracy—our squishy brains lag behind silicon when spotting evasion in flat testimony. The dataset's public (h/t Michigan's LIT lab), so I downloaded it verbatim: 121 segments, balanced labels, raw transcripts + gesture binaries. Goal: Beat 82% verbal+nonverbal with *verbals only*, using off-the-shelf tools.
## My Counterpunch: Minimalism Meets Raw Power
I flipped the script—opposite everything. Where Michigan went broad and normalized, I went lean and literal. Here's the pipeline:
### Data Prep
- **Verbal Features**: Ditched LIWC's percentage outputs (e.g., % of words in the "negate" category). Percentages dilute repetition signals: a liar's five "no recollection"s in a 50-word clip is 10% negation, but the raw count of 5 screams denial harder than a normalized 0.10. Used Kris Kyle's CLA (Python word counter from UH Mānoa) with a custom Harvard General Inquirer (GI) dictionary—vintage 1960s, but with 11k+ tags across Lasswell power/affiliation buckets and GI's interpretive verbs/states. Pulled raw frequencies for *five* categories that popped in exploratory PCA (a raw-counting sketch follows this list):
- **Card_GI**: Cardinal numbers (e.g., "one," "half mile")—liars lowball specifics.
- **Sv_GI**: State verbs (e.g., "recall," "exhausted")—internal hedging overload.
- **Polit_2_GI**: Political/ideological refs (e.g., "police," "ally")—avoidance of authority.
- **Quan_GI**: Quantity assessors (e.g., "approximately," "period")—vague abundance.
- **Notlw_Lasswell**: Negations/denials (e.g., "no," "unsuccessful")—defensive caps.
Why these? GI's granularity (vs. LIWC's broader buckets) snags forensic nuances like power evasion; raw counts preserve cognitive effort (more words = more fabrication tax).
- **No Nonverbals (Yet)**: Ignored their hard-won gesture codes—binary 0/1 for adaptors, etc. (I'll fuse 'em next; binaries might drag if not scaled, per my tests).
- **Total Features**: 5 raw integer counts per segment. Michigan? 100+ normalized floats plus gesture binaries. Less is more interpretable.
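
To make the "raw counts, not percentages" idea concrete, here's a minimal counting sketch. The category word lists are tiny illustrative placeholders, not the actual Harvard GI/Lasswell entries, and the function stands in for (rather than reproduces) the CLA tool:

```python
# Minimal sketch of raw-count feature extraction (placeholder word lists,
# not the full Harvard GI / Lasswell dictionaries).
import re

GI_CATEGORIES = {
    "Card_GI":        {"one", "two", "half", "hundred"},
    "Sv_GI":          {"recall", "remember", "exhausted", "felt"},
    "Polit_2_GI":     {"police", "officer", "ally", "state"},
    "Quan_GI":        {"approximately", "about", "some", "period"},
    "Notlw_Lasswell": {"no", "not", "never", "unsuccessful"},
}

def raw_counts(transcript: str) -> dict:
    """Return raw (un-normalized) counts per category for one segment."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return {cat: sum(tokens.count(w) for w in words)
            for cat, words in GI_CATEGORIES.items()}

print(raw_counts("No, I have no recollection. Approximately one half mile, I recall."))
```
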
### Modeling: FIGS for Greedy, Readable Trees
- **Why FIGS?** I've cartwheeled through CART/boosted trees and XGBoost (JMP plugin FTW) for years, but Fast Interpretable Greedy-tree Sums (FIGS) is the sleeper hit: an additive ensemble of shallow trees (max_rules=5 here) that sums leaf "Val" scores into a lie probability (1 = lie). Greedy impurity-based splits, but ultra-readable—no XGBoost opacity. Trained on 121 samples, random_state=100 for reproducibility.
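
A minimal sketch of the fit step, assuming the five raw-count columns and a 0/1 lie label already sit in a CSV (the file and column names are placeholders, not the actual repo data):

```python
# Sketch: fit FIGS on the five raw GI counts and print its rule sums.
# "trial_segments_gi_raw.csv" and the "lie" column are placeholder names.
import pandas as pd
from imodels import FIGSClassifier

feature_cols = ["Card_GI", "Sv_GI", "Polit_2_GI", "Quan_GI", "Notlw_Lasswell"]
df = pd.read_csv("trial_segments_gi_raw.csv")
X, y = df[feature_cols], df["lie"]              # lie: 1 = deceptive, 0 = truthful

model = FIGSClassifier(max_rules=5, random_state=100)
model.fit(X, y, feature_names=feature_cols)
print(model)                                    # prints the tree sums shown below
```
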
FIGS output looks like this:
```
FIGS-Fast Interpretable Greedy-Tree Sums:
    Predictions are made by summing the "Val" reached by traversing each tree
------------------------------
Card_GI <= 0.500 (Tree #0 root)
    Sv_GI <= 6.500 (split)
        Polit_2_GI <= 0.500 (split)
            Val: 0.765 (leaf)
            Val: 0.340 (leaf)
        Quan_GI <= 9.500 (split)
            Val: 0.022 (leaf)
            Val: 0.819 (leaf)
    Val: 0.222 (leaf)
+
Notlw_Lasswell <= 1.500 (Tree #1 root)
    Val: -0.077 (leaf)
    Val: 0.246 (leaf)
------------------------------
```
- **Validation**: 5-fold CV, repeated 3x (15 folds total)—Michigan's splits were static; CV catches overfitting in small, noisy deception data. Metrics: Accuracy, precision/recall/F1 on the lie class, ROC-AUC.
```python
# FIGS with repeated, stratified 5-fold CV (15 folds total)
from imodels import FIGSClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
model = FIGSClassifier(max_rules=5)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=100)
cv_scores = cross_val_score(model, X, y, cv=cv, scoring='f1_weighted')  # X, y as above
```
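
If you want the full metric set rather than just weighted F1, sklearn's `cross_validate` can collect everything in one pass over the same folds (a sketch, reusing `model` and `cv` from the block above):

```python
# Sketch: same CV scheme, gathering accuracy, lie-class precision/recall/F1, and AUC
from sklearn.model_selection import cross_validate

scoring = {"acc": "accuracy", "prec": "precision", "rec": "recall",
           "f1": "f1", "auc": "roc_auc"}
results = cross_validate(model, X, y, cv=cv, scoring=scoring)
for name in scoring:
    print(name, round(results[f"test_{name}"].mean(), 3))
```
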
## Results: 80% Lie Recall, Verbal-Only—And Interpretable Rules
Boom: 49/61 lies flagged at a 0.5 probability threshold (80% lie recall), with ~80% accuracy overall (vs. Michigan's ~70% verbal, 75-82% fused). F1 ~0.67 avg across folds (dips to 0.52 in noisy ones, spikes to 0.90+), AUC ~0.65-0.92—volatile but domain-realistic for small N. Humans? Still ~54%. My verbal-only run edges their multimodal fusion without the annotation grind.
The rules? Gold for debugging:
- Root: Low Card_GI (sparse numbers) → Cascade to high Sv_GI (state verbs like "shock") + low Polit_2_GI → Val=0.765 (lie: detached internals, no power refs).
- High Quan_GI (hedges) → Val=0.819 (vague filler).
- High Notlw (>1.5 negations) → +0.246 (defensive boost).
Sum 'em: Lies hit >0.5; truths <0.3. E.g., on Kennedy's Chappaquiddick statement (bonus test): Last two segments (fabricated "dives") score 0.765—high Sv ("exhausted"), low Card. Portability win.
Kennedy statement segmented into 5 pieces and scored with the 5-rule FIGS model:

| Segment | figs5raw | lie |
|---|---|---|
| 1 | 0.340 | 0 |
| 2 | 0.222 | 0 |
| 3 | 0.340 | 0 |
| 4 | 0.765 | 1 |
| 5 | 0.765 | 1 |
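
Here's roughly how that out-of-domain check can be scripted, reusing the `raw_counts` sketch and the fitted `model` from above (the segment strings are placeholders, not Kennedy's actual statement):

```python
# Sketch: score new, unseen segments with the fitted FIGS model.
# new_segments holds placeholder text, not the real Chappaquiddick transcript.
new_segments = ["placeholder segment one ...", "placeholder segment two ..."]
X_new = pd.DataFrame([raw_counts(s) for s in new_segments])[feature_cols]
probs = model.predict_proba(X_new)[:, 1]        # P(lie) per segment
print(probs.round(3))
```

Now, the head-to-head with Michigan's numbers:
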
| Metric | Michigan Verbal | Michigan Multimodal | My Verbal-Only (FIGS CV Avg) |
|---|---|---|---|
| Accuracy | ~70% | 75-82% | ~80% (lie recall 80%) |
| F1 (Lies) | N/A | ~0.78 | 0.67 |
| Features | 100+ (LIWC %) | + Gestures | 5 (GI raw) |
| Validation | Train-test split | Train-test split | 5-fold CV x3 repeats |
## Why It Works: Raw > Normalized, FIGS > Fusion Bloat
Michigan's fusion is elegant but brittle—LIWC %s smooth out liar verbosity (e.g., rambling quantifiers dilute to 5%, but raw=9 flags hedging). GI's verb/state focus (Sv_GI) nails interpretive leakage better than LIWC's psych buckets; raw counts amplify repetition as effort proxy. FIGS? Its additive sums yield causal-ish rules (e.g., "low specifics + high internals = evasion") without SVM hyperparameters—perfect for forensic explainability (e.g., "This testimony hedges 12x? Prob lie=0.82").
They innovated data; I iterated methods. Ten years on, no CV or raw explorations? Room to build. (Pro tip: Binaries tanked my tests to ~65%—scale 'em next?)
## Next: Gesture Fusion and Beyond
Adding Michigan's gesture binaries to my GI raws—expect 85%? Stay tuned. Fork the repo, run on tobacco trials (UCSF docs library's a deception motherlode), or hit VERBALIE for forensic interviews. Deception detection's ripe for open-source revival—let's crowdsource better benchmarks.
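
One way I'm planning to wire that fusion up, as a sketch only (no results yet; the gesture column names and the fused DataFrame `df_fused` are placeholders, and it reuses `feature_cols`, `y`, and `cv` from the earlier snippets):

```python
# Sketch: fuse scaled gesture binaries with the raw GI counts before FIGS.
# gesture_cols are placeholder names; the real dataset's gesture codes differ.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

gesture_cols = ["adaptor", "illustrator", "emblem"]      # placeholders
fuse = ColumnTransformer([
    ("gi_raw", "passthrough", feature_cols),             # keep raw counts as-is
    ("gestures", StandardScaler(), gesture_cols),        # scale 0/1 flags
])
fused_model = Pipeline([("features", fuse), ("figs", FIGSClassifier(max_rules=5))])
fused_scores = cross_val_score(fused_model, df_fused, y, cv=cv, scoring="f1_weighted")
```
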
Thoughts? Drop a comment or PR. Code/data: [GitHub link]. #DeceptionDetection #elastictruth #InterpretableML

*(Figure: FIGS tree plot)*
