# Beating a Decade-Old Deception Detection Benchmark
## The Michigan Benchmark: Multimodal Muscle on Real Trial Data
The Michigan team's setup is a masterclass in multimodal ambition. They scraped 121 video clips (61 deceptive, 60 truthful) from public U.S. court trials—defendants and witnesses under oath, with real stakes (freedom, convictions) that lab-elicited "mock crimes" can't touch.
Transcripts were run through LIWC (Linguistic Inquiry and Word Count), pulling 100+ normalized features like cognitive processes (e.g., % hedges like "approximately"), emotional tone, and first-person pronouns—hallmarks of cognitive load in lies per psycholinguistics lore.
Nonverbals? They hand-annotated gestures (e.g., self-adaptors like fidgeting, which spike in stress) across categories like emblems and illustrators, yielding binary flags.
Classifiers (likely SVMs or early RFs—papers are light on details) fused these at feature or decision levels: Verbal alone ~70%, gestures ~65%, multimodal fusion pushing 75-82% accuracy (and AUCs ~0.85-0.90).
No explicit cross-validation mentioned (they used train-test splits, per the 2015 EMNLP/NAACL proceedings), but the real-world stakes make it a beast for generalization testing.
Human eval? ~54% accuracy—our squishy brains lag behind silicon when spotting evasion in flat testimony. The dataset's public (h/t Michigan's LIT lab), so I downloaded it verbatim: 121 segments, balanced labels, raw transcripts + gesture binaries. Goal: Beat 82% verbal+nonverbal with *verbals only*, using off-the-shelf tools.
## My Counterpunch: Minimalism Meets Raw Power
I flipped the script—opposite everything. Where Michigan went broad and normalized, I went lean and literal. Here's the pipeline:
### Data Prep
- **Verbal Features**: Ditched LIWC's percentage outputs (e.g., % of words in the "negate" category). Percentages dilute repetition signals: a liar's five "no recollection"s in a 50-word clip is 10% negation, but the raw count of 5 screams denial harder than a normalized 0.10. Used Kris Kyle's CLA (Python word counter from UH Mānoa) with a custom Harvard General Inquirer (GI) dictionary—vintage 1960s, but with 11k+ tags across Lasswell power/affiliation buckets and GI's interpretive verbs/states. Pulled raw frequencies for *five* categories that popped in exploratory PCA (a raw-counting sketch follows this list):
- **Card_GI**: Cardinal numbers (e.g., "one," "half mile")—liars lowball specifics.
- **Sv_GI**: State verbs (e.g., "recall," "exhausted")—internal hedging overload.
- **Polit_2_GI**: Political/ideological refs (e.g., "police," "ally")—avoidance of authority.
- **Quan_GI**: Quantity assessors (e.g., "approximately," "period")—vague abundance.
- **Notlw_Lasswell**: Negations/denials (e.g., "no," "unsuccessful")—defensive caps.
Why these? GI's granularity (vs. LIWC's broader buckets) snags forensic nuances like power evasion; raw counts preserve cognitive effort (more words = more fabrication tax).
- **No Nonverbals (Yet)**: Ignored their hard-won gesture codes—binary 0/1 for adaptors, etc. (I'll fuse 'em next; binaries might drag if not scaled, per my tests).
- **Total Features**: 5 raw integer counts per segment. Michigan? 100+ normalized floats plus gesture binaries. Less is more interpretable.
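
To make the "raw counts, not percentages" idea concrete, here's a minimal counting sketch. The category word lists are tiny illustrative placeholders, not the actual Harvard GI/Lasswell entries, and the function stands in for (rather than reproduces) the CLA tool:

```python
# Minimal sketch of raw-count feature extraction (placeholder word lists,
# not the full Harvard GI / Lasswell dictionaries).
import re

GI_CATEGORIES = {
    "Card_GI":        {"one", "two", "half", "hundred"},
    "Sv_GI":          {"recall", "remember", "exhausted", "felt"},
    "Polit_2_GI":     {"police", "officer", "ally", "state"},
    "Quan_GI":        {"approximately", "about", "some", "period"},
    "Notlw_Lasswell": {"no", "not", "never", "unsuccessful"},
}

def raw_counts(transcript: str) -> dict:
    """Return raw (un-normalized) counts per category for one segment."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return {cat: sum(tokens.count(w) for w in words)
            for cat, words in GI_CATEGORIES.items()}

print(raw_counts("No, I have no recollection. Approximately one half mile, I recall."))
```
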
### Modeling: FIGS for Greedy, Readable Trees
- **Why FIGS?** I've cartwheeled through CART/boosted trees and XGBoost (JMP plugin FTW) for years, but Fast Interpretable Greedy-tree Sums (FIGS) is the sleeper hit: an additive ensemble of shallow trees (max_rules=5 here) that sums leaf "Val" scores into a lie probability (1 = lie). Greedy impurity-based splits, but ultra-readable—no XGBoost opacity. Trained on 121 samples, random_state=100 for reproducibility.
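
A minimal sketch of the fit step, assuming the five raw-count columns and a 0/1 lie label already sit in a CSV (the file and column names are placeholders, not the actual repo data):

```python
# Sketch: fit FIGS on the five raw GI counts and print its rule sums.
# "trial_segments_gi_raw.csv" and the "lie" column are placeholder names.
import pandas as pd
from imodels import FIGSClassifier

feature_cols = ["Card_GI", "Sv_GI", "Polit_2_GI", "Quan_GI", "Notlw_Lasswell"]
df = pd.read_csv("trial_segments_gi_raw.csv")
X, y = df[feature_cols], df["lie"]              # lie: 1 = deceptive, 0 = truthful

model = FIGSClassifier(max_rules=5, random_state=100)
model.fit(X, y, feature_names=feature_cols)
print(model)                                    # prints the tree sums shown below
```
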
FIGS output looks like this:
```
FIGS-Fast Interpretable Greedy-Tree Sums:
    Predictions are made by summing the "Val" reached by traversing each tree
------------------------------
Card_GI <= 0.500 (Tree #0 root)
    Sv_GI <= 6.500 (split)
        Polit_2_GI <= 0.500 (split)
            Val: 0.765 (leaf)
            Val: 0.340 (leaf)
        Quan_GI <= 9.500 (split)
            Val: 0.022 (leaf)
            Val: 0.819 (leaf)
    Val: 0.222 (leaf)
+
Notlw_Lasswell <= 1.500 (Tree #1 root)
    Val: -0.077 (leaf)
    Val: 0.246 (leaf)
------------------------------
```
- **Validation**: 5-fold CV, repeated 3x (15 folds total)—Michigan's splits were static; CV catches overfitting in small, noisy deception data. Metrics: Accuracy, precision/recall/F1 on the lie class, ROC-AUC.
```python
# FIGS with repeated, stratified 5-fold CV (15 folds total)
from imodels import FIGSClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
model = FIGSClassifier(max_rules=5)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=100)
cv_scores = cross_val_score(model, X, y, cv=cv, scoring='f1_weighted')  # X, y as above
```
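
If you want the full metric set rather than just weighted F1, sklearn's `cross_validate` can collect everything in one pass over the same folds (a sketch, reusing `model` and `cv` from the block above):

```python
# Sketch: same CV scheme, gathering accuracy, lie-class precision/recall/F1, and AUC
from sklearn.model_selection import cross_validate

scoring = {"acc": "accuracy", "prec": "precision", "rec": "recall",
           "f1": "f1", "auc": "roc_auc"}
results = cross_validate(model, X, y, cv=cv, scoring=scoring)
for name in scoring:
    print(name, round(results[f"test_{name}"].mean(), 3))
```
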
## Results: 80% Lie Recall, Verbal-Only—And Interpretable Rules
Boom: 49/61 lies flagged at a 0.5 probability threshold (80% lie recall), with ~80% accuracy overall (vs. Michigan's ~70% verbal, 75-82% fused). F1 ~0.67 avg across folds (dips to 0.52 in noisy ones, spikes to 0.90+), AUC ~0.65-0.92—volatile but domain-realistic for small N. Humans? Still ~54%. My verbal-only run edges their multimodal fusion without the annotation grind.
The rules? Gold for debugging:
- Root: Low Card_GI (sparse numbers) → Cascade to high Sv_GI (state verbs like "shock") + low Polit_2_GI → Val=0.765 (lie: detached internals, no power refs).
- High Quan_GI (hedges) → Val=0.819 (vague filler).
- High Notlw (>1.5 negations) → +0.246 (defensive boost).
Sum 'em: Lies hit >0.5; truths <0.3. E.g., on Kennedy's Chappaquiddick statement (bonus test): Last two segments (fabricated "dives") score 0.765—high Sv ("exhausted"), low Card. Portability win.
Kennedy statement segmented into 5 pieces and scored with the 5-rule FIGS model:

| Segment | figs5raw | lie |
|---|---|---|
| 1 | 0.340 | 0 |
| 2 | 0.222 | 0 |
| 3 | 0.340 | 0 |
| 4 | 0.765 | 1 |
| 5 | 0.765 | 1 |
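
Here's roughly how that out-of-domain check can be scripted, reusing the `raw_counts` sketch and the fitted `model` from above (the segment strings are placeholders, not Kennedy's actual statement):

```python
# Sketch: score new, unseen segments with the fitted FIGS model.
# new_segments holds placeholder text, not the real Chappaquiddick transcript.
new_segments = ["placeholder segment one ...", "placeholder segment two ..."]
X_new = pd.DataFrame([raw_counts(s) for s in new_segments])[feature_cols]
probs = model.predict_proba(X_new)[:, 1]        # P(lie) per segment
print(probs.round(3))
```

Now, the head-to-head with Michigan's numbers:
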
| Metric | Michigan Verbal | Michigan Multimodal | My Verbal-Only (FIGS CV Avg) |
|---|---|---|---|
| Accuracy | ~70% | 75-82% | ~80% (lie recall 80%) |
| F1 (Lies) | N/A | ~0.78 | 0.67 |
| Features | 100+ (LIWC %) | + Gestures | 5 (GI raw) |
| Validation | Train-test split | Train-test split | 5-fold CV x3 repeats |
## Why It Works: Raw > Normalized, FIGS > Fusion Bloat
Michigan's fusion is elegant but brittle—LIWC %s smooth out liar verbosity (e.g., rambling quantifiers dilute to 5%, but raw=9 flags hedging). GI's verb/state focus (Sv_GI) nails interpretive leakage better than LIWC's psych buckets; raw counts amplify repetition as effort proxy. FIGS? Its additive sums yield causal-ish rules (e.g., "low specifics + high internals = evasion") without SVM hyperparameters—perfect for forensic explainability (e.g., "This testimony hedges 12x? Prob lie=0.82").
They innovated data; I iterated methods. Ten years on, no CV or raw explorations? Room to build. (Pro tip: Binaries tanked my tests to ~65%—scale 'em next?)
## Next: Gesture Fusion and Beyond
Adding Michigan's gesture binaries to my GI raws—expect 85%? Stay tuned. Fork the repo, run on tobacco trials (UCSF docs library's a deception motherlode), or hit VERBALIE for forensic interviews. Deception detection's ripe for open-source revival—let's crowdsource better benchmarks.
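
One way I'm planning to wire that fusion up, as a sketch only (no results yet; the gesture column names and the fused DataFrame `df_fused` are placeholders, and it reuses `feature_cols`, `y`, and `cv` from the earlier snippets):

```python
# Sketch: fuse scaled gesture binaries with the raw GI counts before FIGS.
# gesture_cols are placeholder names; the real dataset's gesture codes differ.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

gesture_cols = ["adaptor", "illustrator", "emblem"]      # placeholders
fuse = ColumnTransformer([
    ("gi_raw", "passthrough", feature_cols),             # keep raw counts as-is
    ("gestures", StandardScaler(), gesture_cols),        # scale 0/1 flags
])
fused_model = Pipeline([("features", fuse), ("figs", FIGSClassifier(max_rules=5))])
fused_scores = cross_val_score(fused_model, df_fused, y, cv=cv, scoring="f1_weighted")
```
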
Thoughts? Drop a comment or PR. Code/data: [GitHub link]. #DeceptionDetection #elastictruth #InterpretableML

*(Figure: FIGS tree plot)*
