Deception Detection In Non-Verbals, Linguistics And Data.

JonBenet Ransom Note Analysis Using Syntactic Ngrams -- Or Taking The Words Away And Looking At Structure.



New state-of-the-art software is being released in various domains, much of which can help in stylometric analysis. I have finally decided to bite the bullet and move over from Matlab to R, the open source statistical software.

The best permutation and nonparametric combination (NPC) test software is now available in R:

http://caughey.mit.edu/software

This allows you to compare samples against a baseline without worrying about whether your data complies with a normal distribution, or whether you have more variables than samples, and so on. Devin Caughey has written some very nice papers on this, and his software is now available in R.
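To make the idea concrete, here is a minimal sketch of a two-sample permutation test in base R. This is not Caughey's NPC software, just the core idea behind it: shuffle the group labels, recompute the statistic each time, and see how extreme the observed difference is. No normality assumption is needed.

# Minimal two-sample permutation test in base R (illustrative only,
# not Devin Caughey's NPC package)
perm_test <- function(x, y, n_perm = 10000) {
  obs <- mean(x) - mean(y)                  # observed difference in means
  pooled <- c(x, y)
  perm_diffs <- replicate(n_perm, {
    idx <- sample(length(pooled), length(x))
    mean(pooled[idx]) - mean(pooled[-idx])  # difference under random relabelling
  })
  mean(abs(perm_diffs) >= abs(obs))         # two-sided p-value
}

set.seed(1)
perm_test(rnorm(10, mean = 1), rnorm(8))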


Now, with the release of the Stylo R package, I have well and truly moved over to R:

https://sites.google.com/site/computationalstylistics/home

This is a superb stylometry package with some of the latest developments in stylometric analysis, such as Burrows's Delta, the Bootstrap Consensus Tree, Rolling Delta, etc. These guys know their stuff and have written a great program.


Two more bits of software complete the analysis puzzle. The first is the state-of-the-art Stanford Parser from the Stanford NLP Group - https://nlp.stanford.edu/software/lex-parser.shtml


And with the advent of syntactic ngrams by Google and others, some great ideas along these lines have emerged, with software to produce them. Dr. Grigori Sidorov has an interesting site, along with some great papers he has written. He has done some interesting work on syntactic ngrams, which he calls sngrams. His site, and the software in Python -- http://www.cic.ipn.mx/~sidorov/


Also worth mentioning is the authorship-attribution software Toccata by Richard Forsyth, along with his other software. I bought Beagle from him in the eighties and still have fond memories of it. All his new stuff is in Python:
https://www.richardsandesforsyth.net/software.html

That's a roundup of the software, so let's put it all together slowly.


The Problem:


A 374-word ransom note was found at the scene of the murder, or accidental homicide, of JonBenet Ramsey. The FBI, the police and lead investigator James Kolar agree the note was part of the "staging" of the crime scene.


A staged ransom note means it is trying to portray what it is not. The writer was aware that the handwriting would be extensively analysed afterwards. This alone means that handwriting analysis (physically comparing writing) would be useless in a court of law, because a lot of effort would have been made to fake and randomise the appearance of the note, and it could never be "beyond reasonable doubt."


Linguistic Analysis:


Linguistic analysis is an option, and it has progressed in leaps and bounds over the last few years (Koppel, Eder, Rybicki, Hoover et al.).


It has been known for a long time that people tend to write with their own "style". Function words, for example "at", "by", "be", "but" and "can", provide linguistic fingerprints: people are unaware of these tiny words, and they are not context-sensitive, making them good markers in many cases.
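By way of illustration, here is a minimal sketch in base R of what a function-word profile looks like: the relative frequency of a handful of function words in a text. The word list and sample sentence are placeholders, not the actual feature set used in any study.

# Relative frequencies of a few function words (illustrative sketch)
fw <- c("at", "by", "be", "but", "can")
txt <- "But it can be done by anyone at all"
tokens <- tolower(strsplit(txt, "\\s+")[[1]])

# Count each function word and normalise by text length
table(factor(tokens[tokens %in% fw], levels = fw)) / length(tokens)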

By themselves, however, function words are not enough. And so the search is on for more markers and more software to separate the signal from the noise.

WritePrint, which is embedded in JStylo (see my earlier post), creates about 800 different variables and used to be considered the gold standard.


Another clever method, used with success in a stylometry competition by the team of Koppel, Akiva and Dagan, uses "unstable" words as markers:


http://onlinelibrary.wiley.com/doi/10.1002/asi.20428/abstract



The JonBenet Ramsey Ransom Note:


Looking at the JonBenet ransom note, it is clear that using content words would fail. In other words, pronouns probably need to be ignored, and content words cannot be used, because all ransom notes bear similarities along these lines.

One ransom note would be linked to another if you used word frequencies of "you" and "money" and "die", for example. Since the JonBenet note is staged or faked (she was dead when the note was written, and the note was purported to be from a "faction"), it is likely that there would be red herrings in the writing in order to attribute it to a radical group.


Any spelling mistakes, hyphens, strange letter formations and so on would be obvious and probably useless as markers, because the writer knew the note would be analysed. Keeping in mind the dynamics of staging, you would expect conscious errors, red herrings and the like.


What we need to do is look for unconscious style markers and text structure: things that are written out of habit. The handwriting experts noted that the last part of the note was the most fluid, so it is likely that the last part also has the most unconscious markers, with the writer concentrating on staging the note at the beginning and becoming more "free flowing" as habit took over towards the end.

It is also likely that if the crime was covered up by the parents after the son accidentally hit JonBenet on the head with a torch, in a fit of rage over her snatching some pineapple from him during a midnight snack, as per the CBS show (which seems to line up the evidence into the most likely scenario), then both parents would naturally be involved to some extent: one dictating some text or ideas, the other writing.

People write differently to how they talk, and use different parts of the brain to process written and spoken language, so one of the parents would dominate in their unconscious writing style unless the letter was being quoted verbatim (unlikely).

Parts-Of-Speech Analysis:

The idea is to take away the words, leaving the lexical structure of the ransom note.


This is easily done with the Stanford Parser, and also the Stanford Tagger, both written in Java; I have also used the MontyLingua tagger, written in Python.


What a part-of-speech tagger does is replace words with lexical categories such as verbs, nouns, pronouns, determiners, etc. The most commonly used tag set is the Penn Treebank, which has 36 tags:

[Image: table of the 36 Penn Treebank POS tags]

This means every word in the language automatically gets tagged with one of the above parts of speech. There are 6 different verb tags, and depending on the context of the writing, each word gets its assigned tag from this list.

As an example, let's look at a snippet of text from the ransom note using the word "hence", and one of Patsy's notes with the word "hence", and tag them:


1 /NN of/IN eternal/JJ life/NN and/CC hence/RB ,/, no/DT hope/NN 


2 /NN of/IN the/DT money/NN and/CC hence/RB a/DT earlier/JJR delivery/NN 

The top line (1) tells us there is a noun followed by a preposition and then an adjective in the Patsy note at the top; the ransom note below (2) is slightly different, but the lexical structure is very similar. In the parser output, each word is followed by a slash and then its tag.
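If you want to reproduce this kind of tagging without leaving R, here is a sketch using the udpipe package as a stand-in for the Stanford and MontyLingua taggers used here; the xpos column holds the Penn Treebank tags, and the model name and snippet are illustrative:

library(udpipe)

# One-off download of a pre-trained English model
m  <- udpipe_download_model(language = "english-ewt")
ud <- udpipe_load_model(m$file_model)

txt <- "of the money and hence a earlier delivery"
df  <- as.data.frame(udpipe_annotate(ud, x = txt))

# word/TAG pairs, as in the snippets above
paste(df$token, df$xpos, sep = "/")

# ...and the words-removed version: just the tag sequence
paste(df$xpos, collapse = " ")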


Looking at the ransom note now, deleting all the words and keeping only the parts of speech tags, it looks like this:


VB RB ! PRP VBP DT NN IN NNS WDT VBP DT JJ JJ NN. PRP VBP NN PRP$ NN CC

RB DT NN IN PRP VBZ. IN DT NN PRP VBP PRP$ NN IN PRP$ NN. PRP VBZ JJ CC
JJ CC IN PRP VBP PRP$ TO VB CD, PRP MD VB PRP$ NNS TO DT NN. PRP MD VB
CD, CD CD IN PRP$ NN. CD, CD MD VB IN CD NNS CC DT VBG CD, CD IN CD NNS.
VB JJ IN PRP VBP DT JJ NN NN TO DT NN. WRB PRP VBP NN PRP MD VB DT NN IN
DT JJ NN NN. PRP MD VB PRP IN CD CC CD VBP NN TO VB PRP IN NN. DT NN MD
VB VBG RB PRP VBP PRP TO VB VBN. IN PRP VBP PRP VBG DT NN JJ, PRP MD VB
PRP JJ TO VB DT JJR NN IN DT NN CC RB DT JJR NN NN IN PRP$ NN. DT NN IN
PRP$ NNS MD VB IN DT JJ NN IN PRP$ NN. PRP MD RB VB VBN PRP$ NNS IN JJ
NN. DT CD NNS VBG IN PRP$ NN VBP RB RB IN PRP RB PRP VBP PRP RB TO VB
PRP. VBG TO NN IN PRP$ NN, JJ IN NNP,NN, FW, MD VB IN PRP$ NN VBG
VBD. IN PRP VBP PRP VBG TO DT JJ NN, PRP VBZ. IN PRP JJ NN NNS, PRP VBZ.
IN DT NN VBZ IN DT NN VBN CC VBD IN, PRP VBZ. PRP MD VB VBN IN JJ NNS
CC IN DT VBP VBN , PRP VBZ. PRP MD VB TO VB PRP CC VB VBN IN PRP VBP JJ
IN NNP NN NNS CC NNS. PRP VBP DT CD NN IN VBG PRP$ NN IN PRP VBP TO IN
JJ PRP. VB PRP$ NNS CC PRP VBP DT CD NN IN VBG PRP$ RB. PRP CC PRP$ NN
VBP IN JJ NN IN RB IN DT NNS. VB NN TO VB DT NN NNP. PRP VBP RB DT RB JJ
NN IN RB VBP RB VBP IN VBG MD VB JJ. VB VB PRP NNP. VB IN JJ JJ JJ NN
IN PRP. PRP VBZ IN TO PRP RB NNP!

This is the ransom note with all the words and content deleted, leaving only the Penn Treebank tags such as nouns and adjectives. So we have reduced the text to its basic lexical structure of 36 tags.


We do this with all of Patsy's notes, about 15,000 words, and John Ramsey's letter of 10,000 words. We also add in two genuine ransom notes: the short Robert Wiles note and the very long 982-word ransom note from the Barbara Mackle kidnapping.


Running all the POS tags through R using the brilliant Stylo package and producing a Bootstrap Consensus Tree, we get this output:

[Figure: Bootstrap Consensus Tree of the POS-tagged texts]

Using NO words, only parts of speech, the POS structure of one of Patsy's notes is similar to the ransom note, while the other genuine ransom notes get binned together as similar, and the two Christmas notes get put together too.
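For reference, a Bootstrap Consensus Tree run like this can be invoked from the stylo package along the following lines. The file layout and parameter values are my assumptions (one plain-text file per document, each containing only its POS tag sequence, so every tag is counted as a "word"):

library(stylo)

# Assumed layout: a folder "corpus" with one file per document
# (e.g. patsy_note1.txt, ransom_note.txt, mackle_note.txt, ...)
stylo(gui = FALSE,
      corpus.dir = "corpus",
      analysis.type = "BCT",            # Bootstrap Consensus Tree
      analyzed.features = "w",          # word-level features (here: POS tags)
      ngram.size = 1,
      mfw.min = 10, mfw.max = 30,       # sweep over the most frequent tags
      mfw.incr = 10,
      distance.measure = "dist.delta")  # Burrows's Delta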

Using a clustering algorithm, where the closest and most similar texts are clumped together, this dendrogram is produced from the twenty most frequent POS tags:

[Figure: cluster analysis dendrogram on the twenty most frequent POS tags]

This lumps Patsy with the ransom note, places her other notes close to John's, and leaves the real kidnapping notes from Wiles and Mackle on the outskirts of Patsy and the JonBenet ransom note.


Now, asking the software to classify who wrote the note, or more accurately, who is the closest match, and using one of the best classifiers with a proven track record in authorship attribution, the SVM classifier, Patsy is determined to be the author.


Using one of the most recent and powerful algorithms for determining the distance, i.e. the closeness of the match, the Burrows Delta, which is included in the package along with modifications such as Eder's Delta and Argamon's Delta, the output is again Patsy as the author.
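Both classification runs can be sketched with stylo's classify() function; the SVM and Delta analyses differ only in the classification.method argument. The folder names are stylo's defaults; everything else is an illustrative assumption.

library(stylo)

# Known-author texts go in primary_set/, the disputed ransom note
# in secondary_set/ (stylo's default folder names).
# SVM requires the e1071 package to be installed.
classify(gui = FALSE,
         training.corpus.dir = "primary_set",
         test.corpus.dir = "secondary_set",
         classification.method = "svm",  # swap in "delta" for Burrows's Delta
         analyzed.features = "w",
         mfw.min = 10, mfw.max = 30)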


Is there a way to get more linguistic structure out of the writing, i.e. more information than POS tags can give us?


Yes there is. This brings us to:

Syntactic Ngrams

Part 1 - Parsing Text To Create A Dependency Tree:


Recall that POS tags (above) give us lexical structure: a word is replaced with a verb or noun tag. But this tells us nothing about the syntactic dependency tree structure: what the subject and object of the sentence are, which word is at the head (root) of the tree, and so on.

We are now going to extract syntactic information. This is very different from POS tags / parts of speech.

http://demo.ark.cs.cmu.edu/parse/about.html

What we extract with syntactic parsing is the tree structure of a sentence: which word is the object, which word is dependent on another, creating a tree structure that is non-linear. This means the words in a sentence are not listed by the parser in the order they are written, but in the order assessed to be syntactically correct according to a dependency tree.


The critical takeaway point is that syntactic structure is NON-LINEAR, meaning the order of the sentence from the parser is different to how it was written. The state-of-the-art Stanford Parser has an accuracy of about 97% and reveals the syntactic structure of text without words, as a first step!


An example of the parser output for the sentence:


The boy with the brown eyes ate the cake.


det(boy-2, The-1)
nsubj(ate-7, boy-2)
case(eyes-6, with-3)
det(eyes-6, the-4)
amod(eyes-6, brown-5)
nmod(boy-2, eyes-6)
root(ROOT-0, ate-7)
det(cake-9, the-8)
dobj(ate-7, cake-9)

Root is at the top of the tree; the line above it in the output is a noun modifier (nmod), and "brown" at -5 (the 5th word) is dependent on "eyes" at -6. There are around 50 relation tags from the dependency parser, such as determiners, nominal subjects, etc.
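The Stanford Parser itself is a Java tool, but as a rough R-side stand-in, the udpipe package produces the same kind of head/dependent triples. Note it uses Universal Dependencies labels, which differ slightly from the Stanford labels above (e.g. obj instead of dobj):

library(udpipe)

m  <- udpipe_download_model(language = "english-ewt")
ud <- udpipe_load_model(m$file_model)
df <- as.data.frame(udpipe_annotate(ud,
        x = "The boy with the brown eyes ate the cake."))

# Each row links a token to its head: this is the tree, not the sentence order
df[, c("token_id", "token", "xpos", "head_token_id", "dep_rel")]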


Onwards now to:

Part 2 - Ngrams


Ngrams have been used for a long time and are one of the most reliable indicators of authorship (Sidorov 2014). Ngrams can be characters or words. You can think of them as a sliding window:


Using the above sentence again, which comes from a Google PowerPoint presentation about their ngrams:

The boy with the brown eyes ate the cake.

A bigram, or 2-unit ngram, is a 2-word sliding window:

The Boy, Boy With, With The, The Brown and so on.

A trigram is 3 words or characters per unit (words in our example) and goes like this:
The Boy With, Boy With The, With The Brown, The Brown Eyes, Brown Eyes Ate and so on.

Two to five ngram units are the most useful in authorship (Sidorov).
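A sliding-window n-gram extractor is only a few lines of base R; this sketch works on whitespace-separated tokens and reproduces the bigram and trigram lists above:

# Plain (linear) word n-grams as a sliding window
word_ngrams <- function(text, n) {
  tokens <- strsplit(text, "\\s+")[[1]]
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

word_ngrams("The boy with the brown eyes ate the cake", 2)
# "The boy" "boy with" "with the" "the brown" "brown eyes" ...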



Part 3 - Syntactic Ngrams


The final piece of this puzzle is the syntactic ngram. Google has used them to index several million books and 320 billion ngrams, with its ngram viewer:


https://books.google.com/ngrams


This is a simplistic interface though, and can only be used for frequencies; however, more sophisticated analysis is possible by downloading the Google ngram data.


Notice a problem in the last trigram string above:


Brown Eyes Ate


This is obviously misleading and won't help with the text analysis of that sentence, i.e. the subject is missing. You never get this output when you use syntactic ngrams, so they are far more powerful, contain more information and are more relevant to the text being analysed!

And once again, the beauty of syntactic ngrams is that they are non-linear: they contain structural information in a different order, according to the parser tree.
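The simplest syntactic bigrams can be read straight off a dependency parse: pair each head with its dependent, following the tree rather than the page. Here is a single-sentence sketch reusing the udpipe data frame df from the parsing example above (Sidorov's Python tools produce richer variants, including longer paths):

# Syntactic bigrams: head -> dependent pairs taken from the tree
syntactic_bigrams <- function(df) {
  kids  <- df[df$head_token_id != "0", ]                    # drop the root row
  heads <- df$token[match(kids$head_token_id, df$token_id)]
  paste(heads, kids$token, sep = " -> ")
}

syntactic_bigrams(df)
# e.g. "ate -> boy", "boy -> The", "eyes -> brown", "ate -> cake" ...
# "brown eyes ate" can never appear: the pairs follow the tree.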

As mentioned, this example is from a Google presentation as they explain the purpose of their ngram viewer.

But there is more power in these little guys yet!
Thanks to Dr. Grigori Sidorov, we can produce mixed syntactic ngrams, which he calls sngrams: you can mix the syntactic tags from the parser with POS tags (above), words or lemmas.

You now have mixed sngrams, or sngrams with relations, which he calls snrgrams.

He has a site and software in Python to create various sngrams in different sizes, along with some interesting papers:

http://www.cic.ipn.mx/~sidorov/



The takeaway point is that text goes into the Stanford Parser, the output from that goes into the sngram software, and the output from that is sngrams, or snrgrams (if you mixed them), of various sizes, i.e. bigrams, trigrams, etc.
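The mixed relation-plus-POS units that appear in the output below (e.g. nsubj[NN]) can be sketched from the same udpipe data frame by gluing each token's dependency relation to its Penn tag. Sidorov's software does this properly across different sizes and combinations; this is just the single-unit idea:

# Mixed sn-gram units: dependency relation + POS tag
# ('df' is the udpipe parse of the cake sentence from the earlier sketch)
sn_units <- paste0(df$dep_rel, "[", df$xpos, "]")
print(sn_units)
# "det[DT]" "nsubj[NN]" "case[IN]" "det[DT]" "amod[JJ]" "nmod[NNS]" ...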

Long story short: these snrgrams have been shown to be the most powerful use of ngrams in various applications!

http://www.g-sidorov.org/Sidorov2014_IJCLA.pdf


The JonBenet ransom note is coded as a 2-unit snrgram (bigram) with syntactic tags and POS tags.

We are using the power of syntactic tags and syntactic POS tags, which contain more linguistic structure information than ever.

The output of the ransom note looks like this:

root[RB] nmod[IN] root[NNS] root[VBP] acl:relcl[NN] dobj[DT]

acl:relcl[WDT] root[PRP] dobj[JJ] root[DT] nmod[VBP] ccomp[IN]
ccomp[PRP] root[NNS] dobj[PRP$] cc[CC] root[VBZ] root[PRP] conj[DT]
dobj[RB] dobj[NN] nmod[IN] dobj[PRP$] nmod[DT] nmod[PRP$] root[NN]
root[PRP] conj[VBP] dobj[PRP$] advcl[IN] advcl[VB] advcl[PRP] conj[NNS]
nmod[DT] conj[PRP] root[JJ] conj[NN] nmod[TO] xcomp[TO] root[VBZ]
root[CC] root[PRP] root[VB] conj[MD] xcomp[CD] nmod[IN] dobj[$] root[CD]
nmod[PRP$] root[NN] root[PRP] root[MD] nmod[IN] nmod[$] conj[JJ]
conj[NNS] root[CD] nsubj[$] root[$] root[IN] root[CC] conj[DT] root[VB]
conj[$] amod[CD] root[MD] ccomp[IN] ccomp[PRP] root[VBP] dobj[JJ]
nmod[DT] dobj[DT] root[JJ] nmod[TO] ccomp[NN] dobj[NN] nmod[IN]
root[VBP] advcl[PRP] nmod[DT] advcl[NN] nmod[NN] root[NN] root[PRP]
nmod[JJ] advcl[WRB] root[MD] dobj[DT] dobj[NN] nmod[CC] nmod[IN]
nmod:tmod[RB] advcl[PRP] advcl[NN] root[NN] root[PRP] advcl[TO] root[VB]
dobj[CD] root[MD] nmod[CD] xcomp[VB] root[VBP] advcl[JJ] advcl[PRP]
advcl[IN] xcomp[TO] nsubj[DT] root[NN] root[VB] root[MD] xcomp[VB]
xcomp[PRP] conj[RB] advcl[VBG] root[JJ] nmod[PRP$] advcl[IN] dobj[JJR]
conj[DT] nmod[DT] root[PRP] root[MD] root[VBP] advcl[PRP] dobj[DT]
dep[PRP] xcomp[TO] dep[RB] nmod[IN] conj[NN] dep[NN] conj[JJR] dobj[CC]
xcomp[NN] dobj[NN] nmod[IN] nsubj[NNS] nmod[DT] nmod[PRP$] nmod[NN]
nsubj[DT] root[NN] nmod[JJ] root[MD] nmod[IN] root[NNS] dobj[PRP$]
root[NN] root[PRP] root[VB] nmod[JJ] root[RB] root[MD] xcomp[PRP]
root[NNS] dobj[PRP$] advcl[VB] advcl[PRP] root[RB] root[VBP] xcomp[RB]
acl[RP] xcomp[TO] nsubj[DT] root[PRP] advcl[IN] nsubj[VBG] nsubj[CD]
acl[NN] nmod[IN] nmod[VBN] nmod[FW] nmod[NNS] root[VBG] case[IN]
nmod[PRP$] nmod[TO] nmod[NNP] nmod[NN] root[NN] csubj[NN] nmod[JJ]
root[MD] acl[VBG] root[VBP] advcl[PRP] nmod[DT] advcl[VBG] dep[PRP]
nmod[TO] dep[NN] root[PRP] advcl[IN] nmod[JJ] advcl[PRP] root[VB]
advcl[NNS] root[PRP] advcl[IN] dobj[NN] advcl[VBN] acl[VBN] advcl[DT]
acl[IN] advcl[NN] advcl[VBZ] acl[CC] nsubj[DT] root[NN] root[PRP]
advcl[IN] nmod[IN] root[NNS] advcl[DT] advcl[IN] conj[PRP] root[VBZ]
root[CC] root[PRP] root[VB] nmod[JJ] root[MD] advcl[VBP] conj[VBN]
ccomp[IN] xcomp[PRP] conj[VB] ccomp[NNS] conj[JJ] nmod[IN] nmod[NNS]
ccomp[PRP] nmod[CC] nmod[NN] root[MD] xcomp[TO] root[CC] root[PRP]
ccomp[VBP] root[VB] nmod[NNP] root[VBN] xcomp[PRP] acl[NN] dobj[PRP$]
advcl[VB] advcl[PRP] xcomp[RB] dobj[DT] acl[IN] xcomp[TO] root[NN]
root[PRP] advcl[IN] dobj[NN] amod[CD] acl[VBP] dobj[VBG] root[NNS]
root[VBP] conj[PRP] dobj[DT] acl[IN] conj[NN] acl[NN] root[CC]
dobj[PRP$] dobj[VBG] amod[CD] dobj[NN] root[NNS] root[VBP] cc[RB]
nsubj[CC] root[JJ] root[IN] conj[PRP$] cc[IN] root[PRP] conj[DT]
nsubj[NN] root[RB] dobj[DT] xcomp[TO] root[VB] xcomp[NNP] root[RB]
dobj[NN] ccomp[IN] acl:relcl[VBP] root[VBP] root[RB] advmod[IN] root[JJ]
acl:relcl[JJ] ccomp[MD] ccomp[NN] root[PRP] amod[RB] root[VB] ccomp[VB]
root[DT] acl:relcl[RB] root[VB] xcomp[PRP] root[RB] root[NNP] nmod[IN]
dobj[NNP] dobj[DT] root[NN] dobj[JJ] advmod[RB] advmod[PRP] nmod[TO]
root[VBZ] root[PRP] root[IN] root[RB]


Again, there are no words here, just ngrams with syntactic structure that is NON-LINEAR, not in the same order as written.


Doing this for all the texts as before, and using the Stylo R package, gave the following results...

Using single-word analysis in Stylo with various occurrence thresholds for the most frequent sngrams was nearly the same as using only 2-character chunks from the sngrams, which in turn was nearly the same as using 4-character sngram combinations with frequencies up to the 500 most used sngram character combinations: they all red-flagged Patsy as the most likely author!

[Figures: Stylo results across the different sngram settings]

In other words, this was the most stable output of any analysis I have done, over a whole range of settings, showing that the sngrams contain the relevant syntactic information despite the lack of words!


As a final note, I should mention that I also used the sngrams as input into JStylo, the authorship attribution software from Drexel University, with the same Enron corpus etc. from my earlier post, and just like the results above, the sngrams increased the likelihood of Patsy being the author.

Let me know if you have any questions. 

A project I have in the pipeline is to use sngrams for lie detection in written statements.

Coming soon! 

2 comments

  1. Great article. It's a shame that this type of AI-based analysis has not been established in forensics. If done properly, the result could be as reliable as that of DNA analysis (if not more so, since we are certain of the source of our data).

    It would be interesting to get some p-values into the results in order to assess their statistical significance (the probability that the conclusions were not reached by chance). If you get significant p-values you should publish your work in an academic journal.

  2. Hi Asimos, thanks for your comments. Regarding p-values, I am using NPC TEST, the nonparametric combination software http://static.gest.unipd.it/~salmaso/NPC_TEST.htm and also the R script version by Devin Caughey of MIT. This makes no assumptions about the distributions and allows analysis of small samples. I haven't used it for this analysis, but I am gearing up for major new posts. I haven't updated for a long time, as I have been getting up to speed with various statistical procedures and graphics. I will have all new material shortly. Thanks for your support and suggestions; I take on board what you say about p-values. Cheers.


© ElasticTruth
