Big Data and text analytics are all the rage in the field of law today. Yet precise definitions of these technologies, and a clear understanding of what they actually do, are hard to come by. That is the void this article will try to fill.
Text analytics – number-based
First, there is number-based text analytics. Say I'm looking for documents about patents. Looking at each document in my set, I can count how many times the word “patent” occurs within it, if at all. The more the merrier: documents that contain the word “patent” more often are probably more related to patents than other documents. So, by counting the number of times that “patent” is mentioned, I can group similar documents.
But there are also other words in these documents! No problem: we will compute the ratio of our target word “patent” to all the words in the document. But there are also other documents! For that, we will look at how many of the other documents the word “patent” occurs in. If it occurs in too many of them, then our document is nothing special and should not be singled out as one about patents.
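The two ratios just described are the classic TF-IDF weighting: term frequency within a document, scaled down when the term appears in many documents. A minimal sketch, using a toy corpus of hypothetical documents:

```python
import math

# Toy corpus: each "document" is just a string (hypothetical examples).
docs = [
    "the patent was filed and the patent was granted",
    "the court reviewed the patent briefly",
    "the contract was signed by both parties",
]

def tf_idf(term, doc_index, docs):
    words = docs[doc_index].split()
    # Term frequency: ratio of the target word to all words in the document.
    tf = words.count(term) / len(words)
    # Document frequency: in how many documents the word occurs at all.
    df = sum(1 for d in docs if term in d.split())
    # Inverse document frequency: words that are rare across the corpus
    # score higher; words that occur everywhere score near zero.
    idf = math.log(len(docs) / df)
    return tf * idf

scores = [tf_idf("patent", i, docs) for i in range(len(docs))]
```

The first document mentions “patent” most often relative to its length, so it gets the highest score; the contract document scores zero.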
Thus, I can build a chart that groups together documents with similar word frequencies within the document and similar frequencies of those words across other documents. Here is an example of such a technique applied to finding important words in newspaper articles.
This is number-based text analytics. Is it helpful? Probably yes. Does it really understand what the documents are talking about? Everyone will agree: far from it.
Based on the very simple number crunching described above, the computer can do the “show more documents like this” trick. Is it extremely precise? No — it will very likely contain enough noise to become, to put it mildly, “not so useful.”
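The “more like this” trick can be sketched with nothing more than word counts: represent each document as a word-count vector and rank the others by cosine similarity to it. The corpus below is hypothetical:

```python
from collections import Counter
import math

# Toy corpus (hypothetical documents).
docs = [
    "patent filed patent granted",
    "patent reviewed by the court",
    "contract signed by both parties",
]

def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(count * b[word] for word, count in a.items())
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

def more_like_this(query_index, docs):
    vectors = [Counter(d.split()) for d in docs]
    query = vectors[query_index]
    others = [i for i in range(len(docs)) if i != query_index]
    # Rank the remaining documents by similarity to the query document.
    return sorted(others, key=lambda i: cosine(query, vectors[i]), reverse=True)

ranking = more_like_this(0, docs)
```

For the first patent document, the other patent document ranks above the contract — but note the noise: any shared word, however incidental, raises the score.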
Besides this palpable need for improvement, the previous few paragraphs might have scared away anyone who is not into “data science” or “number crunching.” As lawyers like to joke, “I went into law so that I would NOT have to do math.” Let’s respect this feeling.
Text analytics – grammar-based
There is also a mature branch of text analytics dealing with the analysis of language itself, known as text engineering. One example of this approach is GATE, a long-running project from the University of Sheffield that has become one of the standard tools for text analytics. GATE stands for General Architecture for Text Engineering.
GATE can reasonably break down text into paragraphs, sentences, nouns and verbs. It can also find people, companies, organizations and places mentioned in the documents. How does it do this? For example, the first rule for sentence detection is “It must end with a period (dot), and there must be a word starting with an uppercase letter after it.” As you can imagine, this rule will indeed catch perhaps 80% of the sentences found in typical documents. Then there will be some exceptions — abbreviations like “Mr.”, for instance — and some additional rules to catch them. The detection will never be 100 percent perfect (even human reviewers may disagree on how to break up sentences), but it can get pretty accurate.
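A toy illustration of just that first rule — not GATE’s actual splitter, which layers many more rules on top — can be written as a single regular expression:

```python
import re

# Naive splitter implementing only the rule quoted above: a sentence
# boundary is a period, then whitespace, then an uppercase letter.
# (GATE's real sentence splitter uses many more rules than this.)
def split_sentences(text):
    return re.split(r"(?<=\.)\s+(?=[A-Z])", text)

split_sentences("The court met on Monday. The judge ruled quickly.")
```

The rule handles the easy cases, but a sentence like “Mr. Smith filed suit.” is wrongly split at the abbreviation — exactly the kind of exception that the additional rules must catch.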
For people, companies, places, etc., GATE uses the concept of gazetteers. A gazetteer is simply a list of all possible values: for example, all counties in the U.S., if we are trying to detect the county court in a legal proceeding.
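In its simplest form, a gazetteer lookup is just a scan of the text against fixed lists of known values. A minimal sketch, with hypothetical entries:

```python
# Toy gazetteer (hypothetical entries), keyed by entity type.
GAZETTEER = {
    "county": ["Harris County", "Cook County", "Travis County"],
    "court": ["Court of Appeals", "District Court"],
}

def find_entities(text):
    # Report every gazetteer value that occurs verbatim in the text.
    found = []
    for entity_type, values in GAZETTEER.items():
        for value in values:
            if value in text:
                found.append((entity_type, value))
    return found

entities = find_entities("The case moved from Harris County to the Court of Appeals.")
```

Real gazetteer components also handle case, word boundaries, and multi-word overlaps, but the core idea is this list lookup.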
The screenshot below shows GATE configured to analyze U.S. Court of Appeals documents. It can detect judges, courts, counsel, etc. The tag panel on the right is convenient: once you click the checkbox that says “Date,” you will see all the dates, in all the formats GATE has detected, highlighted in the given document.
Well, how do you use this display — couldn’t you just read the document yourself? The answer is that technologies like GATE extract meaningful entities at scale, adding them to the existing metadata fields. The standard metadata fields number a few dozen and include fields like “author,” “date created,” “date received,” “recipient,” etc. Using entity extraction, one can add many more fields describing the document’s contents, actors, places, etc.
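Conceptually, the enrichment step is just merging the extracted entities into the document’s metadata record. A sketch with hypothetical field names and values:

```python
# Standard metadata fields that come with the document (hypothetical values).
doc_metadata = {
    "author": "J. Smith",
    "date_created": "2015-03-02",
    "recipient": "General Counsel",
}

# Fields produced by entity extraction (hypothetical values).
extracted_entities = {
    "judges": ["Judge Alvarez"],
    "courts": ["Court of Appeals"],
    "dates_mentioned": ["March 2, 2015", "04/15/2015"],
}

# The enriched record carries both the standard and the extracted fields.
enriched = {**doc_metadata, **extracted_entities}
```

Downstream tools — search, review platforms, visualizations — can then filter and group on the new fields just as they do on the standard ones.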
That, in turn, will help to form an accurate picture of the case. For example, the screenshot below shows each document as a larger white circle, while people and places extracted from it are shown as smaller filled circles. If you can further act on this chart by drilling in or zooming out, it can be used as a very effective investigation tool.
(Image courtesy of DARPA Memex.)
The sheer number of documents may present a problem by itself. Let’s do a simple back-of-the-napkin calculation. A million documents may take a million seconds to process. This assumes that a document takes a second — not unlikely if you think of the time it takes to open a document in Microsoft Word. If you add the optical character recognition (OCR) required for scanned documents, one second may even seem low.
A million seconds is about 16,700 minutes — roughly 280 hours, or more than eleven days. Now, that’s too much. However, if I had a hundred computers to share the work, it would take under three hours. That is reasonable. And this is where Big Data comes in.
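The napkin arithmetic, spelled out:

```python
# One second per document, a million documents.
documents = 1_000_000
seconds_total = documents * 1

# Sequential processing on one machine.
hours_sequential = seconds_total / 3600   # roughly 278 hours, over 11 days

# The same work split evenly across a hundred machines.
machines = 100
hours_parallel = hours_sequential / machines   # under three hours
```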
Big Data deals with data sets too large to be stored and processed on one computer. It is capable of connecting billions of people around the globe and of navigating billions of cars through millions of places. But for our purposes, we can simply think of it as a glue that combines our hundred computers into a cluster and lets them work together on one problem. This glue is called Hadoop. A more modern version, which uses in-memory computing and is much faster, is called Spark. Feats such as processing a million documents on a hundred machines in a few hours are commonplace there.
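The pattern Hadoop and Spark run across a cluster can be shown in miniature in a single process — this is only a sketch of the map-reduce idea, not actual Hadoop or Spark code:

```python
from collections import Counter

# Map phase: each "machine" counts words in its own share of the documents.
def map_phase(doc):
    return Counter(doc.split())

# Reduce phase: the partial counts from all machines are merged into one.
def reduce_phase(partial_counts):
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

docs = ["patent filed", "patent granted", "contract signed"]
word_counts = reduce_phase(map_phase(d) for d in docs)
```

Because each map call is independent, the work parallelizes naturally: a cluster runs the map phase on a hundred machines at once and merges the results — which is exactly the napkin math above.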
So what’s missing?
By: Mark Kerzner, Chief Product Architect, LexInnova
Originally published at https://bol.bna.com
This excerpt is from Bloomberg BNA.