Machine Learning Times

10 years ago
IBM’s Jeff Jonas on Baking Data Privacy into Predictive Analytics


Privacy by Design, an approach to software development that emerged in the 1990s, urges companies to bake privacy protection features into their analytic systems and processes from the outset.

While many executives have supported the notion of anonymizing personal data when using it to gain insights into consumer behavior, few have come to personify the evolution of the practice as much as Jeff Jonas, an IBM Fellow and Chief Scientist of the IBM Entity Analytics Group.

Jonas, who is founder of Systems Research & Development (SRD), which IBM acquired in 2005, is best known for his innovative “sense-making” technology, which allows organizations to gather and analyze data from a variety of sources in real time. A version of that technology, known as G2, is embedded into the latest version of IBM’s SPSS Modeler software.

On the eve of its release, Jonas, who co-authored a 2012 paper with Ann Cavoukian, the Ontario commissioner on information and privacy who developed Privacy by Design, spoke with Data Informed about G2 and the principles of embedding personal data privacy in new analytics software.

Data Informed: How did you get involved in the Privacy by Design movement?

Jeff Jonas: Prior to G2, I’d been innovating around this class of software that integrates data from different sources and helps organizations make better decisions. In 2003, I created a way to let somebody integrate data but anonymize it first. It was created as a standalone, unique product just for more privacy-protective data integration and analytics. As I’ve been hanging around with the privacy community and technologists who are passionate about this, one of the things I’ve witnessed, and I’ve sensed this in my own work, is that if you build privacy as a special feature, or something that costs extra, it doesn’t seem to get the same pickup.

One of my goals in applying Privacy by Design to the G2 project was to ask what kind of privacy features I could bake in that cost no more. In other words, they’re on by default. They’re built in. In fact, a few of them you can’t even turn off. That way, someone’s not left with a decision like, “Yeah, we trust ourselves. I don’t have to pay extra for a privacy feature. I’d rather just buy more disk space.”

So I came up with seven things in the paper that Ann [Cavoukian] and I wrote. Those are just baked in. They cost nobody any more money to do those things.

What aspects of Privacy by Design are built into G2?

Jonas: One of the Privacy by Design features that’s baked into G2 is the ability to anonymize the data at the edge, where it lives in the host system, before you bring it together to share it and combine it with other data.

I imagine all the different data in all the different systems being like different colored puzzle pieces, yellow pieces and blue pieces. What you want to do, to make higher quality estimations of risk and opportunity, is put the green, yellow, magenta and brown puzzle pieces all next to one another and see how they relate. So it’s like going from puzzle pieces to pictures. If you need to bring puzzle pieces together to see pictures, one of the Privacy by Design methods I’ve been building, called selective anonymization, allows you to anonymize things like Social Security number or driver’s license number or date of birth, whatever features the company has, at the source system, before you move the data to the table where the puzzle pieces come together to see how they relate. But even if a person sees those puzzle pieces, or steals those puzzle pieces, they can’t actually see your Social Security number or your date of birth, because it’s non-human-readable.

What is an example of how this innovation can be used in real-life situations?

Jonas: I just did a project with The Pew Charitable Trusts to help modernize voter registration in America. It allows election offices to share their data with each other, to improve the quality of the election rolls, so you don’t have people registered in multiple states. People are supposed to tell the election office when they’ve moved, but often they don’t, so you get people registered in states where they don’t live. The election offices are using this selective anonymization feature. Name and address are public record. You can go purchase that. You don’t have to obscure it. But your driver’s license, Social Security number and date of birth are not.

So what happens is, election offices anonymize those values before the data is sent to the data center, where it’s combined with the other states’ data. If you were to sit in that data center and look at the data, you would not see a single Social Security number, date of birth or driver’s license. Yet the system would still be benefiting from that data. It’s just in a non-human-readable form.

How does your technology differ from other de-identifying technologies?

Jonas: Our goal is not to hide the identity of the person. Our goal is to protect the values that you wouldn’t want to be revealed. For example, you can see the names and addresses. You know who it is. You’re not trying to make it an unidentifiable person, but the PII, the personally identifiable information, is modified in a way that is non-human readable and non-reversible. What you’re doing is you’re preventing the unintended disclosure of personal identifiers.

That feature is shipping in the next version of our predictive analytics software, SPSS Modeler. We didn’t make it a feature you’d have to pay more for. It’s free to everyone who is doing modeling with SPSS Modeler. Version 16 is shipping in the next week or two.

Before this technique, you’d actually have to match the data first and then anonymize it for the analytics. That means you have to bring it all together in human-readable text to match it. After you’ve matched it, you make a copy without the identifying features in it. But this is the first commercially available technique, I think, where you anonymize the features before you even match them.

The technique is called a one-way hash. If you have a pig and a grinder, you can make a sausage. But if I gave you a sausage and the grinder, can you make a pig? It’s just one way. You can’t go backwards.
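The selective anonymization Jonas describes can be sketched with a keyed one-way hash: PII fields are hashed at the source before records are shared, yet identical values still match because equal inputs produce equal hashes. This is an illustrative sketch only; the field names and shared key are assumptions, not details of IBM’s G2 or SPSS Modeler.

```python
import hashlib
import hmac

# Key shared by participating sources (hypothetical); without it, and because
# the hash is one-way, the original values cannot be recovered.
SECRET_KEY = b"shared-secret-held-by-participating-sources"
PII_FIELDS = {"ssn", "drivers_license", "date_of_birth"}

def anonymize(record: dict) -> dict:
    """Replace PII fields with a keyed one-way hash; leave public fields readable."""
    out = {}
    for field, value in record.items():
        if field in PII_FIELDS:
            # HMAC-SHA256: from the "sausage" (digest) you can't rebuild the "pig".
            out[field] = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
        else:
            out[field] = value
    return out

# Two states anonymize independently; the same person still matches by hash.
state_a = anonymize({"name": "Pat Smith", "ssn": "123-45-6789"})
state_b = anonymize({"name": "Pat Smith", "ssn": "123-45-6789"})
assert state_a["ssn"] == state_b["ssn"]    # matching still works
assert state_a["ssn"] != "123-45-6789"     # but the value is non-human-readable
```

Matching on hashes is what lets the data center link records across states without ever seeing a Social Security number in the clear.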

As big data analytics become more prevalent, how should privacy technology evolve?

Jonas: As analytics get more and more powerful, I think it’s responsible to build more and more privacy mechanisms into the technology, and, where you can, by default. I think that’s going to benefit the organizations that use it. Companies are strained by all the things that they want to do with analytics, and they’re asking themselves the question, “How can we be sure it’s not going to run away on us? How do we make sure it gets used properly?”

One of my seven Privacy by Design points is this tamper-resistant audit log, where you record how people have searched the system in such a way that even the database administrator can’t erase their footprints. It creates a chilling effect. Some people will go, “Wow, a tamper-resistant audit log. I’m not even going to do that search.” We acquired a company called Guardium, and that’s how I got that piece. I didn’t invent it, because it was already invented by others. But it’s embedded in the G2 engine. It’s not available to the masses yet. A light version of G2 is what is shipping with SPSS Modeler. The big sense-making version includes Guardium in it. But the tamper-resistant audit log is not an option. It’s in the box.

Why are these functions so valuable to companies looking to process large quantities of data?

Jonas: A premise of G2 is, you’re trying to bring diverse data together so you can have a more complete picture, so you can be more competitive, serve the best ad, and recognize if someone is stealing my credit card. You’re trying to bring the data together to make those higher quality predictions. The data still lives in the source systems. The thing is, when you make a copy of the data and put it into a big data system, it’s just another copy, and your risk of disclosure doubles. With these Privacy by Design principles, before you make the copy, you render it useful for some analytics but less useful for people stealing my date of birth and Social Security number.

Are there any other crucial privacy features embedded in G2?

Jonas: Another Privacy by Design feature that we can’t turn off is, every piece of data that you feed to the G2 engine, it remembers where it came from. We call it “full attribution.” It never takes two records and combines them into one and then forgets what pieces came from where. In G2, every time it gets a record, it keeps track of the source where it came from, and that record number in the source. In a sense, it’s tethered.

I’m doing some work in banking now related to money laundering. And one of the obligations that banks have is to look at OFAC [The Office of Foreign Assets Control]. There’s an OFAC list called Specially Designated Nationals, and it has thousands of identities and says we shouldn’t be doing business with these people. But if somebody gets removed from that list, how long should you wait before you stop interfering with their business? Well, from a Privacy by Design point of view, you’d say instantly. But if you don’t have full attribution in [the list] where all the data comes together, how would you know which puzzle pieces to take out?
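The full-attribution idea above, every merged entity keeping a tether back to the source and record number of each contributing piece, can be sketched as follows. Class names, source labels and record numbers here are hypothetical illustrations, not G2’s actual data model; the point is that retracting one source record (say, a name delisted from the OFAC SDN list) removes exactly those puzzle pieces and nothing else.

```python
from collections import defaultdict

class EntityStore:
    """Toy entity store where every piece stays tethered to its origin."""

    def __init__(self):
        # entity_id -> list of (source, record_no, data) "puzzle pieces"
        self.entities = defaultdict(list)

    def add(self, entity_id, source, record_no, data):
        """Attach a source record to an entity without losing its provenance."""
        self.entities[entity_id].append((source, record_no, data))

    def retract(self, source, record_no):
        """Remove exactly the pieces that came from one source record."""
        for entity_id in self.entities:
            self.entities[entity_id] = [
                p for p in self.entities[entity_id]
                if not (p[0] == source and p[1] == record_no)
            ]

store = EntityStore()
store.add("person-1", "ofac_sdn", 4711, {"name": "J. Doe"})
store.add("person-1", "bank_kyc", 88, {"name": "John Doe"})
store.retract("ofac_sdn", 4711)  # delisted: only the OFAC piece disappears
assert store.entities["person-1"] == [("bank_kyc", 88, {"name": "John Doe"})]
```

Without the (source, record_no) tether, the merged record would have no way to know which pieces to take out when a source retracts its data.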

What is the most important thing you’ve learned from your work with Privacy by Design?

Jonas: The number one thing I’ve learned from the privacy community, if I were to synthesize what I’ve learned into the fewest words, is: avoid consumer surprise. Collect the data and use the data in a way that, if revealed on the front page of the paper, would not create any consumer surprise. And that’s mainly a law and policy point of view, not much of a technology statement. For an organization to be competitive today, it had better figure out how to make sense of its data, or it’s not going to be in business. And then the next thing after that is, how can you do it in a way that’s more responsible and reduces the risk of misuse that might damage the brand?

How receptive are companies to this viewpoint these days?

Jonas: I’ll tell you, an interesting thing changed about a year ago. Up until a year ago, I would be talking about analytics, and then go, “Hey, listen to this, here’s all the things you can do to make the system more privacy- and civil-liberties-protective,” which I think you’re going to want. I would be the first to bring it up. What changed about a year ago is that now I’m hearing customers bring it up first.

By: Alec Foege
Originally published at data-informed
Alec Foege, a contributing editor at Data Informed, is a writer and independent research professional based in Connecticut, and author of the book The Tinkerers: The Amateurs, DIYers, and Inventors Who Make America Great. He can be reached at alec [at] brooksideresearch [dot] com. Follow him on Twitter: @alecfoege.
