Thank you, internets, for all the feedback I’ve gotten on BoomTime: Risk As Economics. Of course my slides are nigh indecipherable without my voiceover, and my notes didn’t make it to the slideshare, so here are some notes to fill in (some) of the blanks until the video hits YouTube (SiRA members will get early access to SiRAcon15 videos via the SiRA Discourse forum, BTW). (You will want to look at the notes and the slides side by side, probably, as one doesn’t make sense w/o the other.)
An intro here is that in addition to being a product manager specializing in designing large-scale, data-driven security/anti-fraud/anti-abuse automation (yep, that’s a thing), I’m also an economics nerd. (Currently working on an MS in Applied Econ at JHU). Given my background in payments, and a general penchant for “following the money”, framing technology problems on platforms through an economic/financial lens is second nature.
Themes of Security Economics
A list of typical themes one hears when discussing information security & economics: within businesses we are asked to talk about exposures and threats in terms of financial impact, or to consider the financial (money) drivers. Information asymmetry (Market for Lemons) is also a big theme of information economics and of software markets in general: when information about the quality of a product is difficult to find, that lack of transparency drives down prices, and we get fewer incentives to improve quality. (Ask me questions about market signals as a mechanism for correcting information asymmetries.) “Make it more expensive for the attacker” or “don’t outrun the bear, outrun the guy next to you” is also an idea that gets raised. Game theory, concepts of quantifying “risk” (exposure, tolerance), and markets for exploits & vulns are hot topics at the moment, as are behavioral economics and all things related to incentive design – gamification being the most buzzwordy example, perhaps, but framing as a method for improving consumers’ ability to make good choices related to privacy preferences is also something that has come up a bit lately in security economics research. Anyway, these are some themes that tend to be repeated in recent research literature.
I just finished giving a third version of a presentation that I put together on lessons Infosec/Risk/Platform owners can learn from classic Operations Research/Management Science type work. The talk (“Operating * By the Numbers”) was shared in Reykjavik (Nordic Security Conference), Seattle (SIRACon 2013), and in Silicon Valley (BayThreat). Thanks to everyone who attended, especially those of you who asked questions and provided feedback.
A few folks have asked for reading lists. Some asked for the quick run-through sample from my bookshelf, others want some further reading. Here’s the quick run through:
- Introduction to Mathematical Statistics and Its Applications (5th Edition), Richard J. Larsen and Morris L. Marx
- Out of Control: The New Biology of Machines, Social Systems, & the Economic World, Kevin Kelly
- The Illuminatus! Trilogy, Robert Shea & Robert Anton Wilson
- How to Protect Yourself from Crime, Ira Lipman (Guardsmark)
- Hackers: Heroes of the Computer Revolution – 25th Anniversary Edition, Steven Levy
- Computer Crime: A Crimefighter’s Handbook, David Icove, Karl Seger, William VonStorch
- Maximum Security: A Hacker’s Guide to Protecting Your Internet Site and Network, Anonymous
- Information Security Risk Analysis, Thomas R Peltier
- A First Course in Probability, Sheldon Ross
- Strategy, Basil H. Liddell Hart
- Mostly Harmless Econometrics: An Empiricist’s Companion, Joshua D. Angrist and Jörn-Steffen Pischke
- The Dilbert Principle, Scott Adams
- Introduction to Topology: Third Edition, Bert Mendelson
- Exploratory Data Analysis (Quantitative Applications in the Social Sciences), Frederick Hartwig with Brian E Dearing
- Game Theory Evolving: A Problem-Centered Introduction to Modeling Strategic Interaction (Second Edition), Herbert Gintis
- Practical Statistics Simply Explained (Dover Books on Mathematics), Russell Langley
- Excel Data Analysis For Dummies, Stephen Nelson
- Operations Management: Contemporary Concepts, Roger Schroeder
And I also want to give another shout-out to Combat Modeling, by Alan Washburn and Moshe Kress, of the Naval Postgraduate School. It’s a pricey text, but take a look at the table of contents & the topics they cover. Really interesting work to consider for control system designers.
Also, I haven’t read these personally but they are on my “to read” list as they came recommended by fellow quant/risk nerds:
- The Principles and Applications of Decision Analysis : 2 Volume Set, Ronald A. Howard and James E. Matheson
- Decision Analysis for the Professional (pdf link), Peter McNamee & John Celona
And here’s a link to one of my blog posts (Quant Ops), which includes a few references and some thinking on the topic from a different angle.
I was distracted earlier this week by a thread on the SIRA mailing list. I found myself reacting to a comment that suggested maybe quantitative risk mgmt is “just” plain ol’ SIEMs plus some stats/machine learning. That ended up being a bit of a hot button for a few folks on the list, because a very interesting discussion then got going about data architecture options versus how common security-industry tuned tools work, which is worth a whole dedicated discussion. In any case it put me into a contemplative mood about SIEMs, since I am of two minds about them depending on what environment I’m working in: it’s the “any port in a storm” vs “when you have a hammer everything looks like a nail” thing. But regarding SIEM vs databases, or anomaly detection vs ML, or whatever:
- While acknowledging that apples and pears are both fruit, some people prefer to cut their fruit (agnostic to apple-ness or pear-ness) with very sharp ceramic knives vs, say, good ol’ paring knives, depending on the dish being prepared.
- That said, the bowl you put fruit salad into may need to be different (waterproof, airtight, bigger) than a bowl one puts whole fruits in.
- Also, in an even more Zen-like tangent: no matter what bowl or what fruit or what knife is being selected, if you’re making fruit salad you’re going to have to spend some time cleaning the fruit before cutting and mixing it. If the bowl the whole fruits were in is especially dirty, or say, a crate – or a rusty bucket – you may want to spend more time cleaning.
I was going for something Zen.
But I’m not very Zen, I’m pedantic, so here’s some explanation of the analogy:
Apples & Pears are both fruit
- System logs are data that is usually stored in logfiles. Security devices generate system logs, and so do other devices. Errors are often logged, or system usage/capacity. Servers, clients, applications, routers, switches, firewalls, anti-virus systems — all kinds of systems generate logs.
- Financial records, human resource records, customer relationship management records are data that are usually stored in databases. Some may be generic databases, others may be built specifically for the application in question.
- There are also data types that are kind of a cross between the two, for example – a large consumer facing website may have account data. You are a customer, you can login and see information associated with your account – if it’s an email service, previous emails. If it’s an e-commerce site, maybe you can see previous transactions. You can check to make sure your alma mater or favorite funny kitten gif is listed correctly on your account profile. It’s not system logs, and it’s not internal corporate records – it’s data that’s part of the service/application. This type of data is usually stored in a database, though there might be metadata associated with the activity stored in logs.
- In another mood, I might delve further into this criss-cross category, which often results in a “you’ve got your chocolate in my peanut butter…you’ve got your peanut butter in my chocolate” level of fisticuffs.
- But, it’s all DATA.
People have different tool preferences when it comes to cutting fruit
Some capabilities of data-related tools:
- Comparing across tables
- Pattern analysis / visualization
- Frequency analysis
- Simple mathematical operations (addition, subtraction, ranking)
- More advanced mathematical operations (exponential functions, regressions, statistical tests, quantile analysis)
- Sentiment analysis or text/string mining
- Blah blah etcer-blah
Basic capabilities tend to be common, or directly comparable, across tools. For example, here’s an article that compares some of the commands that can be used in a traditional SQL database to similar functions in Splunk, a popular SIEM.
The point is, while many tools have many of the desired features, there may be tradeoffs. A product might make it really easy to conduct filtering (via an awesome GUI and pseudocode) and still have limitations when it comes to extracting a set of events across multiple tables that meets ad hoc-developed, but still quite technically specific, criteria. Or, a tool might excel in rapid access to recent records, but crash if there’s a long-term historical trend to analyze. Or, it can be a gem if you’re trying to do some statistical analysis of phenomena but too resource intensive to be used in a production environment.
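As a small illustration of how a basic capability like frequency analysis carries across tools, here is a minimal sketch that runs the same count twice: once as SQL against an in-memory SQLite database, and once in plain Python with no database at all. The event list and its field names are made up for the example; nothing here reflects a particular SIEM's query language.

```python
import sqlite3
from collections import Counter

# Hypothetical sample events: (source system, severity). Not real data.
EVENTS = [
    ("firewall", "warn"),
    ("firewall", "error"),
    ("webserver", "error"),
    ("webserver", "error"),
    ("antivirus", "info"),
]


def counts_via_sql(events):
    """Frequency analysis using a relational database (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (source TEXT, severity TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", events)
    rows = conn.execute(
        "SELECT source, COUNT(*) FROM events GROUP BY source"
    ).fetchall()
    conn.close()
    return dict(rows)


def counts_via_python(events):
    """The same frequency analysis with no database at all."""
    return dict(Counter(source for source, _ in events))


sql_counts = counts_via_sql(EVENTS)
py_counts = counts_via_python(EVENTS)

# Both routes should agree on how many events each source produced.
print(sql_counts == py_counts)
```

The answer is the same either way; what differs is everything around it (how the data got there, how fast it comes back at scale, how easy the next ad hoc question is), which is where the tradeoffs live.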
People have different use cases for cutting fruit
- In some cases data is kept only to diagnose and resolve a problem later
- In some cases data is kept in order to satisfy retention requirements in case someone else wants to diagnose/confirm an event later
- In some cases data is kept because we’re trying to populate a historic baseline so that in the future we have something against which to compare current data
- in some cases data is kept so that we can analyze it and predict future activity/behavior/usage
- In some cases data is kept because it is part of the service / product being supported
Ops is different from Marketing. Statisticians are not often the same people doing system maintenance on a network. Etc.
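To make the "historic baseline" use case from the list above a bit more concrete, here is a minimal sketch that compares a current value against stored history using a simple z-score. The daily counts, the metric they stand for, and the threshold of 3 are all assumptions picked for illustration, and a real baseline would likely need to account for seasonality and trend.

```python
from statistics import mean, stdev

# Hypothetical daily login-failure counts kept as a historic baseline.
baseline = [102, 98, 110, 95, 105, 99, 101]


def z_score(value, history):
    """How many sample standard deviations `value` sits from the historic mean."""
    return (value - mean(history)) / stdev(history)


def is_unusual(value, history, threshold=3.0):
    """Flag a value whose distance from the baseline exceeds the threshold."""
    return abs(z_score(value, history)) > threshold


today = 180  # hypothetical current observation
print(is_unusual(today, baseline))
```

Without the retained history there is nothing to compute the mean and spread from, which is the whole argument for keeping the data around.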
The container for your data only matters if the container has special properties that facilitate the tools you’re going to apply, your use case for storing the data, or your use cases for processing/manipulating the data. A big use case in the era of always-on web-based services is special containers designed to allow for rapid manipulation and recall of Very Large amounts of data.
- SIEM architecture – “SIEM” is a product category rather than a description of architecture; different products may have different architectures, and here are a few examples. Typically a SIEM accepts feeds from devices generating logs, and then has functions to consolidate, sort, search, and filter. Here’s how Splunk describes itself:
“Splunk is a distributed, non-relational, semi-structured database with an implicit time dimension. Splunk is not a database in the normative sense …but there are analogs to many of the concepts in the database world.”
Which architecture is the best is a silly question; they are architected differently on purpose. Pick a favorite if you must, but if you work with data, be prepared: you’ll probably not often find yourself in homogeneous environments.
About working with data
No matter where your data is sourced, if you want to do something snazzy like use it to train a neural net, or do a fun outlier analysis, then you’re going to have to spend a great deal of time prepping your data, including cleaning it. Many database architectures claim to make this process easier (I’ve yet to meet an analyst that’s ever described this part of analysis as fun or easy); what’s definitely true is that some data storage formats / practices make it harder.
- If your data is unstructured – like you might find in key-value pair or document stores – you might have significant work to get it into a more structured format, depending on what research methods you are going to use to conduct your analysis.
- Even with relatively structured data you might find that the formatting that works for one purpose needs to be simplified further when you get to the analysis stage.
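Here is a rough sketch of what that prep work can look like for key-value style log lines: parse each line, force the records into a fixed set of columns, normalize one inconsistent field, and write out CSV. The log lines, field names, and cleaning rules are hypothetical, chosen only to show the shape of the work.

```python
import csv
import io
import shlex

# Hypothetical key=value log lines; the field names are made up for illustration.
RAW_LINES = [
    'ts=2013-10-01T12:00:00 user=alice action=login status=ok',
    'ts=2013-10-01T12:00:05 user=bob action=login status=FAIL reason="bad password"',
    'ts=2013-10-01T12:00:09 action=logout user=alice',
]

FIELDS = ["ts", "user", "action", "status", "reason"]


def parse_line(line):
    """Split one key=value line into a dict, honoring quoted values."""
    record = {}
    for token in shlex.split(line):
        if "=" not in token:
            continue  # skip fragments that are not key=value pairs
        key, value = token.split("=", 1)
        record[key] = value
    return record


def to_rows(lines, fields=FIELDS):
    """Normalize records into fixed columns: fill gaps, lowercase status."""
    rows = []
    for line in lines:
        record = parse_line(line)
        row = {name: record.get(name, "") for name in fields}
        row["status"] = row["status"].lower()
        rows.append(row)
    return rows


rows = to_rows(RAW_LINES)

# Write the cleaned rows out as CSV, a format most analysis tools accept.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Even in this toy case there are judgment calls (what to do with missing fields, whether "FAIL" and "fail" are the same thing), and those calls are where the time goes with real data.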
The cooler things we might discover require working with more complex (i.e. less structured) data, which is why advances in manipulation of less structured data, and algorithms that are forgiving of different types of complexity are fun. Sometimes it’s the analytic technique that’s new, sometimes it’s the technology for applying it, but often the “coolness”, or at least the nerdy enthusiasm, is from applying existing techniques & tech to a new data source, OUR data source, to answer OUR question – in a way that hasn’t quite been done before. That’s kind of how research is.
Stop worrying so much about your bowls. Unless the lid is on so tight that you can’t get your fruit salad out.
Recently, I was interviewed for the ActiveState blog on DevOps & Platform as a Service (PaaS); that interview made it to Wired.com (here). A discussion on the topic was timely, as I’ve been thinking about DevOps and other agile delivery chain mechanisms quite a bit lately, mainly as I am applying them in my current gig, which my colleagues and I describe as “Business Ops”. Next month at Nordic Security 2013 I’ll be presenting “Operating * By the Numbers”. (If you’re wondering why there’s no abstract, it’s because I’m still perfecting “Just In Time” deck development…just kidding. Sort of.*)
Anyway, I thought it might be a good idea to explain What I’m Talking About When I Talk About DevOps (apologies to the incomparable Haruki Murakami). This will be my first time trying to explain where I’m going with this whole DevOps thing, so it might get fuzzy. Bear with me. I reserve the right to change my mind later, of course (I’m cognitively agile that way, haha), so if you have comments or criticisms I’m very open to hearing your thoughts.
Connection between DevOps & Risk
DevOps, if you’ve not heard of it before, is a concept/approach to managing large-scale software deployments. It seems to be most popular/effective at software-based or online services, and it is “big” at highly scaled out companies like Google, Etsy, and Netflix. Whether consumer-facing or B2B, these services need to be fast and highly-reliable/available. The DevOps movement is one where deployments and maintenance are simplified (simplicity is easier to maintain than complexity) through standardization and automation, lots of instrumentation & monitoring, and an integration of process across teams (most specifically, Dev, QA & Ops). More on “QA” later.
But…the thing about DevOps is, that while it is a new concept in the world of online services, it draws heavily from Operations Management, which is not new. The field of Operations Research was forged in manufacturing but the core concepts are easily applied across other product development cycles. In fact this extension is largely overdue, since a scan through semi-recent texts on operations management shows IT largely described as an enabling function (e.g. ERP) but not a product class in and of itself. (BTW, in some curriculums, Operations Management is cross-listed or referred to as Decision Science, which is a core component of risk/security analytics.)
Author: George Orwell (i.e. Eric Arthur Blair) (1903-1950)
Challenge status: #9 on Radcliffe Publishing Course Top 100 Novels of the 20th Century and frequent target of banning attempts according to the ALA’s Office for Intellectual Freedom. Book #4 on Summer of Banned Books ’13.
Why: Well, when challenged in Florida in 1981 the reasons given were that the book was “pro-communist and contained explicit sexual matter.”
First line: “It was a bright cold day in April, and the clocks were striking thirteen.”
Synopsis: The foreboding classic view of a future that is now partially here: a totalitarian regime that effectively controls not only the behavior but the very thoughts and memories of its citizens. Winston Smith is not a loyal member of the party: he has questions and doubts that end up pulling him into a theoretical resistance movement and into the arms of a fellow disbeliever (his lover Julia), from both of which he is eventually saved via an active re-education that takes place deep in his heart and within the Ministry of Love (Miniluv).
Thanks to Orwell we now have some amazing vocabulary (thoughtcrime, Big Brother, newspeak, doublethink, unpersons) and concepts (entertainment screens that broadcast while conducting surveillance, mini-helicopters and microphones hidden in plain sight – always collecting data, office workers whose whole function is to “correct” the news to reflect the current truth, party practices destabilizing bonds between family members as a method of distributing policy enforcement, a government that creates tabloids, lotteries, and pornography to keep the proletariat subdued, armies that bomb their own citizens to further the image that the country is at war, politicians that expend all surplus resources as part of useless skirmishes to keep the populace hungry and angry – never really seeking to change balances-of-power between the primary competing nation-states).
Over the last year I’ve started reviewing game theory in more depth, looking for some models I can use to understand system management (vis a vis risk) better. Game theory is one of the more interesting branches of economics for me, but I don’t actually have a great intuition for it yet (I really have to work at absorbing the material). Since it doesn’t come super-naturally to me, I’m particularly proud of the presentation I gave at SOURCE Boston last year: Games We Play: Defenses and Disincentives (description here). Luckily, there is a good video of the presentation, because when I wanted to expand out the presentation a few months later, my notes were totally undecipherable. 🙂
Since I am still a proponent of applied risk analytics (as in my talk at Brucon this year: A Million Mousetraps: Using Big Data and Little Loops to Build Better Defenses (description here)), I’ll never be able to escape behaviorally-driven defenses, but even with the power of big data behind us it feels like we defenders often find ourselves playing the wrong game. I don’t disagree the deck might be stacked against us, but we might be able to at least take control of the game board a little better.
Essentially, I am interested in how we might be able to adjust incentives in order to improve risk reduction, whether from a fraud, security, or general operational dynamics perspective. Fraud reduction typically considers incentives and system design rather vaguely (not in a systematic way, except maybe in the case of authentication), and instead relies almost exclusively on behavioralist approaches (as typified by the complex predictive models launched looking for patterns in real time). I have been wondering for a while if we can “change the game” and get improved results.
A little blog post.
So, it’s been about two years since I added anything to this blog. I’ve been busy!! The awesome folks at SOURCE gave me a speaking slot at SOURCE Boston 2010 and that kicked off a series of talks on methods consumer-facing companies/websites take to protect customers from online threats. And then later in 2010 I was able to participate in some discussions on different types of threat modeling and situations in which modeling techniques can be useful.
In 2011 I wanted to talk about some more concrete topics, and so spent some time researching how threats/impacts can be better measured. This is an area I’d like to spend more time researching, because there’s still a gap between what we can do with the high-frequency/lower-impact events (which seem to be easier to instrument, measure, and predict) and the lower-frequency/high-impact events (which are very difficult to instrument, measure, or predict). –> I think the key is that high-impact events usually represent a series or cascade of smaller failures, but there’s more research into change management and economics to be done.
Later in 2011 I switched over to describing how analytics can be used to study and automate security event detection. I hope in the process I didn’t blind anyone with data science. (haha…where’s that cowbell?) So here’s what I did:
Risk management at a systemic level is complicated enough that many organizations deem it practically impossible. The mistake many risk managers make is to try to identify every potential exposure in the system, every possible scenario that could lead to loss. This is how risk managers go crazy, since not even Kafka can describe every potential possibility. Risk management as a discipline does line up nicely with probability theory, but holistic approaches to risk management deviate from the sister science of insurance.
Insurance presents expected value of specific events taking place: what is the probability this car and this driver will be involved in a collision — and how much will the resulting damage cost to replace/fix? Factors include the age and quality of the car as well as the age and quality of the driver, average distance driven per day, geographic area and traffic conditions. The value of the vehicle is estimated, ranges of collision costs assumed. Flood insurance is similarly specific: what is the probability this property will sustain damage in flood conditions — and how much will it cost to protect/fix the property? Average precipitation, elevation, foundation quality, assessed property value are all factored into the decision.
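The expected-value idea above boils down to probability times cost for one specific event. A minimal sketch of that arithmetic follows; the probability, the repair cost, and the 25% loading are assumed numbers for illustration only, not actuarial figures, and real pricing folds in far more factors than this.

```python
# Hypothetical figures for a single auto policy; none of these come from real data.
collision_probability = 0.05    # assumed chance of a collision this year
average_repair_cost = 8_000.00  # assumed cost to fix/replace after a collision


def expected_loss(probability, cost):
    """Expected value of a single, specific event: probability times cost."""
    return probability * cost


def premium(probability, cost, loading=0.25):
    """Expected loss plus an assumed loading for expenses and margin."""
    return expected_loss(probability, cost) * (1 + loading)


annual_expected_loss = expected_loss(collision_probability, average_repair_cost)
annual_premium = premium(collision_probability, average_repair_cost)

print(f"expected loss: {annual_expected_loss:.2f}")
print(f"premium: {annual_premium:.2f}")
```

The calculation only works because the event, the asset, and the cost range are all pinned down in advance.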
As complicated as actuarial science is, insurance can be written because insurance is specific. Risk management is not specific: it is systemic.
This post is a first in a series I will be exchanging with Ohad Samet (ok, second, he’s a much quicker blogger than I am), one of my esteemed colleagues in Paypal Risk, and the mastermind behind the Fraud Backstage blog. Read Ohad’s article here.
Despite best efforts to protect systems and assets using a defense-in-depth approach, many layers of controls are defeated simply by exploiting access granted to users. Thus the industry is trying to determine not only how we protect our platforms from external threats, but also how we keep user accounts from being attacked. User credentials being the “keys” (haha) guarding valuable access to both user accounts and to our platforms, a popular topic among the security-minded these days centers on alternatives to standard authentication methods. Typically, the discussion centers not around how an enterprise secures its own assets and users, but about arming consumers who come and go across ISPs, search sites, online banking, social networks…and are vulnerable to identity theft and privacy invasions wherever they roam.
How many information security professionals does it take to keep a secret?
While there are a number of alternatives out there, focusing on authentication as if it’s a silver bullet misses the point. When we assume that keeping our users secure means protecting (only, or above all other things) the shared secret between us, it leaves us over-reliant on simple access control (the fortress mentality) when as an industry we already know that coordinating layers of protection working together is a more effective model for managing risk. To clarify our exposure to this single point of failure, let’s consider:
2) How does our risk model change if we assume all credentials have been compromised?
Shall We Play a Game…of Twenty Questions?
Really all this nonsense started when we started teaching users to use “items that identify us” as “items that authenticate us”. Two examples, SSN and credit card numbers. SSN we know has been used by employers, banks, credit reporting agencies…as well as for its original purpose, to identify participation in social security (this legislation being considered in Georgia may limit use of SSN and DOB as *usernames* or *identifiers*, although it is silent on using SSN/DOB to verify/authenticate identity).
VLAB (web, twitter), the MIT/Stanford Venture Laboratory, hosted a session 1/18 at Stanford on “Data Exhaust Alchemy – Turning the Web’s Waste into Solid Gold”.
I’d never heard the term data exhaust (or digital exhaust, thank you wikipedia), but it’s a handy idea. The proliferation of social media and the internet transformed “media” (generally media is considered a one-way push system) into “social” (the personalization and intimacy of narrowcast, or at least a many-to-many set of connections). Everyone set on broadcast, and all broadcasts stored in perpetuity. If you like data (and I do) and you are interested in how people act, interact, and think (and who isn’t), the idea that all those feeds and updates and comments would be going to waste is, well, heartbreaking.
Here’s the panel:
- Roger Magoulas (http://www.oreillynet.com/pub/au/2717), O’Reilly Media, big data http://radar.oreilly.com/2010/01/roger-magoulas-on-big-data.html, geolocation, former Sybase research team
- JB (Mike John-Baptiste), PeerSet, former JumpTV
- Dr. DJ Patil, LinkedIn, former eBay Analytics/Architecture lead, social network analysis, weather complexity research, former AAAS DoD Science & Tech Policy Fellow, Voldemort, Hadoop geek
- Jeff Hammerbacher, Cloudera, former Facebook data team lead (Hive, Hadoop, Cassandra)
- Mark Breier, In-Q-Tel, Blogger (http://10secondtips.blogspot.com)
- Pablos Holman, Intellectual Ventures Laboratory (http://en.wikipedia.org/wiki/Intellectual_Ventures), Komposite, “notorious hacker”
The basic premise of the panel appealed to the entrepreneurial marketer in all web 2.0 hopefuls — consumers are telling you their preferences (loves, hopes, fears, feedback) faster than can be processed, and companies that analyze and parse the data the best will cash in by designing the best promotions and products for the hottest, most profitable market segments. In addition to the new faddish feeds of facebook, twitter, foursquare, and blippy, some classic sources of consumer preference were mentioned: credit card data, point-of-sale purchase (hello grocery store shopper’s club), and government-held spending data. Everyone knows there’s gold in dem dere hills — like mashing up maps and apartments for sale, or creating opt-in rewards programs so you can figure out not just the best offer for Sally Ann, but also for all of her friends.
Philosophically, the reason we’ve been stuck with an advertising model (broadcast brand-centric messages rather than customer-centric, tailored promotions) is because we’ve never had a mechanism for scaling up a business to consumer communication — that’s tuned to the consumer. Now with the Web and current delivery mechanisms (inbox, feed, banner, search placement, etc.) not only can companies talk to their consumers — the consumers will engage in conversation. The better you know your customers, the better your ability to reach them (promote) and meet their needs (product). As long as you know how to interpret the conversation correctly.
“What people say has no correlation whatsoever with real life, but what they do has every correlation with real life…” Mark Breier, In-Q-Tel
Quickly the conversation moved into some of the techier topics: while there’s a promise of reward for the cleverest analysts, there are some tricky issues too. For one thing, processing exhaust data in a way that makes sense, at scale, requires some serious processing horsepower and an architecture that can accommodate “big data”. While less time was spent on the technical issues than I expected (with DJ from LinkedIn and Jeff from Cloudera on the panel, a full-on Hadoop immersion was a distinct possibility), time was spent both on the computer science research in the area (machine learning, natural language processing) and on the (more interesting to me) current thinking on the most appropriate analytic methods. More time on sentiment analysis and entity extraction would have been great, but would probably require follow-up sessions.
Magoulas from ORA did a nice job outlining some of the issues consumers are having with social marketing: besides basic privacy and data ownership questions (who owns the pics you’ve uploaded to FB? And how should filters be set up for consumers to remain in control of information they want to keep semi-private?), there are issues emerging around security/spoofing/hacking, meaning how much public information is available for consumers to be targeted. (Marketers can guess your age and who your friends are; more nefarious entities can try the same techniques to learn more private details.) My favorite issue described was the “creepiness factor” (which I can definitely relate to): the idea that consumers may not react happily to services that know them *too* well.
Ultimately, analytic marketing is here to stay. Since consumers create data and leave digital detritus wherever they surf, leveraging that data will continue to pay dividends as service providers nimbly design customer-centric offers and products. Still, concerns about how personal data (whether private or public) is used by unknown intermediaries could either drive withdrawal from social media (if consumers opt to lock down their profiles and share only in private circles) or, a more hopeful thought, drive further innovation in this space. Look for savvy organizations to weave opt-in, consumer-driven agent technology with the power to aggregate (and segment) the behavior and preferences of wide communities of people.