Big-SIEM Learning Machinations
I was distracted earlier this week by a thread on the SIRA mailing list. I found myself reacting to a comment that suggested maybe quantitative risk management is “just” plain ol’ SIEMs plus some stats/machine learning. That turned out to be a bit of a hot button for a few folks on the list, and it set off a very interesting discussion about data architecture options versus how common security-industry-tuned tools work, which is worth a whole dedicated discussion. In any case it put me into a contemplative mood about SIEMs, since I am of two minds about them depending on what environment I’m working in: it’s the “any port in a storm” vs “when you have a hammer everything looks like a nail” thing. But regarding SIEM vs databases, or anomaly detection vs ML, or whatever:
- While acknowledging that apples and pears are both fruit, some people prefer to cut their fruit (agnostic to apple-ness or pear-ness) with very sharp ceramic knives vs, say, good ol’ paring knives, depending on the dish being prepared.
- That said, the bowl you put fruit salad into may need to be different (waterproof, airtight, bigger) than a bowl one puts whole fruits in.
- Also, in an even more Zen-like tangent: no matter what bowl or what fruit or what knife is being selected, if you’re making fruit salad you’re going to have to spend some time cleaning the fruit before cutting and mixing it. If the bowl the whole fruits were in is especially dirty, or say, a crate – or a rusty bucket – you may want to spend more time cleaning.
I was going for something Zen.
But I’m not very Zen, I’m pedantic, so here’s some explanation of the analogy:
Apples & Pears are both fruit
- System logs are data that is usually stored in logfiles. Security devices generate system logs, and so do other devices. Errors are often logged, or system usage/capacity. Servers, clients, applications, routers, switches, firewalls, anti-virus systems — all kinds of systems generate logs.
- Financial records, human resource records, customer relationship management records are data that are usually stored in databases. Some may be generic databases, others may be built specifically for the application in question.
- There are also data types that are kind of a cross between the two, for example – a large consumer-facing website may have account data. You are a customer, you can log in and see information associated with your account – if it’s an email service, previous emails. If it’s an e-commerce site, maybe you can see previous transactions. You can check to make sure your alma mater or favorite funny kitten gif is listed correctly on your account profile. It’s not system logs, and it’s not internal corporate records – it’s data that’s part of the service/application. This type of data is usually stored in a database, though there might be metadata associated with the activity stored in logs.
- In another mood, I might delve further into this criss-cross category, which often results in a “you’ve got your chocolate in my peanut butter…you’ve got your peanut butter in my chocolate” level of fisticuffs.
- But, it’s all DATA.
People have different tool preferences when it comes to cutting fruit
Some capabilities of data-related tools:
- Comparing across tables
- Pattern analysis / visualization
- Frequency analysis
- Simple mathematical operations (addition, subtraction, ranking)
- More advanced mathematical operations (exponential functions, regressions, statistical tests, quantile analysis)
- Sentiment analysis or text/string mining
- Blah blah etcer-blah
Basic capabilities tend to be common, or directly comparable, across tools. For example, here’s an article that compares some of the commands that can be used in a traditional SQL database to similar functions in Splunk, a popular SIEM.
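To make the "frequency analysis" item above a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `events` table, its columns, and the sample rows are all made up for illustration, and the Splunk search in the comment is only a rough analog, not something taken from the article mentioned above.

```python
import sqlite3

# Hypothetical sample data: (source IP, action) pairs from a firewall log.
SAMPLE_EVENTS = [
    ("10.0.0.1", "deny"),
    ("10.0.0.2", "allow"),
    ("10.0.0.1", "deny"),
    ("10.0.0.3", "deny"),
    ("10.0.0.1", "allow"),
]


def count_by_source(rows):
    """Return (src_ip, count) pairs, most frequent first."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (src_ip TEXT, action TEXT)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        # In Splunk's search language the rough analog would be:
        #   ... | stats count by src_ip | sort -count
        cursor = conn.execute(
            "SELECT src_ip, COUNT(*) AS n FROM events "
            "GROUP BY src_ip ORDER BY n DESC, src_ip ASC"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for src_ip, n in count_by_source(SAMPLE_EVENTS):
        print(src_ip, n)
```

The same question ("who shows up most often?") gets asked the same way whether the rows live in a relational table or a log index; only the syntax changes.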
The point is, while many tools have many of the desired features, there may be tradeoffs. A product might make it really easy to conduct filtering (via an awesome GUI and pseudocode) and still have limitations when it comes to extracting a set of events across multiple tables that meets ad hoc-developed, but still quite technically specific, criteria. Or, a tool might excel in rapid access to recent records, but crash if there’s a long-term historical trend to analyze. Or, it can be a gem if you’re trying to do some statistical analysis of phenomena but too resource intensive to be used in a production environment.
People have different use cases for cutting fruit
- In some cases data is kept only to diagnose and resolve a problem later
- In some cases data is kept in order to satisfy retention requirements in case someone else wants to diagnose/confirm an event later
- In some cases data is kept because we’re trying to populate a historic baseline so that in the future we have something against which to compare current data
- In some cases data is kept so that we can analyze it and predict future activity/behavior/usage
- In some cases data is kept because it is part of the service / product being supported
Ops is different from Marketing. Statisticians are not often the same people doing system maintenance on a network. Etc.
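The "historic baseline" use case in the list above can be sketched in a few lines. This is one simple way to compare a current value against stored history, assuming a z-score style check with an arbitrary threshold of 3; the function name, the threshold, and the sample numbers are all illustrative assumptions, and real baselining usually needs more care (seasonality, skewed data, and so on).

```python
import statistics


def is_unusual(baseline, current, threshold=3.0):
    """Flag `current` if it sits more than `threshold` sample standard
    deviations away from the mean of the historic baseline."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline observations")
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # A perfectly flat baseline: any different value is unusual.
        return current != mean
    return abs(current - mean) / stdev > threshold


if __name__ == "__main__":
    # Hypothetical daily counts of failed logins over the past week.
    failed_logins = [12, 15, 11, 14, 13, 12, 16]
    print(is_unusual(failed_logins, 14))
    print(is_unusual(failed_logins, 40))
```

Note that this only works if someone kept the old data around, which is the whole point of that use case.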
The container for your data only matters if the container has special properties that facilitate the tools you’re going to apply, your use case for storing the data, or your use cases for processing/manipulating the data. A big use case in the era of always-on web-based services is special containers designed to allow for rapid manipulation and recall of Very Large amounts of data.
- SIEM architecture – “SIEM” is a product category rather than a description of an architecture, and different products may have different architectures; here are a few examples. Typically a SIEM accepts feeds from devices generating logs, and then has functions to consolidate, sort, search, and filter. Here’s how Splunk describes itself:
“Splunk is a distributed, non-relational, semi-structured database with an implicit time dimension. Splunk is not a database in the normative sense …but there are analogs to many of the concepts in the database world.”
Which architecture is the best is a silly question; they are architected differently on purpose. Pick a favorite if you must, but if you work with data, be prepared: you’ll probably not often find yourself in homogeneous environments.
About working with data
No matter where your data is sourced, if you want to do something snazzy like use it to train a neural net, or do a fun outlier analysis, then you’re going to have to spend a great deal of time prepping your data, including cleaning it. Many database architectures claim to make this process easier (I’ve yet to meet an analyst who has ever described this part of analysis as fun or easy); what’s definitely true is that some data storage formats / practices make it harder.
- If your data is unstructured – like you might find in key-value pair or document stores – you might have significant work to get it into a more structured format, depending on what research methods you are going to use to conduct your analysis.
- Even with relatively structured data you might find that the formatting suits one purpose, but when you get to the analysis stage you need to simplify further.
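As a small taste of that prep work, here is a minimal sketch that turns key=value style log lines into rows with a fixed set of columns. The field names and sample lines are invented, and the parser assumes well-formed quoting (an unmatched quote in a line will raise an error), so treat it as an illustration of the chore rather than a robust cleaner.

```python
import shlex

# Hypothetical log lines; note the two lines do not share the same fields.
SAMPLE_LINES = [
    "user=alice action=login src=10.0.0.1",
    'user=bob action=login msg="bad password"',
]


def parse_kv_line(line):
    """Turn 'k1=v1 k2="v 2"' into a dict; tokens without '=' are skipped."""
    record = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            record[key] = value
    return record


def to_table(lines):
    """Return (columns, rows) with one column per key seen anywhere.

    Fields missing from a given line are filled with an empty string.
    """
    records = [parse_kv_line(line) for line in lines]
    columns = sorted({key for record in records for key in record})
    rows = [[record.get(col, "") for col in columns] for record in records]
    return columns, rows


if __name__ == "__main__":
    columns, rows = to_table(SAMPLE_LINES)
    print(columns)
    for row in rows:
        print(row)
```

Even in this toy case a decision sneaks in: what to do with fields that some records have and others don't. Real data has many more of those.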
The cooler things we might discover require working with more complex (i.e. less structured) data, which is why advances in manipulation of less structured data, and algorithms that are forgiving of different types of complexity are fun. Sometimes it’s the analytic technique that’s new, sometimes it’s the technology for applying it, but often the “coolness”, or at least the nerdy enthusiasm, is from applying existing techniques & tech to a new data source, OUR data source, to answer OUR question – in a way that hasn’t quite been done before. That’s kind of how research is.
Stop worrying so much about your bowls. Unless the lid is on so tight that you can’t get your fruit salad out.