One that bit us was AWS firewall config. The general approach to avoiding this is to automate the deployment of such rules, but errors can happen in automation too, and it consumed a lot of review effort.
The AWS firewall interface is possibly one of the worst packet-filter-style interfaces I've ever seen, and I've been involved in selling firewalls since circa 1995 (I avoided Cisco, as you can tell ;).
Each rule set is simple in itself, but they nest, and people had already created premium visualisation tools before we hit an issue, so presumably we weren't a unique snowflake.
More generally, the AWS approach to permissions is similarly awkward. It looks like a tool built for in-house use at Amazon that escaped and founded the largest cloud platform on the planet ;)
Amazon created an AI system to scan for errors, and a service that removes unused permissions, but I think their difficulty in visualising permissions also explains the ubiquitous open S3 bucket issue. We also hit issues inheriting permissions set by AWS themselves: AWS made a preset group of permissions too generous, and we chose what looked like the correct preset. I fear AWS are mostly stuck with it.
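To illustrate the open-bucket pattern: the classic mistake is an Allow statement whose Principal is "*", which grants access to everyone including anonymous users. A minimal sketch of a lint for that, operating on a bucket policy document (the policy below is a hypothetical example, not a real bucket's):

```python
import json

def find_public_statements(policy_doc: str) -> list:
    """Return statements in an S3 bucket policy that allow anonymous access."""
    policy = json.loads(policy_doc)
    flagged = []
    for stmt in policy.get("Statement", []):
        principal = stmt.get("Principal")
        # Principal "*" or {"AWS": "*"} means anyone, including anonymous users
        is_public = principal == "*" or (
            isinstance(principal, dict) and principal.get("AWS") == "*"
        )
        if stmt.get("Effect") == "Allow" and is_public:
            flagged.append(stmt)
    return flagged

# Hypothetical policy resembling the classic open-bucket mistake
policy = '''{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Principal": "*",
     "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::example-bucket/*"}
  ]
}'''
print(len(find_public_statements(policy)))  # 1 statement grants public read
```

Real policies have more Principal shapes (lists, service principals, conditions), so this is only the skeleton of the check, but it is exactly the shape of rule an automated scanner applies.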
You hear a lot about people leaving things running, and so on, which has a direct budgetary impact as well as increasing attack surface. But AWS and other cloud providers offer tools to mitigate this, and if they are configured correctly, leaving something running a little too long should have modest impact on anything but the budget.
So I think the biggest problems we saw all descended from the AWS approach to describing and visualising permissions of different sorts. In each case they assign unique but otherwise meaningless identifiers, all visually similar, then present these in a complex markup that describes the permissions. There are tools to improve visualisation and find errors, but I've used other cloud services that present these aspects much more clearly, although those tools were often describing much simpler configurations. That said, 99% of most needs are probably covered by "simple", and handling the simple case better is probably an achievable goal for Amazon, whereas rewriting the underlying architecture is probably disproportionate to the problem.
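The "unique but otherwise meaningless identifiers" point is concrete: security-group rules reference other groups by opaque IDs like sg-0a1b2c3d, which is what makes them hard to review by eye. A tiny resolver that swaps IDs for names makes a rule set far easier to audit; the data shapes below are simplified, hypothetical stand-ins for the real AWS API output:

```python
GROUP_NAMES = {            # hypothetical id -> name mapping
    "sg-0a1b2c3d": "web-tier",
    "sg-4e5f6a7b": "app-tier",
    "sg-8c9d0e1f": "db-tier",
}

RULES = [                  # hypothetical ingress rules referencing other groups
    {"group": "sg-8c9d0e1f", "port": 5432, "source": "sg-4e5f6a7b"},
    {"group": "sg-4e5f6a7b", "port": 8080, "source": "sg-0a1b2c3d"},
]

def describe(rule):
    """Render one rule with human-readable names instead of opaque IDs."""
    return (f"{GROUP_NAMES.get(rule['group'], rule['group'])} "
            f"allows port {rule['port']} from "
            f"{GROUP_NAMES.get(rule['source'], rule['source'])}")

for r in RULES:
    print(describe(r))
# db-tier allows port 5432 from app-tier
# app-tier allows port 8080 from web-tier
```

This is essentially what the third-party visualisation tools do, just with boxes and arrows instead of strings.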
We improved this with more automated deployments, so at least we only needed to review deployment scripts, and very little manual configuration needed review.
We also found basic security misconfigurations in commonly shared base containers (think Docker et al., not so much AWS images, although I assume the same may apply there, especially with third-party images), which were deployed by anyone who built on that container. So these are common misconfigurations, although we never actually got bitten by any of them: things like guest OSes running services as root, which eases host escape if there is an RCE. We addressed these with review, plus automated redeployment so that fixes to base images are picked up and deployed promptly, but we were still doing manual review of container configuration (automated review tools are now around). On the upside, the issues we found and got fixed presumably helped everyone building on those containers.
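The running-as-root case is also easy to lint mechanically: Docker defaults to root when a Dockerfile has no USER directive, and the last USER in the file determines the runtime user. A minimal sketch of that check (the Dockerfiles here are made-up examples):

```python
def runs_as_root(dockerfile_text: str) -> bool:
    """True if the image would run its main process as root:
    no USER directive at all, or the last USER is root/0."""
    user = "root"  # Docker's default when no USER directive is present
    for line in dockerfile_text.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("USER "):
            user = stripped.split(None, 1)[1].strip()
    return user in ("root", "0")

# Hypothetical base images: one missing USER, one dropping privileges
bad = 'FROM debian:stable\nRUN apt-get update\nCMD ["myservice"]'
good = bad + "\nUSER svc"
print(runs_as_root(bad), runs_as_root(good))  # True False
```

Current automated container scanners do this and much more (setuid binaries, stale packages, leaked credentials), but this one check would have caught the class of issue we kept finding.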
Simon,

I think it's a case of using a tool that fits your use case. In simpler cases a VPS, or a highly simplified control plane with a visually intuitive UX, is called for. For those of us managing 1K+ node systems, it would be practically impossible to manage things with manual visual UX tools. So the unruly UUID/YAML/JSON mess is just part of the job, and Python and Go are our "UX".
AWS did start from in-house experience running a highly complex and large system. Over the years they have added (or acquired) more small-scale options where the drudgery is removed; I think I saw just last week that they now have a visual VPS offering.
(The same was true for Cisco: pushing config changes to 5, or even 50, firewalls was one thing, but scaling up to 1000 required code, not UI.)

Elisa,

I would say that if you are using infrastructure as code, the top misconfiguration is invariably the one you haven't tested. Infrastructure is now like any other software: there needs to be a clear (ideally machine-readable) spec, unit tests, functional tests, and end-to-end integration tests, including access control tests and things like IAM Analyzer and other formal verification tools. We also use a graph database and rules engine to retest several thousand configurations continuously. We don't assume we got it all right on every change (change happens dozens of times per hour); we assume we got it completely *wrong* every time, and need to continuously verify and prove we got it right by way of evidence.

But to get off the soapbox and answer your question directly: the thing we lint and find most frequently abused is granting * in IAM rules (and other RBAC) rather than using the specific granular permissions actually required. We are trying to get away from ANY manual IAM provisioning and ensure all policies are generated by machine. Until then, we find that to be the most often abused feature.
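The wildcard lint mentioned above can be sketched in a few lines: walk the Allow statements of a policy document and flag any Action or Resource that is a bare `*` or a service-wide `service:*`. This is a simplified version of what such a linter does (the policy below is a hypothetical over-broad example):

```python
import json

def wildcard_grants(policy_doc: str) -> list:
    """Return (action, resource) pairs from Allow statements that use '*'
    where a granular permission should be."""
    policy = json.loads(policy_doc)
    flagged = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        # Action and Resource may each be a string or a list of strings
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = stmt.get("Resource", [])
        resources = [resources] if isinstance(resources, str) else resources
        for a in actions:
            for r in resources:
                if a == "*" or a.endswith(":*") or r == "*":
                    flagged.append((a, r))
    return flagged

# Hypothetical over-broad policy of the kind the lint catches
policy = '''{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": "s3:*", "Resource": "*"}
  ]
}'''
print(wildcard_grants(policy))  # [('s3:*', '*')]
```

A real implementation also handles NotAction/NotResource and partial wildcards like `s3:Get*`, but flagging `*` and `service:*` alone catches the bulk of what we see people grant by hand.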