Avoiding Pitfalls – Assessment of a Modern System in the Cloud
It all started rather innocently: an acquaintance of a CVP employee asked if we could look at the mobile shopping app they were having an offshore team develop. We’ve been developing systems and assessing IT security for over a decade and we thought this would be a routine exercise.
As we started diving in, we found some concerning decisions, and then increasingly grave problems. The leadership of the startup was impressed enough with our analysis that they requested CVP take on the project to address the problems by reimplementing large parts of the system.
The rest of this article lists some of the issues we encountered while assessing whether the system was built and architected in a modern fashion. We hope this list can serve as a brief checklist of what not to do, and that the explanations help your systems avoid the same issues.
- No encryption at rest. Names and phone numbers saved in the app were not encrypted. Data at rest should be encrypted wherever it is reasonable to do so, and always where it is required, as with personally identifiable information.
- No encryption in transit. Without encryption of your web traffic, anyone on the network path between the client and the server can watch what you’re doing and what data you send. In the past few years, most websites and applications have moved to default encryption for all communications. Tools like Let’s Encrypt make this easy for websites, and services like AWS API Gateway only allow HTTPS.
- Passwords stored in the clear. Passwords should never be stored in the clear, and if possible, not stored at all. Ideally, they should be hashed and salted so losing them does not expose the customer’s original password. As you can see in the linked Wikipedia article, this idea was originally spread in the 1970s.
- Passwords transmitted in the clear as a query parameter in an HTTP GET. As previously mentioned, you should use encryption in transit, especially when sending a password. Sending a password in the clear via a URL is especially bad. It is extremely easy, using a variety of tools on a local PC or wireless network, to see what URLs someone is visiting, and URLs are also routinely recorded in browser histories and server logs. If the password is in the URL, it should be considered compromised and not used again.
- Not practicing the principle of least privilege. In this system, the back-end of the application was built in AWS. All services, including remote SSH and database access, had no firewall or IP range limitations in place. In general, users and systems should only have access to the minimum privileges required. Access control and firewalls should block everything that isn’t necessary.
- MongoDB running in highly insecure default mode. Out of the box, MongoDB starts up with no username and password for the admin user. Coupled with the lack of any firewalls in front of the server, this was very bad and meant it was only a matter of time before someone compromised the entire database. If you are using MongoDB, make sure to harden it by securing the administrative account right after initial setup.
- No data quality. NoSQL DBs like Mongo make it very easy, though still not a good idea, to store inconsistent or low-quality data. Collections (tables) in MongoDB can have flexible fields per record, but the types and attributes should not change from record to record without good reason. If they do, you are exchanging a problem in the DB for a problem in your application.
- Data cross-referenced via keys that were subject to change. Part of the system tracked inventory, and had the equivalent of foreign keys. Instead of using unique IDs, data was cross-referenced via smart keys (really, names) that were subject to change, and nothing updated the records that referenced a key when it changed. This led to orphaned or out-of-date data in the system that frustrated users and administrators. It is important to choose the right tool for the job: don’t use NoSQL DBs with weak consistency and controls when you need highly reliable data.
- Hard-coded data in code. Reference data in the application that was subject to change was hard-coded into the software. Use a database when you are supposed to!
- Unnecessary data in a production database. We found what turned out to be tutorial collections in the production DB. Make sure your important databases only have data you need. Ideally, use a tool like Liquibase or Flyway to exercise version control on the database.
- Python’s SimpleHTTPServer for anything but basic development. This test server was serving dynamic content for a web application, responding to API requests from a mobile app, and hosting static content, all in production. Proper setup of a web server is a book unto itself, but you should normally start by having something like nginx in front of your application server.
- Repeating the same code multiple times. In several places, we found a function to perform an action (e.g. delete) on an item from a collection for one ID and then a totally separate copy to perform the same action on multiple items. Practice code reuse and modular programming so that you are more productive and not introducing the same bug multiple times.
- No unit tests, performance testing, or handling of different conditions. Several of the functions were obviously only tested with one or two sample records. For example, pagination was only implemented from a UI perspective. When a user tried to browse a table, the back-end returned all records to the front-end, regardless of whether there were 10 records or 10,000. Unsurprisingly, the browser would freeze. Practice test-driven development and make sure you test your application off the happy path.
- Monolithic, hand-built single instance. One virtual server ran the Python web server, hosted static content, ran MongoDB, and served TensorFlow predictions. This is a lot for one server to do. It has a high probability of being difficult to patch, and has a large surface area to secure. A modern architecture with microservices and containers tries to break systems down into manageable modules and administers those separately.
- No scaling. All parts of the system ran on one server, with no provision for scaling up or out. This was obviously never tested as multiple requests at the same time caused the system to freeze or crash, impacting all the other parts as well. For a complex web application like this, you should generally have a separate host for the database, the web application server, and the static content, with plans for how each of those can handle additional traffic.
- No backup. There was no automated backup, continuity of operations (COOP) plan, or documentation to that effect. The proper handling of this is also a book by itself, but you should decide your recovery time objective (RTO) and recovery point objective (RPO) and then work from there to establish your backup strategy.
- No high availability. Similar to the problem with backups, there was no plan for how to address a failure of one component and get it functioning again. A well-designed system should have no single points of failure. For any service, there should be at least one replica of it that is automatically made available in case of a problem.
- Sharing one administrative account and password. The three people working on the system logged into the system using a shared private key which they previously emailed around. In general, users should not share accounts and should minimize use of administrative accounts that can perform dangerous functions.
- Accessing a system only by an IP address that changes. The mobile application was hard-coded to use the public IP of the EC2 instance, which was launched without an Elastic IP. If the hardware ever had a problem, or there was an accidental restart or shutdown of the server, someone would have needed to re-deploy the app in the iTunes App Store. This alone might have caused multiple days of downtime. This is why DNS was invented: so you can change the underlying IP without affecting other applications.
- Failing to use termination protection in AWS. The server hosting nearly all features of the back-end had no termination protection in place, meaning a few misclicks could have brought the whole system down. Enabling termination protection on important servers makes it slightly more difficult to delete them by accident.
- Nothing came online automatically. Software services were not configured to start automatically after an instance restart, requiring someone to SSH in to start multiple applications. Where at all possible, you should configure your systems so they start up gracefully in case of an unexpected restart.
- Poorly designed and non-standard RESTful APIs. The APIs were a poorly named mess with inconsistent terms and overlapping function. Sites like this one walk through how to design a modern RESTful API.
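
To make the password storage item above concrete, here is a minimal sketch of salting and hashing using only Python’s standard library. It is not the code we deployed; the function names and the iteration count are our own illustrative assumptions, and a production system may prefer a dedicated password hashing library.

```python
import hashlib
import hmac
import os

# Assumed work factor for illustration only; tune it for your own hardware.
ITERATIONS = 200_000


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) to store; the plaintext password is never stored."""
    salt = os.urandom(16)  # a fresh random salt for every password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return hmac.compare_digest(candidate, expected)
```

Because each password gets its own salt, two customers with the same password end up with different stored digests.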
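
For the item about passwords in query parameters, the sketch below shows one way a client might build a login request so the password travels in the body of an HTTPS POST instead of the URL. The `/login` path, the field names, and the `build_login_request` helper are hypothetical; they are not the API of the system we assessed.

```python
import urllib.parse
import urllib.request


def build_login_request(
    base_url: str, username: str, password: str
) -> urllib.request.Request:
    """Build (but do not send) a login request with credentials in the body."""
    if not base_url.startswith("https://"):
        raise ValueError("refusing to send credentials over a non-HTTPS URL")
    # Credentials go in the form-encoded body, never in the query string.
    body = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/login",  # hypothetical endpoint
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

The resulting URL contains no secrets, so it is safe for it to appear in logs and browser history.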
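
The duplicated single-item and multi-item functions mentioned above can usually be collapsed so that one is a thin wrapper around the other. This is a sketch only: a plain dictionary keyed by ID stands in for the real collection, and both function names are hypothetical.

```python
from typing import Any, Dict, Hashable, Iterable


def delete_items(collection: Dict[Hashable, Any], item_ids: Iterable[Hashable]) -> int:
    """Delete every listed ID that exists and return how many were removed."""
    removed = 0
    for item_id in set(item_ids):  # de-duplicate so each ID is counted once
        if item_id in collection:
            del collection[item_id]
            removed += 1
    return removed


def delete_item(collection: Dict[Hashable, Any], item_id: Hashable) -> bool:
    """Single-item delete, implemented by reusing the multi-item version."""
    return delete_items(collection, [item_id]) == 1
```

With this shape, a bug fix or a new rule (for example, an audit log entry on delete) only has to be written once.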
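
Finally, for the pagination problem described above, a back-end can return one bounded page at a time. The sketch below works on an in-memory sequence to stay self-contained; `paginate` and `MAX_PAGE_SIZE` are names we made up for illustration. With a real database you would push the offset and limit into the query itself rather than loading every record first.

```python
from typing import Any, Dict, Sequence

# Assumed upper bound so a client cannot request an unbounded page.
MAX_PAGE_SIZE = 100


def paginate(records: Sequence[Any], page: int, page_size: int) -> Dict[str, Any]:
    """Return one page of records plus the metadata a UI needs to page further."""
    if page < 1:
        raise ValueError("page must be >= 1")
    size = max(1, min(page_size, MAX_PAGE_SIZE))  # clamp to a sane range
    start = (page - 1) * size
    total = len(records)
    return {
        "items": list(records[start:start + size]),
        "page": page,
        "page_size": size,
        "total": total,
        "has_next": start + size < total,
    }
```

Testing this with zero records, one partial page, and 10,000 records covers the conditions that the original implementation never exercised.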