Nuxt HN | Ask HN: How much traffic do you serve and with which database engine?

Ask HN: How much traffic do you serve and with which database engine?

It's common to see here that Postgres hosted in RDS can handle 99% of workloads up to millions of users. I'm building an IoT app with a plan to ingest the IoT traffic into dynamo partitioned on user id (I'm quite familiar with the tradeoffs) and everything else be in Postgres. A few services but not microservice (basically: core service, identity service, IoT data service, notification service). Ingesting and monitoring about 1,000,000 IoT devices daily (1 packer per device per day) and about 1,000,000 users with only 5,000 active users per day (basically we monitor user IoT devices 24/7 but only some 5,000 users will have anomalous results and log in).

In the database posts & discussions here I sometimes find that the opinions are strong but the numbers are missing. Obviously applications have wide variation in traffic and query complexity so apples to apples comparisons are hard. Still, I would greatly benefit from hearing some real world experiences with numbers.

Rough approximation database questions for current or prior applications:

1. How many customers do you have?

2. What's expected daily traffic? Peak traffic?

3. What database engine or engines do you use?

4. How many rows or how much storage does your db have?

5. What else about your application is relevant for database load?

6. Microservice, Service, or monolith. Happy with it?

73 points | by ajhool 226 days ago

$ units -1v '1M req/day' 'req/sec' 1M req/day = 11.574074 req/sec $ units -1v '1M req/day * 8kbyte/req' 'kbyte/sec' # Incoming bandwidth 1M req/day * 8kbyte/req = 92.592593 kbyte/sec $ units -1v '1M req/day * 1kbyte/req' 'years/TB' # Storage reciprocal conversion 1 / (1M req/day * 1kbyte/req) = 2.7379093 years/TB

$ units -1v '5000 req/1 hour' 'req/sec' 5000 req/1 hour = 1.3888889 req/sec $ units -1v '5000 req/1 hour * 90 kbyte/req' 'MB/sec' # Outgoing bandwidth 5000 req/1 hour * 90 kbyte/req = 0.125 MB/sec

10 comments

jedberg 226 days ago
I think you might be asking the wrong questions. They key questions are Queries per second and the median response size of the query.
For example at reddit (15 years ago) we had 10x more vote traffic than comment traffic, but we only needed two databases to handle votes (technically only one the other was just for redundancy).
But we needed nine comments databases. Mainly because the median query response was so much bigger for comments.
[-]
- jascination 226 days ago
  Unrelated: I love HN; a random database question and fkn jedberg is one of the first responders
  [-]
  - chistev 226 days ago
    Who is he?
    [-]
    - mayli 226 days ago
      jedberg.net https://www.jedberg.net I was the first (paid) employee of reddit, and currently run Site Reliability at Netflix.
    - RandomBacon 226 days ago
      Apparently the first paid employee of reddit https://www.jedberg.net
    - ratg13 225 days ago
      You can click on HN usernames to read their profile.
  - moralestapia 226 days ago
    jascinating!
- cogman10 226 days ago
  I'm sure latency matters a lot as well.
  The users of our apps are pretty tolerant of 5 to 10 minute request times for some of our pages, which means we've been able to get away with just a few servers for several TBs of data stored and served. (100+mb responses are not unusual for us).
  If we had to rethink and redesign the system to cut down those times, we'd need a lot more databases and a much cleverer storage strategy than we currently have.
  While I'm sure response time for Reddit is really important, I could imagine that an IOT serving system needs almost nothing in to hit something like a 10 to 20 second response time for 5k devices.
  [-]
  - gnz11 225 days ago
    Your users are tolerant of 5-10 minute page request times? Is that a typo?
    [-]
    - cogman10 225 days ago
      Nope. I'm very fortunate
      [-]
- ajhool 226 days ago
  Thank you very much! Did you start with separate databases or wait until you needed to scale workloads separately before breaking things up?
  [-]
  - chistev 226 days ago
    I imagine they didn't optimize prematurely.
    [-]
    - dpkirchner 225 days ago
      Not to denigrate jedberg but a lot of folks optimized prematurely before realizing they shouldn't. (I've certainly been guilty of it.)
mattmanser 226 days ago
5,000 users PER DAY is trivial sauce, you're totally worrying about something ridiculous. Even a crap server with crap code should handle that.
BTW most databases on a decent server could totally handle that 1 million IoT updates per day too. 1 packet per day is nothing. Unless they all come at once. That is also a fairly trivial load, if it's spread out. A small VM could handle that.
You are way off on your understanding of what is a heavy load.
You could load test with something like k6 if you want to find out. Try 'emulating' the requests and average users.
I often test with 5,000 requests per second, 5,000 users per day with 20-30 requests each is several orders of magnitude less load.
iot_throwaway 226 days ago
I’m not going to pile on with “these are bad questions”.
We serve a few hundred concurrently online users of our web dashboard which visualises data from our ~100k online IoT device fleet.
With respect to our time series systems:
We store around 1M fields (kv pairs) per second, around 10k rows per second i.e. almost a trillion rows per day and almost a hundred trillion new data points per day.
We compress this down to around 1TB of data per day using Clickhouse and store and make available the last 30 days of data for “online analysis”.
We also store all the data we’ve ever ingested in BigQuery for M/L purposes. I shudder to think of how much this costs.
Our queries are all pretty simple and fast, just scanning and aggregating with filtering, all under 100ms.
This costs around 30k USD per month on AWS, not including networking which is too hard for me to separate from our non time series workloads.
Our infrastructure is all python and microservices, using Kafka, and under 1000 cores for the whole pipeline excluding databases.
Our databases are I believe low tens of the largest graviton instances that were available the last time we upgraded.
This is maintained by a backend team of around 15 people, and a total engineering team including hardware, data and software of around 45 people.
0xdeafbeef 226 days ago
TiDB, 200k inserts per second; 200b per row on average. Bursty insert pattern, e.g., can have 200k inserts for an hour, then almost 0 for days. 8k reads per second on average, mostly reads by primary key. 20 hosts; 16 threads x 128GB RAM, 8TB NVME RAID 10. 60TiB of useful storage with replication factor of 3. Keyset pagination is the key. Also using rocksb for inserts batching. Costs around 20k on ovh
[-]
- ajhool 225 days ago
  thanks for the info and context
dharmab 226 days ago
Asking about customers is the wrong question.
1. It's often information that cannot be casually shared for legal reasons (MNPI)
2. A single customer might generate many queries. There have been times where a single one of my employer's customers generates more traffic than most companies will ever reach at peak.
[-]
- ajhool 226 days ago
  Fair. Please interpret as queries rather than customers.
  [-]
  - dharmab 226 days ago
    This is also the wrong question because the load of a query can vary drastically based on the query's structure/complexity, the size and types of the data, and which indexes and tables are used.
    The right question is - discuss a problem you'd like to solve using PostgreSQL and find people who have solved similar problems.
    [-]
    - ajhool 226 days ago
      Thank you for pointing that out, I agree. I'm just looking for peoples' experiences with their applications, which of course all have different workloads.
      People did get the spirit of the question and have provided helpful responses, including context about their query structure/complexity.
tmountain 226 days ago
You’re asking the wrong questions. Query complexity and patterns matter. Nobody can answer this for you. You have to do the analysis based on your workload.
[-]
- ajhool 226 days ago
  Agreed and thank you for pointing that out. I regret including my own app details as they were a distraction from the main question, I was mainly looking for other peoples' real world experiences and numbers, not help with my own app.
  While reviewing IoT databases I saw a lot of discussion here and elsewhere about database performance and wanted to see some numbers. Of course, applications with different workloads will have different database needs, but it is still helpful to hear that 5000 rps were served with ABC database with XYZ hardware assuming that the majority of queries on the majority of apps are simple lookups.
cullenking 226 days ago
I'll bite, just so you get a real answer instead of the very correct but annoying "don't worry about it right now" answers everyone else is going to provide!
We have a rails monolith that sends our master database instance between 2,000 and 10,000 queries per second depending on the time of year. We have a seasonal bike business with more traffic in the summer. 5% of queries are insert/update/delete, the rest read.
mariadb (mysql flavor), all reads and writes sent just to master. Two slaves, one for live failover, the other sitting on a ZFS volume for backup snapshotting sending snapshots off to rsync.net (they are awesome BTW).
We run all our own hardware. The database machines have 512gb of ram and dual EPYC 74F3 24 core processors, backed by a 4 drive raid10 nvme linux software raid volume on top of micron 9300 drives. These machines also house a legacy mongodb cluster (actually a really really nice and easy to maintain key/value store, which is how we use it) on a separate raid volume, an elastic search cluster, and a redis cluster. The redis cluster often is doing 10,000 commands a second on a 20gb db, and the elastic search cluster is a 3tb full text search + geo search database that does about 150 queries a second.
In other words, mysql isn't single tenant here, though it is single tenant on the drives that back our mysql database.
We don't have any caching as it pertains to database queries. yes we shove some expensive to compute data in redis and use that as a cache, but it wouldn't be hitting our database on a cache miss, it would instead recalculate it on the fly from GPS data. I would expect to 3-5x our current traffic before considering caching more seriously, but I'll probably once again just upgrade machines instead. I've been saying this for 15 years....
At the end of 2024 I went on a really fun quest to cut our DB size from 1.4tb down to about 500gb, along with a bunch of query performance improvements (remove unnecessary writes with small refactors, better indexes, dropping unneeded indices, changing from strings to enums in places, etc). I spent about 1 week of very enjoyable and fast paced work to accomplish this while everyone was out christmas break (my day job is now mostly management), and prob would need another 2 weeks to go after the other 30% performance improvements I have in mind.
All this is to serve a daily average of 200-300 http requests per second to our backend, with a mix of website visitors and users of our mobile apps. I've seen 1000rps steady-state peak peak last year and wasn't worried about anything. I wouldn't be surprised if we could get up to 5,000rps to our API with this current setup and a little tuning.
The biggest table by storage and by row count has 300 million rows and I think 150gb including indexes, though I've had a few tables eclipse a billion rows before rearchitecting things. Basically, if you use DB for analytics things get silly, but you can go a long ways before thinking "maybe this should go in its own datastore like clickhouse".
Also, it's not just queries per second, but also row operations per second. mysql is really really fast. We had some hidden performance issues that allowed me to go from 10,000,000 row ops per second down to 200,000 row ops per second right now. This didn't really change any noticable query performance, mysql was cool for some things just doing a ton of full table scans all over the place....
[-]
- ajhool 226 days ago
  wonderful, thank you. Some translations to AWS RDS...
  "512gb of ram and dual EPYC 74F3 24 core processors, backed by a 4 drive raid10 nvme linux software raid volume on top of micron 9300 drives"
  roughly translates to about an db.r8g.16xlarge (64 vCPUs, 512gb ram) $4,949 / month on-demand for compute
  I'm not familiar enough with hardware to determine IOPS for the raid config but I believe it is greater than the maximum for io2 block express storage on aws (256k IOPS):
  $0.10 per provisioned IOPS-month = 256000$.10 = $25,600 / month IOPS -- which feels high so I might be way off on the raid setup's IOPS
  $0.125 per GB-month storage = 500gb $0.125 = $62.50
  That's about $31,930 / month without any reserved discounts for an estimated capacity of 5,000 rps, sound about right? Would you say your total hardware cost is less than one or two months of comparable compute on AWS if the above is true?
  [-]
  - cullenking 226 days ago
    Yup, last time I priced this in RDS I got to maybe $20k a month for two reserved instances across AZs.
    I pay for our rack outright every 3-4 months from what I can tell. Still takes the same number of infra/ops/sre people as well. We staff 2, but really just have 1.25 worth of FTE work, you just need more for redundancy.
    Pretty nuts! This is also why I am so dismissive of performance optimization. Yeah, I'll just buy a new set of three machines with 2tb of ram each in a few years and call it good, still come out ahead.
    [-]
    - ajhool 226 days ago
      Much appreciated, thank you and congratulations on the bike business
Delomomonl 226 days ago
This sounds just wrong.
Why would you use micro service? You are not having teams.
Btw. Your data structure/ bytes per row is missing
[-]
- quinnirill 225 days ago
  Having teams is exactly the wrong reason to have micro services, having endpoints with different traffic patterns and resource/scaling constraints is exactly the correct reason.
  [-]
  - brewmarche 225 days ago
    Do you really need to jump to micro services when you need scaling? You can also start multiple instances of monoliths/bigger services.
    [-]
    - quinnirill 225 days ago
      If all the endpoints share the same resource constraints and traffic patterns, sure you can just scale up the monolith.
      Even otherwise, yes, you sure still can, just like you can write the service in bash if that’s how you want to spend money.
mannyv 226 days ago
With any database architecture the goal is to ensure that nothing hits your database if you can help it.
We served 13 billion requests last year, according to fastly, and burned 3.84PB of data.
Each one of those requests will generate a bunch of metrics, depending on what's happening.
To back that we have an Aurora db.r6g.large, which we've been using for a few years now. We just redid the metrics, so the CPU is now running at about 40% (it was at like 6% for a few years). Basically instead of aggregating metrics we moved to basically a time-series sort of structure.
The reason we can do that is simple: because we have no realtime requirement, we push all the requests through fastly, which pushes the logs to s3. Fastly uploads the logs to s3 every few minutes, and there's a lambda that chews through the logs whenever they show up. Usually there are like 10-15 of them running at a given time.
So really, by removing the realtime requirement we've been able to get away with getting our metrics for free (for the most part). In any case there's only like 5-10 minutes of lag, but since our customers generally look at yesterday's numbers 10AM local time it's not important.
So really, how big of a database do you need? With aurora unlimited storage you might as well use that. Do you need the data? At these prices you might as well keep all of it. Why not? Archive everything to s3, so you can rebuild it later (this has come in handy a few times).
From the UI, you need to eliminate queries until the user asks for them. Then limit the queries as much as possible. Use indexes. Ideally you'd pre-aggregate your data if possible.
For IoT, unless you need interactive responses just use HTTP GET requests into a CDN. No mqtt, etc. Why bother? And you probably don't have to worry about intermediate network devices clipping your GET requests, since you're deploying into someone's environment (make it a requirement for their network people).
For sizes and rows, they're not that big until recently. I think the biggest table has like 20-30 million rows? We aggregated stuff so our sizes were relatively small.
We have a monolith that handles the UI, api endpoints, and some maintenance things, and a bunch of lambdas that handles all the backend. Overall it's been pretty great; the backend that does work basically scales to some huge number. The real issue with lambdas is limiting concurrency so the DB doesn't get overwhelmed, and watching the memory/timeout numbers. Oh, and making sure that when you hook them up to SQS you don't let AWS fire up like 2000 lambdas (which works great for s3 processing, btw).
wolfgang42 226 days ago
Let's do some math.
First, the IOT devices reporting daily. In the absence of further context, I’m going to assume that it doesn’t matter when the devices report, so they can be configured to spread out their load. I’m also going to assume 1kb of data per device, but with an HTTPS API there’s roughly 7kb of overhead that we need to account for when calculating bandwidth. (Source: http://netsekure.org/2010/03/tls-overhead/ . TLS session resumption gets that down to ~300 bytes, but adds implementation complications.)
```
    $ units -1v '1M req/day' 'req/sec'
        1M req/day = 11.574074 req/sec
    $ units -1v '1M req/day * 8kbyte/req' 'kbyte/sec' # Incoming bandwidth
        1M req/day * 8kbyte/req = 92.592593 kbyte/sec
    $ units -1v '1M req/day * 1kbyte/req' 'years/TB' # Storage
        reciprocal conversion
        1 / (1M req/day * 1kbyte/req) = 2.7379093 years/TB
```
It looks like our load here is a whopping 12 RPS, and we could handle the traffic on an ISDN line from the 1990s. Data storage is a little trickier; if we can’t compress or delete old data we may have to stop by Best Buy for a new hard drive every half decade or so.
Users can’t be configured to load balance themselves, so we’ll be pessemistic and assume that every single one of them logs in to check their device over their morning coffee. We’ll also assume that every time they do that they want all of the data from the last 3 months, though in practice this could probably be summarized before we send it to them.
```
    $ units -1v '5000 req/1 hour' 'req/sec'
        5000 req/1 hour = 1.3888889 req/sec
    $ units -1v '5000 req/1 hour * 90 kbyte/req' 'MB/sec' # Outgoing bandwidth
        5000 req/1 hour * 90 kbyte/req = 0.125 MB/sec
```
For this, we have just under 2 RPS, but our responses are quite a lot bigger so the bandwidth is higher—we probably want to move into the early 2000s and upgrade to a DSL connection. Oh, and we also want to make sure our disk can handle the read load—but conveniently, since we’re just pulling raw data and not asking for any complicated joins, these numbers are actually the same. 2 RPS gives us 500ms per request; since a spinning rust drive pessimistically takes 10ms/seek we only have a ~50-seek budget per request so we probably want to make sure to cluster or partition the table by device to improve data locality. (Doing that increases the write load significantly, though, so maybe we want to upgrade to an SSD or think about a database that’s smart enough to do some kind of periodic bulk rebalancing.)
Oh, I almost forgot, we'll also want to make sure we have disk space to keep track of all of those idle logins who aren’t doing anything:
```
    $ units -1v '1M records * 1kbyte/rec' 'GB'
        1M records * 1kbyte/rec = 1 GB
```
Modern computers—for very relative definitions of modern—are fast, if they don’t get bogged down. Based on my numbers, you could probably run your system from a Macbook in your desk drawer; but it would be trivial to add a requirement which would multiply any of these numbers by several orders of magnitude. The problem with trying to compare architectures is that you have to really understand what your needs are, and not only are they going to be difficult to compare to someone else’s setup (small differences can add up fast when multiplied across requests) but it’s also hard to tell how much effort the other system put into optimizing (or, for that matter, how much effort you want to put into optimizing).
[-]
- chatmasta 226 days ago
  This is a nice analysis of incoming and outgoing bandwidth, but totally neglects to analyze the compute power required by each query. You can’t answer that without knowing the possible queries that users can send. Do they have a dashboard with a known set of queries? Do they have a GraphQL API that can produce pathological queries? Can they send SQL directly? Are they running aggregations? Do they use filters? Are they querying recent data or all historical data?
  And that’s just the read side. You also need to ask about ingestion and transformation. Is the database storing raw events and nothing else? (Is the user happy with that?) Is it append only? Is it being rolled up and summarized into daily partitions? How many transactions per second? How many of those are INSERT vs. UPDATE or DELETE? Which rows are being updated? Only recent or any of them? All of them?
  etc…
  There is no generic answer to this question, and “requests per second” is a reductive and insufficient interpretation of the problem that won’t identify any of the hidden complexity.
  [-]
  - wolfgang42 226 days ago
    The OP doesn’t give any hints as to what users are doing, so as I said I just assumed “last 90 days of data” was the only query they’d make, but it’s a fair point that that caused me to leave out CPU power since it’s basically negligible for just shoveling data off of a disk.
    Thanks for expanding on my comments about how important actually having detailed requirements is for doing performance calculations!
    [-]
    - chatmasta 226 days ago
      Yes indeed, OP only asked “how much traffic is required,” which could be zero if his database is locked while trying to respond to some absurd query from one of his users :)