Design Exercise



Requirement:

Business Case:

You are building a marketplace for the self-employed. The marketplace allows employers to post jobs, while prospective self-employed professionals can bid on projects. In this system, you have two actors:

  1. Seller: Posts a project with detailed project requirements, such as description, maximum budget and last day/time for accepting bids.
  2. Buyer (Self-Employed): Bids to do the work at a fixed price.

High Level Requirements:

  • 1. Design and Implement REST API to support the following requirements:
a. Create a Project.
b. Get a Project by ID.
   Returned fields should include the lowest bid amount.
c. API to Bid for a Project
d. API to Query for all Open Projects.
  • 2. The Buyer with the lowest bid automatically wins the bid when the deadline is reached.
  • 3. You are welcome to assume unspecified requirements to make it better for the customers.
  • 4. In-memory database is sufficient. Optionally, you are welcome to use a persistent data store of your choice.
  • 5. You are encouraged but not required to take advantage of a service code-generation framework of your choice when performing this exercise.
  • 6. Describe a cloud hosting plan for this service, incorporating scalability, stability, monitoring and disaster recovery.
  • 7. Describe an automated, continuous integration and deployment (CICD) process for production rollout.

Expectations:

  1. This is an open-ended exercise. The goal is to demonstrate how well you design a system with limited requirements.
  2. Come prepared with high level Architecture and Design.
  3. You are expected to explain the rationale for your choice of technologies and architectural and design patterns.

Possible onsite extensions

  • Pagination.
  • Architectural changes to support 5M users.
  • Resilient notification mechanism.
  • Decompose Project and Bid into two microservices: data management, communication, etc.

Q: Clarify requirements and define scope.

Assumptions:

  • Normally a seller may be reluctant to state the budget so precisely; they may prefer a range, or want providers to negotiate with them. For simplicity, we assume every project has a budget expressed as a float.
  • Here we assume a simple security model: all registered buyers can view and bid on all projects. In reality, sellers may want to create projects with RBAC (role-based access control) enforced, or allow only certain levels of buyers to bid on some projects.
  • Assume a user can be either a seller or a buyer, but not both. Anyone who wants to play both roles registers a separate account. This simplifies the overall design and implementation.
  • Assume a buyer cannot bid on a closed project, and the amount he/she proposes must not exceed the project budget.
  • We assume all data fits in the DB, so no data retention policy is required at the current stage. If the data grows too large, we can move outdated records into a secondary DB, or move non-critical fields into a NoSQL DB.
  • For better consistency, we put the core data into an RDBMS.

Q: Diagram of OO Design

[Diagram: Marketplace System OO design]
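
Since the original diagram is an image, a minimal Python sketch of the core entities is included here for reference. The class names, fields, and helper methods (Project, Bid, is_open, lowest_bid) are assumptions for illustration, not part of the original spec.

from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class Bid:
    id: int
    project_id: int
    buyer_id: int
    amount: float               # must not exceed the project budget
    created_at: datetime

@dataclass
class Project:
    id: int
    seller_id: int
    name: str
    summary: str
    description: str
    budget: float
    deadline: datetime          # last day/time for accepting bids
    bids: List[Bid] = field(default_factory=list)

    def is_open(self, now: datetime) -> bool:
        # Bids are only accepted before the deadline.
        return now < self.deadline

    def lowest_bid(self) -> Optional[Bid]:
        # The lowest bid automatically wins once the deadline is reached.
        return min(self.bids, key=lambda b: b.amount, default=None)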


Q: Design and implement the REST API.

[Diagram: Marketplace System REST API overview]

Highlights:

  • All data is sent and received as JSON.
  • For authorization, pass an OAuth2 token in the request header:
curl -H "Authorization: token OAUTH-TOKEN" https://XXX.XXX.XXX
  • The protocol version is v1 for all APIs.

  • Create a Project

Request:

POST /api/v1/projects
{
 "name": string,
 "summary": string,
 "description": string,
 "budget": float,
 "deadline": timestamp
}

# $protocol_version: v1, v2, etc.
  Reject requests from very old clients, in case of breaking API changes.
  • For security reasons, we'd better avoid asking for the seller ID in the POST body; the server derives it from the OAuth2 token.

Response:

HTTP/1.1 201 Created
{
  "id": int
}
HTTP/1.1 4XX/5XX ERROR
{
  "message": string
}
  • Get a Project by ID. Returned fields should include the lowest bid amount.

Request:

GET /api/v1/projects/${id}

Response:

HTTP/1.1 200 OK
{
  "id": int,
  "summary": string,
  "description": string,
  "budget": float,
  "deadline": timestamp,
  "lowest_bid_amount": int # return -1, if no bid at all
}
HTTP/1.1 4XX/5XX ERROR
{
  "message": string
}
  • API to Bid for a Project

Request:

POST /api/v1/projects/${id}/bid
{
  "amount": float
}
  • For security reasons, we'd better avoid asking for the buyer ID in the POST body; the server derives it from the OAuth2 token.

Response:

HTTP/1.1 201 Created
{
  "id": int
}
HTTP/1.1 4XX/5XX ERROR
{
  "message": string
}

If the project deadline is earlier than now, return a 405 error (the project is closed for bidding). A sketch of this check appears after these API definitions.

  • API to Query for all Open Projects.

Request:

GET /api/v1/projects?page=${page}&per_page=${per_page}

# page: page numbering is 1-based

# per_page: how many projects to return per page.
  Results are sorted in ascending order.
  The default is 30. The valid range is [1, 400] (inclusive).

Response:

HTTP/1.1 200 OK
{
  "per_page": 10,
  "pages": 1,
  "page": 1,
  "total": 4
  "projects":[
    {
      "id": int,
      "summary": string,
      "description": string,
      "budget": float,
      "deadline": timestamp,
      "lowest_bid_amount": int
    },
    {
      "id": int,
      "summary": string,
      "description": string,
      "budget": float,
      "deadline": timestamp,
      "lowest_bid_amount": int
    }
  ]
}
HTTP/1.1 4XX/5XX ERROR
{
  "message": string
}
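
To make the deadline check and the lowest-bid lookup above concrete, here is a minimal Python/Flask sketch built on the in-memory dataclasses from the OO design section. PROJECTS and current_user_id are illustrative placeholders (the latter stands in for resolving the buyer from the OAuth2 token); this is a sketch of one possible implementation, not the reference one.

from datetime import datetime
from flask import Flask, jsonify, request

app = Flask(__name__)
PROJECTS = {}   # in-memory store: project id -> Project (Project/Bid from the OO sketch above)

def current_user_id() -> int:
    # Placeholder: in a real service, derive the buyer from the OAuth2 token in the header.
    return 1

@app.route("/api/v1/projects/<int:project_id>", methods=["GET"])
def get_project(project_id):
    project = PROJECTS.get(project_id)
    if project is None:
        return jsonify({"message": "project not found"}), 404
    lowest = project.lowest_bid()
    return jsonify({
        "id": project.id,
        "summary": project.summary,
        "description": project.description,
        "budget": project.budget,
        "deadline": project.deadline.isoformat(),
        "lowest_bid_amount": lowest.amount if lowest else -1,
    }), 200

@app.route("/api/v1/projects/<int:project_id>/bid", methods=["POST"])
def place_bid(project_id):
    project = PROJECTS.get(project_id)
    if project is None:
        return jsonify({"message": "project not found"}), 404
    if not project.is_open(datetime.utcnow()):        # deadline passed: bidding is closed
        return jsonify({"message": "bidding is closed"}), 405
    amount = float(request.get_json()["amount"])
    if amount > project.budget:                        # bid must not exceed the budget
        return jsonify({"message": "bid exceeds the project budget"}), 400
    bid = Bid(id=len(project.bids) + 1, project_id=project.id,
              buyer_id=current_user_id(), amount=amount, created_at=datetime.utcnow())
    project.bids.append(bid)
    return jsonify({"id": bid.id}), 201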

Q: Describe a cloud hosting plan for this service, incorporating scalability, stability, monitoring and disaster recovery.

[Diagram: Marketplace System cloud hosting architecture]

Estimated cost: $244/month. (See https://cloudcraft.co/app)

The design depends on expectations, budgets, and options we may have.

Let's assume we treat the environment as a critical production system, so we want to avoid any SPOF (single point of failure) and minimize downtime.

  • Which cloud provider?
We need to choose among mature and advanced public cloud providers.

Currently AWS, Azure, and GCP are the leading providers.
AWS is clearly the most versatile one.

AWS can be more expensive than its competitors and on-premise options,
but when our environment is not that big, the cost difference is not that big either.

Hence *we choose AWS for further discussion.*
  • What about the DB?
The DB is the most critical part. It impacts not only system
availability but also data integrity.

We use AWS RDS, a hosted RDBMS service.

To avoid an SPOF, run the RDS instance with a standby replica in a different AZ (Multi-AZ; see the sketch after this list).
  • About DR: incremental + full backups with an S3 + Glacier backend data store.
1. Enable incremental backups plus a weekly full backup.
   This should be fast and only generate GBs of data for a medium-size system.
2. Backups are stored in S3. We can keep the latest 3 copies as hot backups.
3. Cold backup datasets are moved to Glacier automatically.
4. Enforce a data retention policy in Glacier to save cost.
  • About service deployment: prefer ECS/EKS; EC2 is fine as well.
For our application the logic is relatively simple,
and most of the stateful context is saved in RDS.
*Here we choose container deployment over VM deployment.*

ECS/Fargate is an option, and EKS is gaining momentum.
(Note: currently AWS EKS is only in preview mode.)

But before jumping to a conclusion, check with the local team
and make sure people are comfortable with container technology.
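
As a sketch of the Multi-AZ RDS setup mentioned above, the boto3 call below creates a PostgreSQL instance with a standby replica in another AZ and automated backups. The identifier, instance class, engine, and credentials are placeholder assumptions.

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# MultiAZ=True keeps a synchronous standby in a different AZ and fails over automatically.
rds.create_db_instance(
    DBInstanceIdentifier="marketplace-db",
    DBInstanceClass="db.t3.medium",
    Engine="postgres",
    AllocatedStorage=100,                  # GB
    MasterUsername="marketplace",
    MasterUserPassword="<fetch-from-secrets-manager>",
    MultiAZ=True,
    BackupRetentionPeriod=7,               # keep automated backups for 7 days
)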

About monitoring:

1. Enable AWS CloudWatch for infra-level monitoring: disk, RAM, CPU, file descriptors, etc. (see the sketch after this list).
2. Enable RDS CloudWatch metrics: slow queries, abnormal data growth.
3. Monitor application log files for unexpected errors/exceptions.
4. Application monitoring: integrate a health-check API.
5. Enable APM monitoring:
   The choice depends on the programming language; work it out with the developers.
6. *Redirect all alerts to Slack*.
   Critical ones go to a more public channel,
   and non-critical ones to internal channels.
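
For item 1, here is a minimal boto3 sketch that raises an alarm when average CPU stays above 80% and routes it to an SNS topic (which can in turn forward to Slack). The instance ID and topic ARN are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when average CPU stays above 80% for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="marketplace-api-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],   # SNS topic -> Slack
)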

Q: Describe an automated, continuous integration and deployment (CICD) process for production rollout.

Nowadays we typically have two standard CI workflows.

One set is Jenkins/Bamboo/TeamCity; the other set is
GitLab CI/Travis CI/Bitbucket Pipelines.

The main difference is that with the first set we set up and maintain
powerful server(s) of our own, which run lots of tests and visualize the results.

The second set is essentially hosted, or invisible to end
users: developers only need to add a YAML file, and after a git push the CI
runs automatically.

Normally the first set is easier to set up and more intuitive. But if
we are on a paid plan of GitHub or Bitbucket, the second one takes less
effort.

Here we choose Jenkins for further discussion. This gives us more freedom and fewer vendor lock-in concerns.

  • 1. Set up the Jenkins service with Docker.
If we don't run too many concurrent tests, a single Jenkins instance will work.

Otherwise we need to set up Jenkins master/agent nodes.
  • 2. Create Jenkins jobs to run tests.
Typically the tests cover the following areas:
1. Lint checks (static analysis)
2. Unit tests
3. Deployment tests
4. Functional tests
5. Behavior and/or UI acceptance tests.
  • 3. Set up the job trigger points, either by a pull or a push mechanism.
When people git push to a certain branch, we trigger the tests.

With the pull mechanism, we create a scheduled Jenkins job that polls for new git commits.
This way we don't need admin access to the git repo,
and no extra setup is required on the git server (GitHub/Bitbucket/GitLab).

With the push mechanism, we need to configure a git hook on the git server
and allow the git server's IP through the Jenkins firewall. This is usually not that easy:
the server IP may change from time to time, so the hook actions may fail,
or we would have to allow public access to Jenkins.

Certainly we can enforce token authentication,
but this still weakens security.

Both come with pros and cons. Here we choose the pull mechanism.
  • 4. Define a Jenkins pipeline to roll out to production.
When all Jenkins tests have passed, a Jenkins job can trigger the deployment.

It can be fully automated, or we can add an approval process.

To add an approval process, we can use the Jenkins pipeline input step feature.

Or we can define a git commit convention: say we only monitor pushes to the *master* branch,
and additionally require the commit message to contain a pattern like "DEPLOY TO PROD".
  • 5. One-button deployment.
Typically we have either container deployment or VM deployment.

With container deployment, we need less CM (configuration management):
ask Jenkins to build and push the latest Docker images,
then notify the prod environment to pull the given images and trigger the deployment.

With VM deployment, we might use SSH plus a CM tool to run the deployment.
  • 6. Online rolling upgrade.
Nobody wants a risky deployment.

With Kubernetes, we have built-in rolling upgrade support.

With VM deployments, enforce a health check between node deployments (see the sketch after this list).
  • 7. Send out notifications (Slack preferred).
Keep everybody in sync about prod environment updates:
- Who triggered the deployment (it could be a bot or a human)
- When it was updated
- How long it took
- Whether the deployment passed or failed

Redirect all major monitoring alerts to the same Slack channel.
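
For step 6 in the VM case, here is a minimal Python sketch of a rolling upgrade that deploys one node at a time and only continues when the node's health-check API answers. The node IPs, service name, and /health endpoint are assumptions for illustration.

import subprocess
import time
import urllib.request

NODES = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]   # app servers behind the load balancer

def healthy(node, retries=30, delay=5.0):
    # Poll the application's health-check endpoint until it returns 200 or we give up.
    for _ in range(retries):
        try:
            with urllib.request.urlopen(f"http://{node}:8080/health", timeout=3) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass
        time.sleep(delay)
    return False

for node in NODES:
    # Upgrade one node at a time so the remaining nodes keep serving traffic.
    subprocess.run(["ssh", node, "sudo", "systemctl", "restart", "marketplace-api"],
                   check=True)
    if not healthy(node):
        raise SystemExit(f"{node} failed its health check; aborting the rolling upgrade")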

Q: Architectural changes to support 5M users

TODO: I feel like I'm mostly stating common sense here.

[Diagram: Marketplace System architecture at 5M users]

Estimated cost: $5,750/month. (See https://cloudcraft.co/app)

  • What do 5 million users mean for our capacity planning? (A worked calculation appears after this list.)
With 5M users, the visitors may be geographically located in different areas,
different regions or even different countries.

We might not have strict peak hours and non-business hours.

Let's say 10% are active users, so we have 500K active users.

Users are globally located. Let's say 50% are online during the day and 50% at night,
so we assume 250K online users on average.

Apparently most activities are read-only.
Let's say people perform one action every 30 seconds,
and assume a read/write ratio of 20/1.

Then the estimated write OPS is 396.83 per second ((250K * (1/21)) / 30),
and the read OPS is 7936.51 per second ((250K * (20/21)) / 30).
Let's assume active users create 0.5 projects every month
and inactive users create 0.01 projects every month.

So we will have 295K new projects created every month (500K * 0.5 + 4.5M * 0.01).
Let's say each project generates 50 KB of data.

So the monthly new data would be 14.75 GB (295K * 50 KB).
  • About the data store: separate cold data from hot data.
- Move old data into a secondary data store,
  e.g., projects/bids which are older than 2 years.
  So we can assume the live data would be around 354 GB (24 months * 14.75 GB).
  A full DB backup and restore would take several hours.

- Move non-critical data from RDS into a secondary K/V store,
  e.g., project descriptions and pictures.

- Partition data by region or country.
  With this tenant design, the DB can scale out better.
  It is easier to manage, and also helps support the ~7.9K/second read QPS.
  • Performance improvements:
- Scale out:
  add more instances for the application.

- Scale up:
  upgrade the machine flavor, if it's not too extreme.

- Add more DB read replica(s):
  since the read/write ratio is high, more DB read replicas help.
  Probably we shall need no more than two (see the DB capacity planning in the next bullet).
  • Capacity planning for the DB service

From this link, we know one RDS db.r3.8xlarge instance can provide around 7000 QPS.

We're expecting 396.83 write QPS and 7936.51 read QPS.

So we can use 3 RDS instances (db.r3.4xlarge) to support this: 1 master, 2 slaves.

(db.r3.4xlarge: 16 vCPUs, 122 GB RAM)
- CloudFront (CDN):
  the web server can delegate serving static files to CloudFront.
  Deploy CloudFront to edges close to end users,
  and use latency-based DNS in AWS Route 53.
- AWS ElastiCache (Redis) for caching:
  load the frequent queries into a Redis cluster so the DB is less busy.
  Perfect candidates for caching are popular projects, active users, etc.
- DBA improvements for frequent DB actions:
  build secondary DB indices or DB views.
  • Avoid a region-level SPOF
- For serious environments like 5M users, a region outage will happen sooner or later.
  Set up a small mirror system in another region
  and configure cross-site async replication; it will serve as a standby system.

- Visitors may come from the US, Asia, Europe, or anywhere else.
  Geolocation-aware deployment speeds up performance.
  • About cost saving
1. Add budget monitoring and get alerts if the AWS bill grows too large.
2. Evaluate vendor lock-in issue(s).
   For a large environment, cost will be high if we have only a few options.
3. Enable auto-scaling.
4. Watch service characteristics and machine flavors closely.
   With suitable machine flavors, we can use less infrastructure, which saves cost.
  • About DR
Speed up DB backup/restore:
1. Instead of sequential table-by-table backup and restore, do it in parallel.
2. Perform backups when traffic is low; more traffic means more locking.
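
For reference, the back-of-the-envelope numbers above can be reproduced with the short script below; the percentages and per-action rates are the same assumptions stated in the list.

# Reproduce the capacity-planning estimates above.
TOTAL_USERS = 5_000_000
active = TOTAL_USERS * 0.10                 # 500K active users
online = active * 0.50                      # 250K online on average (day/night split)

ACTION_EVERY_SECONDS = 30
READ_WRITE_RATIO = 20                       # 20 reads for every write

actions_per_second = online / ACTION_EVERY_SECONDS
write_ops = actions_per_second * (1 / (READ_WRITE_RATIO + 1))
read_ops = actions_per_second * (READ_WRITE_RATIO / (READ_WRITE_RATIO + 1))
print(f"write OPS: {write_ops:.2f}, read OPS: {read_ops:.2f}")        # ~396.83 / ~7936.51

new_projects = active * 0.5 + (TOTAL_USERS - active) * 0.01           # ~295K per month
monthly_data_gb = new_projects * 50 / 1_000_000                       # 50 KB per project
print(f"projects/month: {new_projects:.0f}, new data: {monthly_data_gb:.2f} GB")  # ~14.75 GB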

Q: Resilient notification mech

TODO: not sure what this means.

  • In what scenarios might we need a notification feature?
Notify sellers when buyers place new bids on their projects.
Conversation notifications between individuals.
Notify buyers about projects they are interested in.
Etc.

Typical requirements:

  1. Deliver at-most-once vs. at-least-once
  2. Messages delivered in order
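
One reading of this is delivery guarantees. Below is a minimal Python sketch of an at-least-once sender with retries plus an idempotency key so the consumer can drop duplicates; the Notification fields and the send_once stub are illustrative assumptions.

import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Notification:
    recipient_id: int
    message: str
    # The dedup key lets the consumer drop duplicates caused by retries.
    dedup_key: str = field(default_factory=lambda: uuid.uuid4().hex)

def send_once(note):
    # Placeholder for the real delivery call (email, push, Slack, ...).
    print(f"deliver {note.dedup_key} to user {note.recipient_id}: {note.message}")
    return True

def send_at_least_once(note, max_retries=5):
    # Retry with exponential backoff; the dedup_key keeps retries idempotent downstream.
    for attempt in range(max_retries):
        try:
            if send_once(note):
                return True
        except Exception:
            pass
        time.sleep(2 ** attempt)            # 1s, 2s, 4s, ...
    return False                            # still failing: park it in a dead-letter queue

send_at_least_once(Notification(recipient_id=42, message="A new bid was placed on your project"))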

Q: Decompose Project and Bid into two microservices: data management, communication, etc.

TODO: not sure whether I have touched on the right points.

  • docker-compose.yml with an all-in-one environment: here
  • Kubernetes with a cluster environment: here
