Computer Science Mojo

~ David's Notes on coding, software and computer science


Book Notes : Designing Data-Intensive Applications

Category: System Design
By: David     On: Sat 20 February 2021     

A good book on system design. To be updated.


  • Part 1. Basics
    • 1 Core ideas
      • three main concerns of software systems
        • Reliability: The system should continue to work correctly
          • Aim for fault tolerance, since it is impossible to reduce the number of faults to zero
          • Deliberately inducing faults to test the system
        • Scalability: The system's ability to cope with increased load
          • The system can grow and scale with ease
          • e.g. Twitter home timelines: use a hybrid approach, fanning tweets out on write for most users and merging celebrities' tweets in at read time
          • Scaling
            • vertical scaling: making the machine more powerful
            • horizontal scaling: distributing the load across multiple machines
        • Maintainability: Different people can work on the system productively
          • Design principles
            • Operability: make it easy for the ops team to keep it running
            • Simplicity: keep the system as simple as possible and avoid unnecessary complexity, so that new engineers can onboard easily
            • Evolvability: make it easy to change the system in the future
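The hybrid Twitter approach mentioned under Scalability can be sketched roughly as follows. This is only a toy, in-memory illustration: the `Timeline` class, its method names, and the `celebrity_threshold` cutoff are all made up for this note and are not how Twitter actually implements it. Regular users' tweets are pushed into each follower's cached home timeline when they are posted; celebrities' tweets are stored once and merged in when a timeline is read.

```python
from collections import defaultdict


class Timeline:
    """Toy hybrid fan-out: push for regular users, pull for celebrities."""

    def __init__(self, celebrity_threshold=3):
        self.followers = defaultdict(set)    # author -> users following them
        self.following = defaultdict(set)    # user -> authors they follow
        self.tweets = defaultdict(list)      # author -> [(seq, author, text)]
        self.home_cache = defaultdict(list)  # user -> [(seq, author, text)]
        self.seq = 0                         # global counter used as a timestamp
        self.celebrity_threshold = celebrity_threshold  # hypothetical cutoff

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= self.celebrity_threshold

    def post(self, author, text):
        self.seq += 1
        entry = (self.seq, author, text)
        self.tweets[author].append(entry)
        # Fan-out on write only for regular users; a celebrity's tweet
        # would mean a huge number of cache writes, so it is skipped here.
        if not self.is_celebrity(author):
            for follower in self.followers[author]:
                self.home_cache[follower].append(entry)

    def home_timeline(self, user):
        # Start from the precomputed cache, keyed by seq to avoid duplicates.
        merged = {entry[0]: entry for entry in self.home_cache[user]}
        # Pull celebrity tweets at read time and merge them in.
        for author in self.following[user]:
            if self.is_celebrity(author):
                for entry in self.tweets[author]:
                    merged[entry[0]] = entry
        newest_first = sorted(merged.values(), reverse=True)
        return [(author, text) for _, author, text in newest_first]
```

The sketch ignores unfollows and does not backfill a regular user's older tweets for a new follower; it only shows where the write-time and read-time work ends up.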
    • 2 DB: Data models and query
      • relational, document, and graph models
      • Object-relational mismatch (impedance mismatch): the database is relational while application code is object-oriented, so a translation layer is needed between the two
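To make the mismatch concrete, here is a small sketch using Python's built-in `sqlite3` and `json` modules. The `profile` data, table names, and the `load_profile` helper are invented for illustration. One nested object has to be split across two tables and reassembled with an extra query, while the document form keeps it in one piece.

```python
import json
import sqlite3

# A nested application object (hypothetical example data).
profile = {
    "user_id": 1,
    "name": "Ada",
    "positions": [
        {"title": "Engineer", "org": "Acme"},
        {"title": "Founder", "org": "Initech"},
    ],
}

# Relational form: the one-to-many list needs its own table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE positions (user_id INTEGER, title TEXT, org TEXT)")
conn.execute(
    "INSERT INTO users VALUES (?, ?)", (profile["user_id"], profile["name"])
)
conn.executemany(
    "INSERT INTO positions VALUES (?, ?, ?)",
    [(profile["user_id"], p["title"], p["org"]) for p in profile["positions"]],
)


def load_profile(conn, user_id):
    """Reassemble the object from both tables (the translation layer)."""
    uid, name = conn.execute(
        "SELECT user_id, name FROM users WHERE user_id = ?", (user_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT title, org FROM positions WHERE user_id = ? ORDER BY rowid",
        (user_id,),
    ).fetchall()
    positions = [{"title": title, "org": org} for title, org in rows]
    return {"user_id": uid, "name": name, "positions": positions}


# Document form: the whole object is stored and loaded as one unit.
document = json.dumps(profile)
```

Both `load_profile(conn, 1)` and `json.loads(document)` give back the original object; the difference is how much mapping code sits in between.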
    • 3 DB: Data storage
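      • storage engines: log-structured and page-oriented
      • adding database indexes speeds up reads, but slows down writes, because every index must be updated on each write
      • SSTable (Sorted String Table): a segment file of key-value pairs sorted by key; the building block of the LSM-tree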
      • B-tree: self-balancing tree that maintains sorted data
      • Column Oriented Storage
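A rough sketch of the SSTable / LSM-tree idea, kept entirely in memory for brevity. The `TinyLSM` class and its `memtable_limit` parameter are made up for this note; real engines write segments to disk and add a write-ahead log, Bloom filters, and background compaction, none of which is shown here.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: an in-memory memtable plus sorted, immutable segments."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}    # most recent writes, unsorted
        self.segments = []    # oldest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a segment sorted by key (the "SSTable").
        if self.memtable:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Check segments from newest to oldest; binary search works
        # because each segment is sorted by key.
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def compact(self):
        # Merge all segments, letting newer values overwrite older ones.
        merged = {}
        for segment in self.segments:
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []
```

Writes only touch the memtable and append new segments, which is why log-structured engines tend to handle write-heavy loads well; reads may have to look through several segments until compaction merges them.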
    • 4 Encoding
      • requirements, application code, and data formats all change over time, so old and new versions must stay compatible
        • backward compatibility: newer code can read data that was written by older code
        • forward compatibility: older code can read data that was written by newer code
      • Two representations of data (encoding and decoding translate between them)
        • in memory: objects and data structures; producing them from bytes is decoding (parsing, deserialization, unmarshalling)
        • on disk/over the network: a self-contained sequence of bytes; producing it is encoding (serialization, marshalling)
      • encodings
        • JSON, XML, CSV
          • problems
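            • ambiguity around numbers: XML and CSV cannot distinguish a number from a string of digits, and JSON does not distinguish integers from floating-point numbers or specify a precision
            • no native support for binary strings (sequences of bytes)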
            • Optional schema support for XML and JSON, none for CSV
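The number and binary-string problems can be seen with Python's standard `json` module. This is a small sketch: `big_id` and `raw` are arbitrary example values, and `parse_int=float` is used only to imitate a parser that treats every number as a double (as many JSON parsers in other languages do).

```python
import base64
import json

# 2**53 + 1 is the smallest positive integer an IEEE 754 double cannot hold.
big_id = 2**53 + 1
text = json.dumps({"id": big_id})

# Python's json module keeps integers exact...
exact = json.loads(text)["id"]

# ...but a parser that maps every number to a double silently loses precision.
lossy = json.loads(text, parse_int=float)["id"]

print(exact == big_id)       # True
print(int(lossy) == big_id)  # False

# JSON has no byte-string type, so raw bytes are rejected outright.
raw = b"\x00\xff\x10binary"
try:
    json.dumps({"blob": raw})
    rejected = False
except TypeError:
    rejected = True

# Common workaround: Base64-encode the bytes into text before encoding.
wrapped = json.dumps({"blob": base64.b64encode(raw).decode("ascii")})
restored = base64.b64decode(json.loads(wrapped)["blob"])
print(restored == raw)  # True
```

The Base64 workaround is valid but makes the data larger and depends on both sides agreeing that the field is Base64 text, which is exactly the kind of convention a schema would otherwise record.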
        • binary encoding: more compact
          • binary encoding libraries: Protocol Buffers, Apache Thrift
      • schema evolution
        • adding a field:
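          • backward: new code must still read old data that lacks the field, so it can't be required; it must be optional or have a default value
          • forward: old code ignores tag numbers it does not recognize, so adding a field is fine as long as it uses a new tag number
        • removing a field:
          • backward: can never use the same tag number again, since old data written with that tag may still exist
          • forward: old code still expects required fields, so only an optional field can be removed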
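        • renaming a field: safe in Protocol Buffers and Thrift, where encoded data refers to tag numbers rather than field names; in formats that match fields by name it is like removing and adding at the same time, so do it with caution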
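The tag-number rules above can be tried out with a toy tag-length-value encoding. To be clear, this is not the Protocol Buffers or Thrift wire format: the `encode`/`decode` helpers, the one-byte tag and length, the string-only values, and the two schemas are all simplifications invented for this sketch.

```python
def encode(record, schema):
    """Encode a dict as (tag byte, length byte, UTF-8 value) triples.

    schema maps tag number -> (field name, default value).
    Toy limits: tags below 256, values shorter than 256 bytes.
    """
    out = bytearray()
    for tag, (name, _default) in schema.items():
        if name in record:
            data = str(record[name]).encode("utf-8")
            out += bytes([tag, len(data)]) + data
    return bytes(out)


def decode(payload, schema):
    """Decode bytes using the reader's schema."""
    # Missing fields fall back to their defaults (backward compatibility).
    record = {name: default for name, default in schema.values()}
    i = 0
    while i < len(payload):
        tag, length = payload[i], payload[i + 1]
        value = payload[i + 2 : i + 2 + length]
        i += 2 + length
        if tag in schema:
            record[schema[tag][0]] = value.decode("utf-8")
        # Unknown tags are skipped using the length (forward compatibility).
    return record


# Hypothetical schemas: v2 adds an optional field under a new tag number.
SCHEMA_V1 = {1: ("user_name", ""), 2: ("city", "")}
SCHEMA_V2 = {1: ("user_name", ""), 2: ("city", ""), 3: ("email", "unknown")}

old_bytes = encode({"user_name": "ann", "city": "Oslo"}, SCHEMA_V1)
new_bytes = encode(
    {"user_name": "bob", "city": "Rome", "email": "b@example.com"}, SCHEMA_V2
)

print(decode(old_bytes, SCHEMA_V2))  # new code, old data: email gets its default
print(decode(new_bytes, SCHEMA_V1))  # old code, new data: tag 3 is ignored
```

If tag 3 were later reused for a different field, `decode` would happily put old `email` bytes into it, which is why a removed field's tag number should never be used again.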
  • Part 2. Distributed
    • 5 replication
    • 6 Partitioning/Sharding
    • 7 Transactions
    • 8 Failovers
    • 9 Consistency
  • Part 3. Processing Data
    • 10 Batch
    • 11 Stream
    • 12 Future