Introduction to Distributed Systems
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another.
Why Do We Need Distributed Systems?
- Performance - Distributes computational load across multiple machines, leading to faster processing.
- Scalability - Allows systems to scale horizontally by adding more machines rather than upgrading a single machine.
- Availability - Ensures high availability and fault tolerance by distributing workloads and replicating data.
Fallacies of Distributed Computing
| Fallacy | Explanation |
|---|---|
| There is a Global Clock | There is no single global clock synchronizing all nodes, which leads to inconsistencies. |
| Network is Reliable | Networks experience delays, packet loss, and failures. |
| Latency is Zero | Communication delays exist and must be accounted for. |
| Bandwidth is Infinite | Limited bandwidth can lead to congestion. |
| Network is Secure | Networks are vulnerable to attacks and failures. |
| Topology Doesn't Change | Nodes may join or leave dynamically. |
| One Administrator | Large-scale systems have multiple admins with different policies. |
| Transport Cost is Zero | Communication has overhead costs (latency, processing). |
Challenges in Designing Distributed Systems
Many challenges arise from the absence of global state and from unreliable communication.
- Network Asynchrony - No guarantee that messages will be delivered in order or within a fixed time.
- Partial Failure - Some nodes may fail while others keep running, which makes maintaining atomicity across nodes difficult.
- Concurrency - Multiple computations may execute on the same data at the same time, requiring synchronization mechanisms (see the sketch after this list).
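As a minimal illustration of the concurrency challenge, the hypothetical Python sketch below has many threads updating a shared balance; the lock makes the read-modify-write step atomic, and without it updates can be lost. Threads in a single process stand in for nodes here; a real distributed system would need distributed locking or consensus instead.

```python
import threading

balance = 0
lock = threading.Lock()

def deposit(amount: int) -> None:
    """Read-modify-write on shared state; the lock makes it atomic."""
    global balance
    with lock:                       # remove this lock and some updates may be lost
        current = balance            # read
        balance = current + amount   # modify and write back

threads = [threading.Thread(target=deposit, args=(1,)) for _ in range(1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(balance)  # 1000 with the lock; often less without it
```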
Fundamental Trade-offs in Distributed Systems
- Consistency vs Availability (CAP Theorem): During a network partition, a system can stay consistent (every node sees the same data at all times) or stay available (the system keeps serving requests even when some nodes are unreachable), but not both. Ex - Banking systems prioritize consistency, while social media feeds prioritize availability.
- Cost vs Performance: High-performance distributed systems often require more resources. Ex - In AWS DynamoDB, provisioning more throughput capacity costs more.
Note - A safety property (something bad must never happen) generally outweighs a liveness property (something good must eventually happen) in a distributed environment.
Types of Distributed Systems
Distributed systems can be classified by the assumptions they make about clocks and message delays.
- Synchronous: Each node has an accurate clock, and there is a known upper bound on message transmission delay. Ex - Air traffic control systems.
- Asynchronous: There is no upper bound on how long message delivery can take. Ex - The Internet.
Types of Failures in Distributed Systems
Failures in distributed systems are inevitable, and systems must be designed to handle them gracefully.
| Failure Type | Description | How to Handle |
|---|---|---|
| Fail-Stop | Node stops permanently, and other nodes can detect that it has stopped. | Heartbeats and failure detectors (see the sketch below). |
| Crash | Node halts silently; others assume failure based on timeouts. | Heartbeats and failure detectors (see the sketch below). |
| Omission | Node fails to respond to incoming requests. | Retries and acknowledgments. |
| Byzantine | Node exhibits arbitrary behavior, including sending incorrect or malicious messages. | Byzantine Fault Tolerance (BFT) algorithms such as PBFT. |
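To make the "heartbeats and failure detectors" entry concrete, here is a minimal, hypothetical sketch of a timeout-based failure detector: each node records when it last heard from a peer and suspects the peer once that heartbeat becomes older than a chosen timeout. The timeout value is a guess, because message delays are unbounded in an asynchronous network, so a suspected node may in fact still be alive.

```python
import time

class FailureDetector:
    """Minimal timeout-based failure detector (a sketch, not production code)."""

    def __init__(self, timeout_seconds: float = 3.0):
        self.timeout = timeout_seconds
        self.last_heartbeat: dict[str, float] = {}

    def record_heartbeat(self, node_id: str) -> None:
        # Called whenever a heartbeat message arrives from a peer.
        self.last_heartbeat[node_id] = time.monotonic()

    def suspected_nodes(self) -> list[str]:
        # A peer is *suspected* (not proven) to have failed if its last
        # heartbeat is older than the timeout.
        now = time.monotonic()
        return [node for node, ts in self.last_heartbeat.items()
                if now - ts > self.timeout]

# Usage: peers periodically call record_heartbeat(); a monitor polls
# suspected_nodes() and treats suspected peers as crashed / fail-stop.
detector = FailureDetector(timeout_seconds=3.0)
detector.record_heartbeat("node-A")
print(detector.suspected_nodes())  # [] until node-A's heartbeat becomes stale
```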
Multiple Deliveries of a Message
Consider a banking application where a user is charged, but because the response is lost, the request is retried and the user is charged again.
To deal with situations like these, we can use a couple of strategies:
- Idempotent Operation Approach: Use operations that can be applied multiple times without changing the result beyond the first application (see the first sketch below). Ex - Adding a value to a HashSet.
- De-Duplication Approach: Give each message a unique identifier, so even if it is delivered multiple times, the recipient knows to discard the duplicates (see the second sketch below).
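A minimal sketch of the idempotent approach, assuming the charge can be recorded as adding an identifier to a set: applying the operation a second time leaves the state, and therefore the result, unchanged.

```python
processed_charges = set()

def record_charge(charge_id: str) -> None:
    processed_charges.add(charge_id)   # set.add is idempotent: only the first call changes state

record_charge("charge-42")
record_charge("charge-42")             # redelivered message has no further effect
print(processed_charges)               # {'charge-42'}
```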
Note - It is possible to have exactly-once processing in a distributed system, but not exactly-once delivery.
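A minimal sketch of the de-duplication approach, assuming each charge request carries a client-generated unique message ID (a hypothetical scheme): the handler remembers IDs it has already processed and discards duplicates, so at-least-once delivery still results in exactly-once processing of the non-idempotent charge.

```python
processed_ids: set[str] = set()
balances = {"alice": 100}

def handle_charge(message_id: str, account: str, amount: int) -> None:
    # The charge itself (balance -= amount) is NOT idempotent, so we
    # discard any message whose ID we have already processed.
    if message_id in processed_ids:
        return                        # duplicate delivery: ignore
    processed_ids.add(message_id)
    balances[account] -= amount

handle_charge("msg-1", "alice", 30)
handle_charge("msg-1", "alice", 30)   # retried by the client after a lost response
print(balances["alice"])              # 70: charged exactly once despite two deliveries
```

In a real system, the set of processed IDs would live in durable storage and be updated atomically with the balance, so a crash between the two steps cannot cause a double charge or a lost charge.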