Advanced topics in electrical and electronic engineering

Hong Kong University, 2022 summer semester class review: ELEC7078 Advanced Topics in Electrical and Electronic Engineering

1. Introduction

1.1 Distributed System

  • A distributed system is defined as a collection of autonomous computers linked by a network with software designed to produce an integrated computing facility.
  • A distributed system is a collection of independent computers that appears to its users as a single computer.
  • Distributed System = Collection of computers + Communication Network + Transparency

1.2 Advantages of Distributed Systems

  • Price/performance
    • A cost-effective way to build a larger system is to use a large number of cheap CPUs.
  • Nature of some applications
    • Some applications are inherently distributed (e.g. banking and supermarket chain).
  • Reliability
    • If one machine crashes, the system as a whole can still survive.
  • Incremental growth
    • Computing power can be added in small increments.
  • Data sharing
    • It allows many users to access a common database;
  • Device sharing
    • It allows many users to share expensive peripherals;
  • Communication
    • It provides communication facilities;
  • Flexibility
    • It spreads the workload over the available machines in the most cost-effective way.

Exercise(1)

Give five types of hardware resource and five types of data or software resource that can usefully be shared. Give examples of their sharing as it occurs in practice in distributed systems.

1.3 Characteristics

Resource Sharing

  • What to share?
    • Hardware devices and data
  • How to share?
    • Resources are stored in workstations and can be accessed via communications by a resource manager
  • Resource Manager
    • A program that offers a communication interface enabling the resource to be accessed, manipulated and updated reliably and consistently.

Openness

  • Characteristic that determines whether the system can be extended in various ways
  • Hardware: Additional peripherals, memory or communication interfaces;
  • Software: Additional operating system features, communication protocols and resource-sharing services.

Concurrency

  • In a distributed system with M computers (cores), up to M processes can be executed in parallel.
  • Parallel executions occur for two reasons:
    • More than one user simultaneously invokes commands or interacts with application programs.
    • Many server processes run concurrently, each responding to different requests from client processes.

Scalability

  • A DS can exist at different scales:
  • The smallest: 2 workstations + 1 file server
  • Local area network (LAN):
    • Hundreds of workstations
    • Several file servers
    • Print servers
  • Internetwork:
    • Several LANs interconnected

Fault Tolerance

  • The design of fault-tolerant computer systems is based on:
    • Hardware redundancy: the use of redundant components
    • Software recovery: the design of programs to tolerate (process group) or recover from faults

Transparency

  • The separation of components is hidden from the user and the application programmer;
  • Achieves a single system image, fooling everyone into thinking that the collection of machines is simply an old-fashioned time-sharing system.
  • Access transparency
    • Enable local and remote information to be accessed using identical operations.
  • Location transparency
    • Enable the information objects to be accessed without knowledge of their location (users need not tell where resources are located).
  • Concurrency transparency
    • Enable several processes to operate concurrently using shared information objects without interference (multiple users can share resources automatically).
  • Replication transparency
    • Enable multiple replicas to be used to increase reliability and performance without user knowledge of how many replicas exist.
  • Failure transparency
    • Enable concealment of faults, allowing users to complete their tasks despite the failure of hardware or software components.
  • Migration transparency
    • Allow information objects to move within a system without changing their names or affecting users.
  • Performance transparency
    • Allow the system to be configured to improve performance as loads vary.
  • Scaling transparency
    • Allow the system and applications to expand in scale without change to the system structure or the application algorithms.
  • Parallelism transparency
    • Allow programs to be executed in parallel without the users' knowledge.

1.4 User Requirements

Functionality

A distributed system should bring an improvement over the services provided by any single computer through:

  • Sharing across a network, which brings access to a richer variety of resources;
  • Parallel and fault-tolerant applications.

How to migrate from multi-user centralized computing to distributed computing?

  • Adapt existing operating systems
    • Continue to use existing operating system software that has been adapted for networking.
    • e.g. adding servers to UNIX, such as the Sun Network File System.
  • Move to an entirely new operating system designed specifically for distributed system.
    • Existing software becomes unusable.
  • Emulation
    • Move to a new OS designed for DS which can emulate one or more existing OS.
    • Existing and new distributed software can run side by side.

Re-configurability

  • Short-term
    • A failed process, computer or network component is replaced by a working counterpart.
    • Load is shifted from over-loaded to less-loaded machines to increase the total throughput of the DS.
    • To reduce network communication, data are moved from one machine to others to keep the data accessible.
  • Medium/long term evolution
    • To accommodate heterogeneous components, assign new tasks, or upgrade the existing machines.

Quality of Service

  • Performance
    • Speed up the response of software components in a distributed system;
  • Reliability and availability
    • Fault-tolerance;
  • Security
    • Apply a reasonable degree of security to the data stored and transmitted within a distributed system.

1.5 Basic Design Issues

  • Naming
    • Name resources or objects in order to access them.
  • Communication
    • Optimize the communication implementations in distributed systems while retaining a high level programming model for its use.
  • Software structure
    • Define interfaces and good abstractions of data and services.
  • Workload allocation
    • Deploy computers and communications to achieve optimum performance and use of resources.
  • Consistency maintenance
    • How to balance consistency & performance?
  • Security
    • How to secure message transfer in a distributed system?

Exercise(2)(3)

List the three main software components that may fail when a client process invokes a method in a server object, giving an example of a failure in each case. To what extent are these failures independent of one another?

The three main software components that may fail are:

  • the client process e.g. it may crash
  • the server process e.g. the process may crash
  • the communication software e.g. a message may fail to arrive

The failures are generally caused independently of one another. Examples of dependent failures:

  • if the loss of a message causes the client or server process to crash. (The crashing of a server would cause a client to perceive that a reply message is missing and might indirectly cause it to fail).
  • if crashing clients cause problems for servers.
  • if the crash of a process causes a failure in the communication software.

Suggest how the components can be made to tolerate one another’s failures.

Both processes should be able to tolerate missing messages.

  • The client must tolerate a missing reply message after it has sent an invocation request message. Instead of making the user wait forever for the reply, a client process could use a timeout and then tell the user it has not been able to contact the server.
  • A simple server just waits for request messages, executes invocations and sends replies. It should be absolutely immune to lost messages. But if a server stores information about its clients it might eventually fail if clients crash without informing the server (so that it can remove redundant information).
  • The communication software should be designed to tolerate crashes in the communicating processes.

For example, the failure of one process should not cause problems in the communication between the surviving processes.

Compare and contrast cloud computing with more traditional client server computing, what is novel about cloud computing as a concept?

The following answers Exercise(1) above.

Hardware:

  • CPU: compute server (executes processor-intensive applications for clients), remote object server (executes methods on behalf of clients), worm program (shares cpu capacity of desktop machine with the local user). Most other servers, such as file servers, do some computation for their clients, hence their cpu is a shared resource.
  • memory: cache server (holds recently-accessed web pages in its RAM, for faster access by other local computers)
  • disk: file server, virtual disk server (see Chapter 8), video on demand server (see Chapter 15).
  • screen: Network window systems, such as X-11, allow processes in remote computers to update the content of windows.
  • printer: networked printers accept print jobs from many computers, managing them with a queuing system.
  • network capacity: packet transmission enables many simultaneous communication channels (streams of data) to be transmitted on the same circuits.

Data/software:

  • web page: web servers enable multiple clients to share read-only page content (usually stored in a file, but sometimes generated on-the-fly).
  • file: file servers enable multiple clients to share read-write files. Conflicting updates may result in inconsistent results. Most useful for files that change infrequently, such as software binaries.
  • object: possibilities for software objects are limitless. E.g. shared whiteboard, shared diary, room booking system, etc
  • database: databases are intended to record the definitive state of some related sets of data. They have been shared ever since multi-user computers appeared. They include techniques to manage concurrent updates.
  • newsgroup content: The netnews system makes read-only copies of the recently-posted news items available to clients throughout the Internet. A copy of newsgroup content is maintained at each netnews server that is an approximate replica of those at other servers. Each server makes its data available to multiple clients.
  • video/audio stream: Servers can store entire videos on disk and deliver them at playback speed to multiple clients simultaneously.
  • exclusive lock: a system-level object provided by a lock server, enabling several clients to coordinate their use of a resource (such as a printer that does not include a queuing scheme).

2. Inter-process Communication


Why we need Inter-process Communication (IPC)?

  • The components of a distributed system are both logically and physically separated
  • They must communicate in order to interact.

2.1 Communication Patterns

  • Client-server communication

    • request and reply messages provide the basis for communication between clients and servers.
    • The idea of the model is to structure the distributed systems as a group of cooperating processes, i.e. the servers, that offer services to the users, namely, the clients.
  • Group communication

    • some messages are sent to several processes in a group.

    • Applications are composed of large numbers of peer processes running on separate computers and the pattern of communication between them depends entirely on application requirements.

2.2 Data Passing

  • For any two computers to exchange data values, we need to map data structures and data items to messages.
  • Data structures must be flattened before transmission and rebuilt on arrival (i.e., structured data is flattened into a sequence of basic data items).
  • On receiving the data stream, the data structure must be rebuilt.
  • Marshalling
    • the process of taking a collection of data items and assembling them into a form suitable for transmission in a message;
  • Unmarshalling
    • the process of disassembling them on arrival to produce an equivalent collection of data items at the destination;
  • Usually a language preprocessor (interface compiler) can be used to generate marshalling / unmarshalling operations automatically.
  • When an IPC primitive is encountered involving data item of the above type, the preprocessor generates code to do the marshalling (for a send) or unmarshalling (for a receive) based on the type description.

JSON (JavaScript Object Notation)
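
JSON is one common external data representation for such flattening. A minimal illustration in Python (the record contents are made up):

```python
import json

# A structured data item (e.g. a person record) to be sent in a message.
record = {"name": "Smith", "place": "London", "year": 1984}

# Marshalling: flatten the structure into a byte sequence for transmission.
wire_bytes = json.dumps(record).encode("utf-8")

# Unmarshalling: rebuild an equivalent data structure at the destination.
rebuilt = json.loads(wire_bytes.decode("utf-8"))
assert rebuilt == record
```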

Exercise

Describe and illustrate the client-server architecture of one or more major Internet applications

Web

Browsers are clients of Domain Name Servers (DNS) and web servers (HTTP). Some intranets are configured to interpose a Proxy server. Proxy servers fulfil several purposes – when they are located at the same site as the client, they reduce network delays and network traffic. When they are at the same site as the server, they form a security checkpoint (see pp. 107 and 271) and they can reduce load on the server.

N.B. DNS servers are also involved in all of the application architectures described below, but they are omitted from the discussion for clarity

Describe and illustrate the peer-to-peer architecture of one or more major Internet applications

2.3 Synchronization


  • A central issue in the communication structure;
  • 2 types of operations
    • Blocking: the invocation blocks the execution of its invoker.
    • Non-blocking: the invocation does not block the execution of its invoker.

Blocking


  • Blocking Send

    • Issuing process blocks (i.e., control is not passed back) until the message has been sent and received.
  • Blocking Receive

    • Issuing process blocks until a message has arrived and passed to the process.

Blocking vs. Non-blocking

Non-blocking

  • Non-blocking send

    • The issuing process continues execution (i.e. control is passed back) once the message has been copied out of the process’s environment.
  • Non-blocking receive

    • The issuing process continues even if there is no message waiting to be received. The receiver process will be notified later of message arrival, either by polling or by an interrupt mechanism.

Synchronous vs. Asynchronous Communication

  • Drawback (of blocking): can lead to inefficiency due to waiting
    • high overhead and less efficiency.
  • Solution: time-out
    • Receive (A, msg, TO);
      • If a message has not arrived within TO seconds, the process will be unblocked (and the receive operation aborted).
    • Send (B, msg, TO);
      • The sender blocks; if the message is not received within TO seconds, the process is unblocked.
  • Synchronous communication
    • Blocking send and blocking receive;
    • Sender and receiver synchronize at point of message transfer.
    • Advantage:
      • Can make definite assumptions on message send and receipt
      • Easy design and control of the distributed processes
  • Asynchronous communication
    • Sender and receiver do not synchronize at message transfer.
    • Non-blocking send + non-blocking receive;
    • Non-blocking send + blocking receive (usual combination);
  • More flexible and potentially more parallelism;
  • Fewer assumptions can be made about sending and receiving, making it more difficult to verify program properties.
  • Non-blocking send requires buffering of messages.

2.4 Implementation

Process Location

  • A port is a location-independent identifier which can be mapped to a low-level address in order to deliver messages.

  • In TCP/IP, a message destination address consists of a port number (used by the process) and the Internet address of the computer (on which the process resides), e.g.

    • Send (portB, msg);
    • Receive(portB, msg);

Unreliable vs. Reliable Messages

  • Unreliable message refers to a single message transmitted from sender to receiver, without acknowledgement or retries.
    • e.g. UDP only makes its “best effort” to deliver a message.
  • Reliable message delivery may be constructed from an unreliable service by using acknowledgements.
    • Positive ack: the receiver sends an ack message whenever a message is received.
    • Negative ack: the receiver does not send an ack message until something goes wrong (a timeout or receipt of an incorrect message).
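
A minimal sketch of how reliable delivery can be built from an unreliable service using positive acks and retransmission; the simulated lossy channel, loss rate and retry limit are all illustrative:

```python
import random

random.seed(1)

def unreliable_send(message, loss_rate=0.3):
    """Best-effort delivery (like UDP): the message may silently be lost."""
    return None if random.random() < loss_rate else message

def reliable_send(message, max_retries=10):
    """Positive-ack scheme: retransmit until the receiver acknowledges."""
    for attempt in range(1, max_retries + 1):
        delivered = unreliable_send(message)
        if delivered is not None:
            ack = unreliable_send("ack")   # the ack itself may also be lost
            if ack is not None:
                return attempt             # sender knows delivery succeeded
        # no ack before the timeout: retransmit
    raise RuntimeError("receiver assumed unavailable after max_retries")

print("delivered after", reliable_send("request #1"), "attempt(s)")
```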

Client-Server Communication Protocol

  • Client-server model uses request-reply communication.

    • Request-reply is normally synchronous because a client will wait for the reply.
    • Request-reply can be asynchronous in cases where the client can afford to retrieve the reply later.
  • Communication failures

    • Loss of request message: communication link fails / network switch fails / receiver’s node is down.
    • Loss of reply message: communication link fails / network switch fails / sender’s node is down.
    • Unsuccessful execution of the request: server crashes while executing the request.
  • In the presence of communication failures three protocols are used for implementing various type of client-server communication.

  • The request (R) protocol

    • Client issues a Send (server-id, request) and continues. It is suitable for cases in which there is no reply required from the server and that the client requires no confirmation that the request has been carried out.
  • The request-reply (RR) protocol

    • Most commonly used;
    • The reply message from the server also acts as acknowledgment to the original request message.
    • A subsequent request from the client may be regarded as an acknowledgment of the server’s message.
  • The request-reply-acknowledge reply (RRA) protocol

    • An acknowledgement is sent back to the server after the reply is received
    • The ack includes the request-Id and acknowledges all request messages up to that request-Id.
    • Although the exchange involves an additional message, it need not block the client as the acknowledgement may be transmitted after the reply has been given to the client, but it does use processing and network resources.

2.5 Other Issues

  • Time-out

    • occur when a request message is lost or the network becomes partitioned, or the server is overloaded (and hence slow); or the reply message is lost.

    • DoOperation repeats sending the request message N times (with time-outs) before reporting failure.

    • It is impossible to distinguish between a process failure and a communication failure. When a process does not reply after some agreed number N of attempts to communicate with it, it is assumed to be unavailable.

    • The choice of N is difficult.

  • Duplicated request messages

    • Occur when request message is retransmitted (on time-outs).
    • Duplicates can be detected using Request-Id (like a sequence number) and discarded.
  • Lost reply messages

    • If the server has already sent the reply message (and it was lost), it may need to execute the request again to obtain the result. Re-execution is only possible for idempotent operations.
  • An idempotent operation is an operation that can be performed repeatedly with the same effect as if it had been performed exactly once.

  • If the server operation is not idempotent, a record of past results (called a history) can be kept. The history can be kept from growing too large by using the RRA protocol, or by discarding results that have passed a certain time limit (see the sketch after this list).

  • Multipacket messages

    • Datagrams have limited length (often 8 kbytes).
    • This is not enough if a request or reply is too large.
    • Solution: a multipacket message, made up of a sequence of datagrams.
    • Drawbacks: complicated design and control (must be received in sequence), low efficiency in retransmission.
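
As referenced above, a minimal sketch of the history idea: the server remembers replies keyed by request-Id, so a retransmitted duplicate of a non-idempotent operation is answered from the history rather than re-executed. The `deposit` operation and all values are made up:

```python
balance = 100
history = {}  # request_id -> saved reply, for the non-idempotent operation

def deposit(request_id, amount):
    """Non-idempotent: re-executing a duplicate would add the amount twice,
    so duplicates are detected by request_id and answered from the history."""
    global balance
    if request_id in history:
        return history[request_id]   # duplicate request: resend the old reply
    balance += amount
    history[request_id] = balance
    return balance

print(deposit(1, 50))  # 150
print(deposit(1, 50))  # still 150: the retransmission is not re-executed
```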

3. Time in Distributed Systems

Time is an important and interesting issue in distributed systems because

  • Internal (computer-to-computer) and external (computer-to-external) synchronization;
  • Many algorithms depend upon clock synchronization, e.g. transaction.

3.1 Synchronizing Physical Clocks

  • Physical clocks
  • Electronic devices that count oscillations occurring in a crystal at a definite frequency.
  • It is useful for keeping accurate time and time-stamping events, e.g., time in accounting records of connection.
  • An event is an action that appears to occur indivisibly.
  • Sources of accurate timing signals:
    • Coordinated Universal Time (UTC)
    • Radio broadcast, accuracy: 0.1 - 10 ms
    • Satellite (Geostationary Operational Environmental Satellite, GOES), accuracy: 1 ms
    • Satellite (Global Positioning System, GPS), accuracy: 0.1 ms
  • Difficulties in distributed systems
    • Not all sites have direct access to accurate time sources such as GPS receivers.
    • Sites have to synchronize their local clocks with those that have more accurate time.
    • Synchronization needs to be done periodically due to clock drift: clocks count time at different rates, and so diverge.

Cristian’s method

A central time server process S supplies the time according to its clock upon request.

  • If a process P requests the time in a message mr, and receives the time value t in a message mt, then it could set its clock to t + Ttrans, where Ttrans is the time taken to transmit mt from the server S to P.
  • Ttrans varies. We may write Ttrans = min + x, where x ≥ 0 and min is the message transmission time when no other processes compete and no other messages are in transit.
  • min can be measured or conservatively estimated, but x is still unknown!
  • Let Tround be the total round-trip time to send the request mr and receive the reply mt; then P should estimate its clock as t + Tround / 2 (Tround can be measured or conservatively estimated).
  • The accuracy is ±(Tround / 2 - min).
  • Problem: single-server failure.
  • Solution: use a group of synchronized time servers.
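
A small sketch of Cristian's estimate and its accuracy bound; the function name and all figures are hypothetical:

```python
def cristian_estimate(t, t_round, t_min):
    """Set the clock to t + Tround/2; the accuracy is +/-(Tround/2 - min)."""
    return t + t_round / 2, t_round / 2 - t_min

# Hypothetical figures: server replied t = 10_000.000 s, round trip 24 ms,
# minimum one-way transmission time 10 ms.
estimate, accuracy = cristian_estimate(10_000.000, 0.024, 0.010)
print(f"set clock to {estimate:.3f} s, accurate to +/- {accuracy * 1000:.0f} ms")
```
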
Exercise

Consider a host using Cristian’s method to synchronize its clock with a time server T and got the following records. Assume the total delay of transmitting a message from the host to the time server or vice versa is 10ms.

  • Which of these times should be used to set its clock? To what time should it set? Estimate the accuracy of the setting.

    The message with the minimal delay should be chosen, which is message 2: round trip 24 ms (14:22:10.564)

    Time = 14:22:10.564 + 12 ms = 14:22:10.576

    Accuracy: ±(Tround/2 - min) = ±(12 - 10) = ±2 ms

  • How will the answer of part (a) change if the total delay is 9 ms?

    Time: no change

    Accuracy: ±(12 - 9) = ±3 ms

  • If it is required to synchronize the host’s clock to within ±1 ms, discuss how to achieve it. Assume the 9 ms total delay is used.

    Keep synchronizing until a round trip of 20 ms or less is observed: ±(20/2 - 9) = ±1 ms.

The Berkeley algorithm

  • An algorithm for internal synchronization in BSD UNIX.

  • (Unlike Cristian’s) In a group of sites, one is chosen as coordinator (master). It periodically polls the other sites (slaves) to synchronize their clocks.

  • The master estimates the slaves’ clock times by observing round-trip times (as in Cristian’s method). It averages the times obtained (including its own).

  • Averaging (probabilistically) cancels out the tendencies of individual clocks to run fast or slow.

  • Accuracy depends on the round-trip time between master and slaves.

  • The master sends a time-rate adjustment value (+ or -) to each slave, requesting it to adjust its time rate.

  • The master takes a fault-tolerant average: readings that have drifted too far are ignored.
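
A minimal sketch of the master's side: take a fault-tolerant average of the observed readings (discarding those that deviate too far) and compute a signed adjustment for each site. The outlier threshold, site names and readings are all made up:

```python
def berkeley_adjustments(readings, max_skew=0.100):
    """readings: clock values the master observed (its own included).
    Readings deviating too far from the median are excluded, then every
    site is sent the signed difference between the average and its clock."""
    ordered = sorted(readings.values())
    median = ordered[len(ordered) // 2]
    good = {s: t for s, t in readings.items() if abs(t - median) <= max_skew}
    average = sum(good.values()) / len(good)
    return {site: average - t for site, t in readings.items()}

obs = {"master": 3.000, "A": 3.010, "B": 2.985, "C": 9.999}  # C is faulty
for site, adj in berkeley_adjustments(obs).items():
    print(f"{site}: adjust by {adj:+.3f} s")
```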

Exercise

A group of servers uses the Berkeley algorithm to synchronize their physical clocks. The coordinator received the following replies:

  • What is the clock difference between the coordinator and members? Give the answer with respect to the coordinator.

  • Draft the messages to be sent to each member and what should the member do after receiving the message?

  • What if the clock of member C is running faster than the coordinator by 100ms? What is the potential problem of setting the clock value immediately based on the new adjustment?

Network Time Protocol (NTP)
  • A standard for clock synchronization throughout the Internet
  • Design aims and features
    • To provide a service enabling clients across the Internet to be synchronized accurately to UTC;
      • Employs statistical techniques for the filtering of timing data and it discriminates between the quality of timing data from different servers
    • To provide a reliable service that survives lengthy losses of connectivity;
      • Redundant servers and paths
    • To enable clients to resynchronize sufficiently frequently;
      • Scale to large numbers of clients and servers
  • NTP servers synchronize over UDP in two modes:
    • Multicast mode
      • high speed LAN
      • Multicast periodically
    • Procedure call mode
      • Similar to the operation of Cristian’s algorithm
      • More accurate than multicast mode

3.2 Causal Ordering (Happens-before)

Causal ordering

  • x < y
    • x and y are different events of the same process and x occurs before y
  • s < r
    • s is a send event and r is the corresponding receive event
  • y < r
    • y<s and s<r

Happened-before (written a => b):

  • Transitivity: a => b, b => c, c => d, d => f implies a => f.

    • Two events a and e that are not ordered by => are concurrent, written a || e.
    • A logical clock can capture the happened-before relation.
    • It is a monotonically increasing software counter, whose value need bear no particular relationship to any physical clock.
  • Causal Order: if a => b then a < b

    • a and b are causally related if a => b or b => a
    • a and b are independent if neither a => b nor b => a

3.3 Logical Time / Virtual Time / Logical Clock

  • Objective:
    • Create virtual time without generating additional message
  • Each entity x creates and maintains a logical clock Cx
  • For each event a occurring at x, C(a)=Cx(a)
  • Denote the timestamp of event a at p by Cp(a) and the timestamp of event b at whatever process it occurred by C(b).

To capture the happened-before relation =>, we have the following rules:

  • LC1: Cp is incremented before each event is issued at process p: Cp := Cp + 1.
  • LC2: When a process p sends a message m, it piggybacks on m the value t = Cp. On receiving (m,t), a process q computes Cq := max(Cq, t) and then applies LC1 before timestamping the event rcv(m).

It guarantees that if a => b then C(a) < C(b) (the converse does not hold).

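Rules LC1 and LC2 translate directly into a small counter class; a minimal sketch, not any particular library's API:

```python
class LamportClock:
    """Monotonic software counter implementing rules LC1 and LC2."""
    def __init__(self):
        self.value = 0

    def tick(self):          # LC1: increment before each event at this process
        self.value += 1
        return self.value

    def send(self):          # piggyback the timestamp t = Cp on the message
        return self.tick()

    def receive(self, t):    # LC2: Cq := max(Cq, t), then apply LC1
        self.value = max(self.value, t)
        return self.tick()

p, q = LamportClock(), LamportClock()
t = p.send()                 # p sends m with timestamp t
print(q.receive(t) > t)      # True: C(send) < C(receive)
```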

3.4 Vector Clock

  • Each entity Xi is equipped with a local integer counter Ci and increments its value by 1 at the beginning of every event, as with a logical clock

  • Each entity Xi is equipped with an n-dimensional vector Vi of values, one for each entity in the network. The value of Vi[i] is equal to Ci

  • The value of Vi[j], i≠j, is initially 0 and changes only when a message arrives at Xi

  • Whenever an entity Xi sends a message to a neighbor Xj, it encloses in the message the vector of values Vi

  • Whenever an entity Xj processes the arrival of a message with a vector vect of values, it updates its local vector Vj as follows: for all i≠j, it sets Vj[i]:=max{vect[i],Vj[i]}

  • Partial Order

    • A ≤ B if A[i] ≤ B[i] for all indices i
  • Complete Causal Order

    • A < B if and only if A ≤ B and A[i] < B[i] for at least one index i, e.g. [2,4,3] < [3,4,3]
  • Property

    • For any two events a and b at Xi, Vi(a)<Vi(b) if and only if t(a)<t(b)
    • If a is a sending event and b is the corresponding receiving event, then V(a)<V(b)
    • If a → b, then V(a)<V(b)
    • If V(a)<V(b), then a → b? (Yes: unlike logical clocks, vector clocks characterize the causal order exactly.)
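
A minimal vector clock sketch following the update rules above, together with the partial-order test; the entity indices and sample vectors are illustrative:

```python
class VectorClock:
    """Vector clock for entity Xi in a network of n entities."""
    def __init__(self, i, n):
        self.i, self.v = i, [0] * n

    def local_event(self):
        self.v[self.i] += 1                          # Ci incremented at every event

    def send(self):
        self.local_event()
        return list(self.v)                          # enclose a copy of Vi

    def receive(self, vect):
        self.local_event()
        for k in range(len(self.v)):
            if k != self.i:                          # for all k != i:
                self.v[k] = max(self.v[k], vect[k])  # Vi[k] := max{vect[k], Vi[k]}

def happened_before(a, b):
    """A < B iff A[k] <= B[k] for all k, and A[k] < B[k] for at least one k."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

x0, x1 = VectorClock(0, 2), VectorClock(1, 2)
m = x0.send()                                 # x0's send event: V0 = [1, 0]
x1.receive(m)                                 # x1's receive event: V1 = [1, 1]
print(happened_before([1, 0], x1.v))          # True: send happened before receive
print(happened_before([2, 4, 3], [3, 4, 3]))  # True: the example from above
print(happened_before([3, 0, 0], [0, 4, 3]))  # False: the events are concurrent
```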

3.5 Exercise

Consider the following events :

  • Assume the initial logical clocks of all processes are zero. What are the logical clock values of E1, E4, E5, E8, E9 and E11?

  • What are the causal orders of E5 and E8, E6 and E9, and E2 and E11?

    E5 || E8, E6 -> E9, E2->E11

  • Can we work out the causal order of E2 and E11 by their logical clocks?

    No. Although the clock value of E11 (11) is greater than that of E2 (1), this does not guarantee that E2 happened before E11 (even though E2 did in fact happen before E11). Consider E2 and E4: the clock value of E4 is greater than that of E2, but E2 and E4 are not causally related.

  • What are the vector clocks of E2 and E11? Can we work out the causal order of E2 and E11 based on their vector clocks?


4. Distributed Coordination

Wave and Traversal Algorithms

  • Wave algorithms: message-passing schemes used for
    • broadcasting, synchronization and computing global functions
  • Traversal algorithms: wave algorithms in which the events of the computation are totally ordered by causality
  • Elementary tasks of most distributed algorithms
    • election, termination detection, mutual exclusion

4.1 Wave Algorithms

  • Distributed algorithm → collection of possible computations → collection of events

  • Events are ordered partially (causal precedence relation)

  • A computation is a collection of events, partially ordered by the causal precedence relation

  • |C| = number of events of computation C

  • Cp = the events of computation C occurring in process p

  • decide = a special internal event marking the decision

A wave algorithm exchanges a finite number of messages and then makes a decision

Wave Algorithm Requirements

  • Initiators / Starters

    • Starts the execution of its local algorithm spontaneously i.e. triggered by some condition internal to the process
    • First event: internal/ send
  • Non-initiators / followers

    • Become involved in the algorithm only when a message of the algorithm arrives and triggers the execution of the process algorithm
    • First event: receive
  • Centralization vs Decentralized

    • Centralized / single source: Exactly one
    • Decentralized / multi-source: the initiators are a subset of the processes
  • Topology : ring, tree, clique

  • Initial Knowledge

    • Process identities: each process initially knows its own unique name
    • Neighbors’ identities: each process initially knows the names of its neighbors
    • Sense of direction
  • Number of decisions: the number of processes that execute a decide event
  • Complexity: the number of exchanged messages

Application of Wave

  • Propagation of Information with feedback (PIF)
    • Some information must be broadcast to all processes and certain processes must receive a notification of when the broadcast is complete
  • Synchronization
    • In each process q an event aq must be executed, and in some processes an event bp must be executed, such that the execution of all aq events takes place temporally before any of the bp events is executed. In a SYN algorithm the bp events are considered decide events.
  • Computation of Infimum Function
    • A function must be computed whose value depends essentially on the input of every process.

4.2 Ring Algorithm

4.3 Tree Algorithm

  • All leaves of the tree initiate the algorithm
  • If a process has received a message via each of its incident channels except one, the process sends a message via the remaining channel
  • Each process sends exactly one message in the algorithm
  • If a process has received a message via all of its incident channels it decides

4.4 Echo Algorithm


  • Initiator sends messages to all its neighbors
  • Upon receipt of the first message, a non-initiator forwards messages to all its neighbors except the one from which the message was received
  • When a non-initiator has received messages from all its neighbors, an echo is sent to its father
  • When the initiator has received a message from all its neighbors, it decides.
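
A minimal single-threaded simulation of the echo algorithm, delivering messages via a FIFO queue rather than a real network; the example graph is made up. It returns the spanning tree formed by the father pointers:

```python
from collections import deque

def echo(graph, initiator):
    """Simulate the echo algorithm; returns the spanning tree (node -> father)."""
    father = {initiator: None}
    pending = {u: len(graph[u]) for u in graph}   # messages still expected
    queue = deque((initiator, v) for v in graph[initiator])  # initiator floods

    while queue:
        sender, node = queue.popleft()            # deliver one message
        if node not in father:                    # first message: adopt a father
            father[node] = sender
            for v in graph[node]:
                if v != sender:
                    queue.append((node, v))       # forward to other neighbours
        pending[node] -= 1
        if pending[node] == 0:                    # heard from all neighbours
            if father[node] is not None:
                queue.append((node, father[node]))  # send the echo to the father
            else:
                return father                     # the initiator decides

graph = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
print(echo(graph, 1))  # {1: None, 2: 1, 3: 1, 4: 3}
```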

4.5 Polling Algorithm


A process can decide once it has received a message from each neighbor

4.6 Phase Algorithm

  • The phase algorithm can be used in arbitrary directed networks, where channels can carry messages in one direction only

  • In-neighbors: processes that can send message to the node

  • Out-neighbors: processes to which the node can send message

  • Diameter of the network must be known

  • In the phase algorithm, each process sends exactly D messages to each out-neighbor, where D is the diameter of the network

4.7 Traversal Algorithms


The traversal algorithms have the following properties:

  • In each computation there is one initiator which starts the algorithm by sending out exactly one message
  • A process, upon receipt of a message, either sends out one message or decides
  • The algorithm terminates in the initiator and when this happens, each process has sent a message at least once

The first two properties imply that in each finite computation exactly one process decides. The algorithm is said to terminate in the single process that decides

Traversing Cliques: Sequential Polling Algorithm

Traversing Connected Network: Tarry’s Algorithm

  • R1. A process never forwards the token twice through the same channel
  • R2. A non-initiator forwards the token to its father (the neighbor from which it first received the token) only if there is no other channel possible according to rule R1
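
A minimal sketch of Tarry's traversal under rules R1 and R2, simulated as a token walk over an undirected graph (the graph is illustrative). The token traverses every channel exactly once in each direction and terminates at the initiator:

```python
def tarry(graph, initiator):
    """Token traversal obeying R1 (never reuse a channel in the same direction)
    and R2 (return to the father only as a last resort); returns the walk."""
    used = set()                   # directed channels (u, v) already traversed
    father = {initiator: None}
    walk, node = [initiator], initiator

    while True:
        options = [v for v in graph[node] if (node, v) not in used]
        non_father = [v for v in options if v != father[node]]
        if non_father:
            nxt = non_father[0]              # R1: any unused channel
        elif options:
            nxt = options[0]                 # R2: only the father channel is left
        else:
            return walk                      # initiator exhausted all channels
        used.add((node, nxt))
        father.setdefault(nxt, node)         # first receipt fixes the father
        walk.append(nxt)
        node = nxt

graph = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
print(tarry(graph, 1))  # e.g. [1, 2, 3, 1, 3, 2, 1]
```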

5. Distributed Mutual Exclusion and Election


5.1 Distributed Coordination


  • Why do we need distributed coordination?
    • To prevent interference and ensure consistency before accessing resources, e.g. NFS file system to share a common text file.
  • Distributed Mutual Exclusion (DME)
    • A single process is temporarily given a privilege - the right to access shared resources - before another process is granted it.

5.2 Distributed Mutual Exclusion


Basic requirements for DME concerning some resources:

  • ME1: (safety) At most one process may execute in the critical section (CS) at a time.
  • ME2: (liveness and deadlock-free) A process requesting entry to the CS is eventually granted it (so long as any process executing in the CS eventually leaves it.)
  • ME3: (ordering) Entry to the CS should be granted in happened-before order.

The central server algorithm: server manages a mutual exclusion token for a set of processes


  • To employ a server that grants permission to enter a CS.

  • Assume only one CS is managed. The protocol is as follows:

  • Satisfies ME1 and ME2. (ME3 holds only with respect to the order in which requests arrive at the server, which may differ from happened-before order.)

  • Problem 1: the server could become a performance bottleneck.

  • Problem 2: single point of failure.

5.2.1 DME: Ricart and Agrawala’s algorithm
  • Based on distributed agreement;

  • No central server is required.

  • The basic idea is that a process requiring entry to a critical section multicasts a request message, and can enter the section only when all the other processes have replied to this message.

  • Assumption:

    • p1, …, pn know one another’s addresses.
    • All messages sent are eventually received and delivered
    • each pi keeps a logical clock, updated according to rules LC1 and LC2.
  • Messages requesting the token are of the form <T, pi>, where T is the sender’s timestamp and pi is the sender’s id.

  • Only one CS is concerned.

  • 3 states: RELEASED, WANTED, HELD.

**Protocol:**

  • Satisfies ME1, ME2 and ME3.
  • Protocol: entry to the CS is associated with possession of a token.
  • Example: consider three processes p1, p2 and p3.
    • p3 is not interested in entering the CS.
    • p1 and p2 request it concurrently.
    • The timestamps of p1’s and p2’s requests are 41 and 34 respectively.
    • p3 replies to all requests. p2 does not reply to p1’s request, since p2’s own timestamp (34) is lower, while p1 does reply to p2’s request (see the rule sketched below).
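
The reply decision in this example can be written as a small rule. This sketch assumes requests are compared as (Lamport timestamp, process id) pairs; the state names follow the slides:

```python
RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"

def on_request(state, my_request, incoming):
    """Decide whether pi replies immediately to an incoming request <T, pj>.
    my_request is pi's own pending <T, pi> (relevant only when WANTED)."""
    if state == HELD or (state == WANTED and my_request < incoming):
        return "queue"   # defer the reply until pi leaves the CS
    return "reply"       # RELEASED, or our own request has a later timestamp

# The lecture example: p1 and p2 request concurrently, timestamps 41 and 34.
print(on_request(WANTED, (34, "p2"), (41, "p1")))  # p2 queues p1's request
print(on_request(WANTED, (41, "p1"), (34, "p2")))  # p1 replies to p2
print(on_request(RELEASED, None, (41, "p1")))      # p3 replies to everyone
```
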
5.2.2 DME: Ring-based Algorithm
  • One of the simplest algorithms is to arrange the n processes into a logical ring.
  • A logical token circulates around the ring (say in the clockwise direction).
  • If a process does not need to enter the CS, it immediately forwards the token to its neighbor.
  • A process wishing to enter the CS waits until it holds the token, and then retains it.
  • To leave the CS, the process releases the token to its neighbour.
  • If a process holding the token fails, an election is required to pick a unique process from the surviving members, which regenerates the token and transmits it as before.
  • Care must be taken to ensure the process has really failed; otherwise two tokens may circulate on the ring!
5.2.3 Exercise

Discuss and compare the performance characteristics of two distributed mutual exclusion algorithms,

(i) the central server algorithm

(ii) Ricart and Agrawala’s algorithm,

in terms of message overheads, response time of requests, and reliability.

Message overheads

For each token request:

  • The central server algorithm: 3 messages (request, token, release)
  • The distributed algorithm: 2(N-1) messages.

For N clients and all will access the CS

  • The central server algorithm: 3N messages
  • The distributed algorithm: 2N(N-1) messages = 2N^2 - 2N

Response Time:

  • In the central server algorithm, every token request from a client requires two messages (request and grant). The client waits a round-trip message time for the reply from the server.
  • In the distributed algorithm using logical clocks, every requester sends out (N-1) messages and waits for (N-1) replies, i.e. it waits for the slowest of (N-1) peers.

So, the response time of the central server algorithm is shorter.

Reliability:

  • The central server algorithm is more reliable, since only a failure of the central server will make the system stop working.
  • For the distributed algorithm, the failure of any node stops the others from entering the critical section.
5.2.4 Exercise 2 (quiz)

The following system uses central server algorithm to achieve distributed mutual exclusion.

a. Which clock, real-time clock of server S or logical clocks in the system, should be used to order the requests in the Request Queue? Briefly explain your answer.

As the order of granting the token is based on request arrival time, the ordering uses the real-time clock of server S. Maintaining logical clocks in the system (S, P1, P2 and P3) has no significant benefit in this case.

b. Briefly suggest how the system recovers from the following situations (consider each situation independently):

  • P2 is temporarily disconnected from the network.

    Assume a timer is assigned to the token, defining its validity. If the process does not release the token before the time-out, the server will generate a new token for the next request. The work done with the invalidated token should be undone.

    If P2 is temporarily disconnected from the system and reconnects to the network before the time-out, there should be no effect on the system.

    However, if P2 only reconnects after the time-out, the updates associated with the token must be undone.

  • P2 crashes and is restarted immediately

    P2 will probably lose its memory after restarting, and a new token should be generated after the time-out. P2 should undo any operations associated with the lost token.

  • S crashes

    The system can only recover after a new server is elected.

5.3 Distributed Coordination: Election

  • A method to choose a unique process to play a particular role is called an election algorithm.
  • The main requirement is that the choice of elected process be unique, even when several processes call elections concurrently
Election: Bully Algorithm
  • It can be used when the members of a group know the identities and addresses of the other members.

  • The algorithm selects the surviving member with the largest identity to be coordinator.

  • Assumption: communication is reliable but processes can fail during an election.

  • Three types of message:

    • election message - sent to announce an election; the sender awaits answer messages.
    • answer message - sent in response to an election message.
    • coordinator message - sent to announce the identity of the new coordinator.
  • A process begins an election if it notices coordinator has failed.

  • The process that knows it has the highest identifier can elect itself as the coordinator simply by sending a coordinator message to all processes with lower identifiers.

  • Otherwise, a process can begin an election by sending an election message to those processes that have a higher identifier.

  • It then awaits an answer message in response.

  • If none arrives within a certain time, the process considers itself the coordinator, and sends a coordinator message to all processes with lower identifiers announcing this fact.

  • Otherwise (the process receives answer message(s)), the process waits a further limited period for a coordinator message to arrive from the new coordinator.

  • If none arrives, it begins another election.

  • If a process receives a coordinator message, it sets the identifier in the message as the new coordinator.

  • When an election message arrives at a node with a higher identifier, that node sends an answer and starts its own election (i.e. sends election messages to nodes with still higher identifiers).

  • When a failed process is restarted, it begins an election; if it has the highest identity, it will decide that it is the coordinator and announce this to the other processes.

  • It becomes the coordinator even if the current coordinator is functioning (hence the name “bully”).
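
A drastically simplified sketch of the election outcome, assuming reliable communication and no failures during the election itself (the full message exchange is elided); the process ids and alive set are illustrative:

```python
def bully_election(alive, starter):
    """`alive` is the set of surviving process ids. Each prodded process
    answers and prods still-higher ids in turn; the highest surviving id
    receives no answer and so announces itself as coordinator."""
    candidate = starter
    while True:
        higher = [p for p in alive if p > candidate]
        if not higher:
            return candidate        # no answer arrives: becomes coordinator
        candidate = max(higher)     # a higher process takes over the election

print(bully_election(alive={1, 2, 3}, starter=1))  # p4 has failed; elects 3
```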

Stage 1: p1 detects the failure of the coordinator p4 and announces an election. (priority: p1 < p2 < p3)

Stage 2: p2 and p3 send answer message to p1 and then begin their own elections.

Stage 3: Since p3 receives no answer message from p4, it decides it is the coordinator. But it fails before it sends out the coordinator message, so p1 times out.

Stage 4: p1 notices the absence of a coordinator message and begins another election. This time p2 is elected coordinator.

Election: Ring-based Election
  • Assumption:
    • processes are arranged in a logical ring
    • each process knows only how to communicate with its neighbor in, say, the clockwise direction. The process does not know the other processes.
    • All the processes remain functional and reachable
  • Initially, every process is marked as a non-participant

  • Any process can begin an election by marking itself as a participant, placing its ID in an election msg and sending it to its neighbor.

  • When a process receives an election msg:

    • if the arrived ID is greater, it forwards the msg to its neighbor
    • If the arrived ID is smaller and the process is not a participant then it substitutes its own ID in the msg and forwards it; but it does not forward the msg if it is already a participant.
  • On forwarding an election msg, the process marks itself as a participant

  • If the received ID is that of the receiver itself, this process’s ID must be the greatest and it becomes the coordinator.

  • The coordinator marks itself as a non-participant once more and sends an elected msg with its ID to its neighbour.

  • When a process other than the coordinator receives an elected msg, it marks itself as a non-participant and forwards the msg to its neighbor

  • Marking processes as participant or non-participant is to extinguish the concurrent election messages as soon as possible.

  • In the worst case, 3n-1 messages are needed to complete an election: n-1 to reach the eventual winner, n for its ID to circulate back to it, and n elected messages.
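
The per-process receive rule above can be sketched as a pure function; the process ids and the three-node ring are illustrative:

```python
def on_election_msg(my_id, arrived_id, participant):
    """Returns (id to forward or None, new participant flag, elected?)."""
    if arrived_id == my_id:
        return None, False, True        # own id came back: become coordinator
    if arrived_id > my_id:
        return arrived_id, True, False  # forward the greater id unchanged
    if not participant:
        return my_id, True, False       # substitute own (greater) id
    return None, True, False            # already a participant: swallow it

# A 3-process ring 5 -> 9 -> 7 -> 5; the election is started by process 5:
print(on_election_msg(9, 5, False))  # (9, True, False): 9 substitutes itself
print(on_election_msg(7, 9, False))  # (9, True, False): 7 forwards
print(on_election_msg(5, 9, True))   # (9, True, False): 5 forwards
print(on_election_msg(9, 9, True))   # (None, False, True): 9 is elected
```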

**An Example with two concurrent tokens**

Exercise

In the ring-based election algorithm, the processes need to mark participant or non-participant in an election process. Explain why.

Ans: Two processes may start the election at the same time, leading to more than one election message. The purpose of marking participant or non-participant is to extinguish the concurrent election messages as soon as possible, and always before the “winning” result has been announced.

6. Replication

Replication is the maintenance of on-line copies of data and other resources.

Motivation

  • Performance enhancement
    • fast response time and increased throughput
  • High availability
    • clients can access an alternative server if the default server fails or becomes unreachable
  • Fault tolerance
    • provide guarantees of correct request processing even though one of the servers in a group fails

Replication: Requirement

  • Replication transparency
    • clients should not be aware that multiple physical copies of data exist.
  • Consistency transparency
    • unacceptable for different clients to obtain differing results
    • deals with how updates are applied to the different replicas

Ordering Models: Asynchronous

  • All client requests are processed by the local replica server
  • The local replica servers communicate updates to all other replica servers. Servers process updates as they arrive

Ordering Models: Totally Synchronous

  • All update requests are totally ordered. That is, requests are processed at all replicas in the same order.
  • A next request can be processed only after the previous update has been processed at all servers
  • Poor performance in response time and throughput
  • Schemes in between: quorum-based schemes (a minimum number of votes for a decision/event) and causal ordering


6.1 Basic Architectural Model

Three entities:

  • Replica manager: maintains a physical copy of every logical data item
  • Client: makes a series of requests.
  • Front end: handles the requests of clients
    • to communicate with replica managers for clients’ requests

6.2 Ordering

Total ordering

  • If r1 and r2 are requests, then either r1 is processed before r2 at all replica managers or r2 is processed before r1 at all replica managers.

Causal ordering

  • If r1 and r2 are requests and r1 happened-before r2, then r1 is processed before r2 at all replica managers

Implementation Techniques for Ordering

  • Hold-Back: a received request is not processed by a replica manager until ordering constraints can be met.
  • E.g. a bulletin board item Re:Microkernels may be held back until an item concerning Microkernels has already appeared.

Implementing Total Ordering

  • Basic approach: assigns totally ordered identifiers to requests so that each RM (replica manager) site makes the same ordering decision based on these identifiers
  • Simple technique: use a process called sequencer to assign identifiers
  • All requests are sent to the sequencer as well as to the RM sites
  • The sequencer assigns consecutive increasing identifiers to requests and forwards them to the RM sites
  • Requests arriving at an RM site are held back until they are next in sequence.
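
A minimal sketch of sequencer-based total ordering with hold-back at a replica manager; the class and variable names are made up:

```python
import heapq
import itertools

class Sequencer:
    """Assigns consecutive identifiers to requests (a single point of order)."""
    def __init__(self):
        self.counter = itertools.count(1)

    def assign(self, request):
        return (next(self.counter), request)

class ReplicaManager:
    """Holds back out-of-order requests until they are next in sequence."""
    def __init__(self):
        self.expected, self.hold_back, self.delivered = 1, [], []

    def receive(self, seq, request):
        heapq.heappush(self.hold_back, (seq, request))
        while self.hold_back and self.hold_back[0][0] == self.expected:
            self.delivered.append(heapq.heappop(self.hold_back)[1])
            self.expected += 1

s, rm = Sequencer(), ReplicaManager()
a, b = s.assign("r1"), s.assign("r2")
rm.receive(*b)        # r2 arrives out of order: held back
rm.receive(*a)        # r1 arrives: both can now be delivered in order
print(rm.delivered)   # ['r1', 'r2'], the same order at every RM
```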

6.3 Another Algorithm for Total Ordering

  • RM sites propose identifiers for requests to the corresponding FE (front end) site as they arrive
  • The FE site uses them to generate final identifiers
  • Each RM site has:
    • Fmax: the largest final identifier agreed so far
    • Pmax: its own largest proposed identifier
  • The FE site sends the request bearing a temporary identifier (larger than previously used ones) to all RM sites

  • Each RM at site i replies with its proposed identifier (P):

    P := max(Fmax, Pmax) + 1 + i/N

    • where N is the number of RM sites
  • Each RM places the request on its hold-back queue, which is ordered by request identifier (smallest first)

  • The FE site selects the largest proposed identifier as the next agreed identifier.

  • The term i/N makes the selected identifier unique

  • The FE site then notifies all the RM sites of the final identifier (Agreed ID)

  • The RM sites attach the final identifier to the request and re-order the request on the hold-back queue. Note that the final identifier may differ from the proposed identifier.

  • When the request at the front of the hold-back queue has been assigned its final identifier, it is transferred to the tail of the delivery queue.
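
A worked sketch of the identifier rule, with hypothetical values for Fmax, Pmax and the number of sites (assuming site indices i run from 1 to N):

```python
def propose(f_max, p_max, i, n):
    """RM site i proposes an identifier for a newly arrived request."""
    return max(f_max, p_max) + 1 + i / n   # the i/n term makes proposals unique

# Hypothetical: 3 RM sites, all with Fmax = 4 and Pmax = 4 so far.
proposals = [propose(4, 4, i, 3) for i in range(1, 4)]
print(proposals)         # [5.33.., 5.66.., 6.0]: distinct at every site
agreed = max(proposals)  # the FE picks the largest proposal
print(agreed)            # 6.0 becomes the final (agreed) identifier
```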

Advantages:

  • straightforward to implement
  • no bottleneck or single point of failure

Implementing Causal Ordering

  • A bulletin board system with several replicas serves as an example
  • Assumptions:
    • bulletin board items are never removed
    • it is updated only by the addition of new items
  • Vector Clock

6.4 Deadlocks

  • The use of locks (for exclusive use) can lead to deadlock.
  • Deadlock is a state in which each member of a group of transactions is waiting for some other member to release a lock.
  • A wait-for graph can be used to capture this waiting relationship.
  • A directed edge is added from node T to node U if transaction T is waiting for transaction U to release a lock.
  • As each transaction can wait for only one data item, the wait-for graph can be simplified to contain transactions only.
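
Since each transaction waits for at most one other, deadlock detection reduces to following the wait-for edges until a transaction repeats; a minimal sketch (the transaction names are illustrative):

```python
def find_deadlock(wait_for):
    """wait_for maps each transaction to the one it waits for (or None).
    A deadlock is a cycle reachable by following the edges."""
    for start in wait_for:
        seen, t = set(), start
        while t is not None and t not in seen:
            seen.add(t)
            t = wait_for.get(t)
        if t is not None:            # revisited a transaction: found a cycle
            return t
    return None

# T waits for U, U waits for V, V waits for T: a wait-for cycle.
print(find_deadlock({"T": "U", "U": "V", "V": "T"}))   # deadlock detected
print(find_deadlock({"T": "U", "U": None}))            # None: no deadlock
```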

Deadlock with Read and Write Locks

Deadlock Resolution
  • Deadlock prevention
    • a transaction needs to get all locks when it starts
    • a transaction requests locks in a predefined order
  • Deadlock detection
    • find cycles in the wait-for graph and break the cycle
    • the choice of the transaction to abort is not simple
  • Timeouts
    • commonly used
    • the transaction is aborted after a timeout to release the lock, if another transaction is requesting it.
Distributed Deadlock
  • Detection of a distributed deadlock requires a global wait-for cycle to be found.
  • A simple solution: one server is dedicated to detect global deadlocks periodically
    • combine local wait-for graphs to check for cycles
    • disadvantages: single point of failure, lack of fault tolerance and no ability to scale.
    • How often to detect deadlocks?
Phantom Deadlocks
  • A deadlock is “detected” but is not really a deadlock
  • Phantom deadlocks occur due to the transmission delay of wait-for information
  • In the course of detection, a transaction that held a lock may meanwhile have released it.
  • If transactions are using two-phase locks, can phantom deadlock occur?
Edge Chasing
  • Also called path pushing
  • The global wait-for graph is not constructed.
  • Instead, the servers forward messages called Probes that follow the edges of the wait-for graph.
  • A probe message consists of transaction wait-for relationships representing a path in the global wait-for graph.
  • Initiation
    • When a transaction T starts waiting for another transaction U, which is itself waiting to access a data item at another server, the edge <T->U> is sent to the server of the data item on which U is blocked.
  • Detection
    • On receiving a probe, the server appends edges to the path it carries
    • If there is no cycle and a transaction is waiting for a data item at another server, the probe is forwarded to that server.
  • Resolution
    • When deadlock is detected, a transaction is selected to abort.
More Than One Transaction Aborted
  • Every transaction involved in a deadlock cycle can initiate the deadlock detection
  • This concurrent detection may lead to the abortion of more than one transaction
  • To solve it, transactions are given totally-ordered priorities.
  • The transaction with the lowest priority is aborted to break the deadlock

6.5 Exercise

A front end has vector timestamp (3,5,7) representing the data it has received from members of a group of three replica managers. The three replica managers have vector timestamps (5,2,8), (4,5,6) and (4,5,8), respectively. Which replica manager(s) could immediately satisfy a query from the front end, and what is the resultant time stamp of the front end? Which replica manager could incorporate an update from the front end immediately?

Ans: The replica manager with the timestamp (4,5,8) can satisfy the query because the others have not yet processed at least one update seen by the front end. The resulting timestamp of the front end will be (4,5,8). The replica manager with the timestamp (4,5,8) could incorporate the update.

7. Distributed File Service

7.1 Conventional File System

  • Responsible for the organization, storage, retrieval, naming, sharing and protection of files
  • Provides a set of programming operations that characterize the file abstraction
  • Designed to store and manage large numbers of files, with facilities for creating, naming and deleting the files;
  • Controls access to files

7.2 Distributed File System

  • An essential component in distributed systems
  • Used to support the sharing of persistent storage and its information
  • Enable user programs to access remote files without copying them to a local disk
  • Provide access to files from diskless nodes
  • The most heavily-used services →its functionality and performance are critical

7.3 Design Issues

The features that are partially or wholly addressed by most current file services:

  • Access transparency

    • Client programs should be unaware of the distribution of files.
    • A single set of operations is provided for access to local and remote files.
    • Programs written to operate on local files are able to access remote files without modification.
  • Location transparency

    • Client programs should see a uniform file name space.
    • File or groups of files may be relocated without changing their names.
    • User programs see the same name space wherever they are executed.
  • Concurrency transparency

    • Changes to a file by one client should not interfere with the operation of other clients simultaneously accessing the same file.
  • Failure transparency

    • Correct operation of servers after failure of a client;
    • Correct operation of client programs in the face of lost message and temporary interruptions of the service.
  • Performance transparency

    • Client programs should continue to perform satisfactorily while the load on the file service varies within a specified range.
  • Scaling transparency

    • The service can be extended by incremental growth to deal with a wide range of loads and system sizes.
  • Replication transparency (for very large DS)

    • A file may be represented by several copies of its contents at different locations.
    • It enables multiple servers to share the load of providing a service to many clients, enhancing the performance and scalability of the service;
    • It enhances fault tolerance by enabling a client to locate another server holding a copy of the file when the server it was using fails.
  • Migration transparency (for very large DS)

    • Neither client programs nor system administration tables in client nodes need to be changed when files are moved.
    • This allows file mobility - files, sets or volumes of files may be moved, either by system administrators or automatically.
  • Hardware and operating system heterogeneity

    • The service interfaces should be defined so that client and server software can be implemented for different OS and computers (for openness).
  • The features that are not found in current file services but important in the future:

    • Support for fine-grained distribution of data
      • As the sophistication of distributed application grows, the sharing of data in small units will become necessary.
      • We need to locate individual objects near the processes that are using them and to cache them individually in those locations.
    • Tolerance to network partitioning and detached operation
      • When a file service includes the replication or caching of files, clients may be affected when a network partition occurs.
      • We need to handle the inconsistent database once a network partition occurs.

7.4 File Service Components

  • Flat file service
    • This service is concerned with implementing operations on the contents of files.
    • Unique file identifiers (UFIDs) are used to refer to files in all requests for flat file service operations. (construction of UFID will be discussed later)
  • Directory service
    • provide a mapping between text names for files and their UFIDs. (UNIX uses hierarchic file name)
  • Client module
    • A single client module runs in each client computer
    • It integrates and extends operations of flat file service and directory service under a single application programming interface
    • The interface is available to user-level programs in client computers

7.5 Interface (Flat File Service)

  • RPC interface used by client modules
  • Not normally used directly by user-level programs

7.6 Interface (Directory Service)

To provide a service for translating text names to UFIDs.

Designed for fault tolerance:

  • Repeatable operations: with the exception of create, the operations are idempotent. (repeated execution of create causes a space leak.)
  • Stateless servers: can be restarted after a failure and resume operation without need for clients or other server to restore any state.

7.7 Case Studies: NFS

  • Network File System (NFS) is a distributed file system protocol originally developed by Sun Microsystems (Sun) in 1984, allowing a user on a client computer to access files over a computer network much like local storage is accessed.

  • Design goals:

    • Emulation of the UNIX file system interface
    • Concurrent access
    • One-copy update semantics
      • The file contents seen by all of the processes accessing or updating a given file are those that they would see if only a single copy of the file contents existed.
  • Architecture

    • The basic idea of NFS is to allow an arbitrary collection of clients and servers to share a common file system.
    • Each NFS server exports one or more of its directories for access by remote clients.
  • The list of directories a server exports is maintained in the /etc/exports file, so these directories can be exported automatically whenever the server is booted. (migration transparency)

  • Clients access exported directories by mounting them.

  • When a client mounts a (remote) directory, it becomes part of its directory hierarchy. (location transparency)

  • Local and remote file systems accessible on an NFS client.

  • To access a file, there is (almost) no difference between a file located on a remote file server and a file located on the local disk. (performance transparency)

  • Once the mounts have been done, nothing special has to be done to achieve sharing. (access transparency)

  • If two or more clients mount the same directory at the same time, they can communicate by sharing files in their common directories.

  • The NFS service is stateless and most of the operations of the file access protocol are idempotent. (failure transparency)

  • It does not support replication transparency: file replication is left to a separate service.

  • It does not support concurrency transparency because it does not aim to improve upon the UNIX approach to the control of concurrent updates of files.

    • NFS 4.0 has been extended to support it.
  • It does not scale well: a maximum of about 50 clients, usually 5-10 clients.

File Service Components

  • Virtual File System (VFS)

    • Distinguish local and remote files;
    • Translate between UNIX-independent File Identifiers (FIDs) used by NFS and the internal file identifiers normally used in UNIX and other file systems.
    • UFID used by NFS is called file handle.
    • It is opaque to (not interpreted by) users.
    • Filesystem identifier: a unique number that is allocated to each file system when it is created.
  • NFS communication

    • Port mapper enables Clients to bind to a service in a given host by name.
    • Clients use RPC to access Server interfaces.
    • NFS RPC interface is open.
    • Any authenticated request will be executed.
    • NFS handles the requests according to the UFIDs. (operations are similar to Read, Write, … in Flat file Service)
  • i-node number: a number that serves to identify and locate the file within the file system in which the file is stored.

  • i-node generation number: incremented each time the i-node number is reused.

  • The file handle is made unique by including the i-node generation number.

The format of the UNIX-independent file identifiers (FIDs) used by NFS: filesystem identifier + i-node number + i-node generation number.
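
As a rough model, the file handle can be thought of as these three fields packed into an opaque byte string. The fixed 32-bit field widths below are arbitrary assumptions for illustration, not the actual NFS wire format:

```python
import struct

def make_file_handle(fs_id: int, inode: int, generation: int) -> bytes:
    """Pack the three identifying fields into an opaque byte string."""
    return struct.pack(">III", fs_id, inode, generation)

def parse_file_handle(handle: bytes):
    """Server-side inverse: recover the fields. Clients never do this;
    to them the handle is opaque."""
    return struct.unpack(">III", handle)
```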

  • NFS Client module

    • Simulate UNIX standard file service;
    • Integrated with UNIX kernel;
    • Users can access files via system calls;
    • Single client module serves all user level processes (in a client machine) with shared cache.
    • Path name translation to file handle.
  • NFS server module

    • Integrated with UNIX kernel for performance, including access control and authentication.
    • E.g. Sun RPC requires clients to present their User_Id & Group_Id (in each RPC call); the NFS server checks these against the access permissions in the file attributes.
  • Mount service

    • Mount requests are performed as part of the system initialization process. A client user process uses the mount system call for further mounting when necessary.
    • The mount server uses an RPC interface; a mount request contains the pathname of a directory (to be mounted) and returns the file handle of that directory.
  • Two kinds of mount services for remote file system: hard mounted or soft mounted;

  • Hard mounted (“try hard”): the user process is suspended until the request completes; if the server is unavailable, the client module retries until it succeeds.

  • Soft mounted (“do not try hard”): the NFS client module returns a failure notification to the user process after a few time-out retries; the user process must handle the failure.

  • Most applications take hard mount (why?)

  • Automounter

    • Dynamically mounts a file system when an “empty” mount point is referenced. The file system on the first server to respond is mounted at the client using the normal mount service.
  • Client caching

    • NFS client module caches results of: read, write, get-attribute, lookup and read-dir.
    • Consistency problem: Writing by a client does not result in the immediate updating of cached copies in other clients.
  • Server Caching

    • Enhances the performance of NFS;
    • UNIX server cache
      • File pages, directories and file attributes read from disk are retained in buffer cache.
      • Delayed-write: an altered page is written back to disk only when its buffer space is needed for another page.
      • To prevent data loss, UNIX flushes altered pages every 30 sec.
    • NFS Server cache
      • Read cache is the same as UNIX;
      • NFS flushes altered pages immediately; otherwise, clients sharing a file may lose some updates.
  • Solution 1: Timestamp (read)

    • Each cache entry is associated with a Timestamp.
    • A client requests the last modification time from the server;
    • Compares the time with its cache Timestamp;
    • If (time > Timestamp) (time is more recent) then the cache blocks of the file are invalidated and must be re-fetched.
  • Solution 2: Validation check (read)

    • Freshness intervals: 3 sec. for files & 30 sec. for directories (a sketch of this validity check follows solution 3 below).
    • Files therefore have a vulnerability window of 3 sec.
    • This seems to be tolerable for most applications.
  • Solution 3: Bio-daemon (block I/O manager process) implemented by the NFS client.

    • Read-Ahead: a bio-demon is notified after each read request and it requests the transfer of the following file block from the server to the client cache.
    • Delayed-write: a block is sent to the server once the block is filled.
    • Directory blocks are sent whenever a modification has occurred.
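
A minimal sketch of the client-side validity check from solutions 1 and 2, assuming t_validated is the time the cache entry was last validated, t_modified_client is the file's last-modification time as recorded by the client, and the freshness intervals are those above (the function names are illustrative):

```python
import time

FRESHNESS = {"file": 3.0, "directory": 30.0}  # seconds (solution 2)

def cache_entry_valid(kind, t_validated, t_modified_client, get_server_mtime):
    """Return True if a cached block may be used without re-fetching."""
    now = time.time()
    if now - t_validated < FRESHNESS[kind]:
        return True   # validated recently: trust the cache (vulnerability window)
    # Freshness interval expired: ask the server for the last-modification time.
    if get_server_mtime() == t_modified_client:
        return True   # unchanged on the server: the entry can be revalidated
    return False      # modified on the server: invalidate and re-fetch (solution 1)
```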

7.8 Case study 2: Apache Hadoop

  • Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation.
  • It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
  • Hadoop File System was developed using distributed file system design.
  • It is fault-tolerant and designed using low-cost hardware.
  • To store such huge data, the files are stored across multiple machines. These files are stored in redundant fashion to rescue the system from possible data losses in case of failure.

Features of HDFS

  • It is suitable for distributed storage and processing.
  • Hadoop provides a command interface to interact with HDFS.
  • The built-in servers of namenode and datanode help users easily check the status of the cluster.
  • Streaming access to file system data.
  • HDFS provides file permissions and authentication.

Namenode

  • The namenode is commodity hardware that runs the GNU/Linux operating system and the namenode software.
  • The namenode acts as the master server and it does the following tasks.
  • Manages the file system namespace.
  • Regulates client’s access to files.
  • It also executes file system operations such as renaming, closing, and opening files and directories.

Datanode

  • The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there is a datanode.
  • These nodes manage the data storage of their system.
  • Datanodes perform read-write operations on the file systems, as per client request.
  • They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.

File Block

  • The file in HDFS will be divided into one or more segments and stored in individual data nodes.
  • These file segments are called blocks.
  • In other words, the minimum amount of data that HDFS can read or write is called a Block.
  • The default block size is 64 MB (128 MB in Hadoop 2.x and later), and it can be changed in the HDFS configuration as needed.
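
For instance, a 200 MB file stored with 64 MB blocks occupies four blocks (three full blocks plus one 8 MB block); a quick sketch of the arithmetic:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default

def num_blocks(file_size_bytes: int) -> int:
    """Number of HDFS blocks a file of the given size occupies."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

print(num_blocks(200 * 1024 * 1024))  # -> 4 (three full blocks + one 8 MB tail)
```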

Goals of HDFS

  • Fault detection and recovery − Since HDFS includes a large number of commodity hardware components, failures are frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
  • Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications having huge datasets.
  • Hardware at data − A requested task can be done efficiently, when the computation takes place near the data. Especially where huge datasets are involved, it reduces the network traffic and increases the throughput.

7.9 Exercises

In-Class Exercise 1

  • What kinds of transparency should a DFS support? Which transparencies are required only by large distributed systems? Why?

  • Can you discuss how HDFS supports any of the above transparencies?

    Access Transparency

    Location Transparency

    Concurrency Transparency

    Failure Transparency

    Performance Transparency

    Scaling Transparency

    Replication Transparency

    Migration Transparency

Revision Exercise 2

Consider the following diagram of NFS

  • What are the main functions of the Virtual file System and NFS Server?

    Virtual File System

    • Distinguish local and remote files;
    • Translate between UNIX-independent File Identifiers (FIDs) used by NFS and the internal file identifiers normally used in UNIX and other file systems.

    NFS Server Module

    • Integrated with UNIX kernel for performance, including access control and authentication
  • Which modules in the diagram may be enhanced by caching? Briefly explain your answer.

    NFS client caching and server caching.

    Client caching

    • NFS client module caches results of: read, write, get-attribute, lookup and read-dir.
    • Consistency problem: Writing by a client does not result in the immediate updating of cached copies in other clients.

    Server Caching

    … see Server Caching in 7.7

8. Blockchain and Digital Cash

8.1 Fault Tolerance

  • Replication
  • Consistency among replicas is maintained by consensus algorithms
  • State machine replication: E.g. blockchain
  • Type of Fault
    • Node is down
    • Node is performing malicious / inconsistent behavior

8.2 Consensus Mechanisms

Requirements

  • Agreement: All honest nodes decide on the same value

  • Termination: All honest nodes terminate execution of the consensus process and eventually reach a decision

  • Validity: The final agreed value must be the one proposed by at least one honest node

  • Fault tolerant: Allow the presence of faulty or malicious node (Byzantine node)

  • Byzantine fault tolerance-based

    • Nodes publish signed messages
    • Agreement is reached when a certain number of messages are received
  • Leader-based consensus mechanisms

    • By competition
    • The winning node proposes the final value

8.3 CAP Theorem

  • “Any distributed system cannot have Consistency, Availability, and Partition tolerance simultaneously”
  • Consistency: all nodes have the same latest copy of data
  • Availability: the system is up and running, able to serve client without failure
  • Partition tolerance: the system continues to operate correctly even when the network partitions and groups of nodes cannot communicate with each other

8.4 Hashcash

  • Introduced by Adam Back in 1997

  • Proof of work (PoW) to control email spam

  • Sender must compute a hash for sending an email (as a proof that they have spent a reasonable amount of computing resources)

  • Generating hashcash is a compute intensive process, but easy and quick to verify (by receiver)

  • Acceptable cost for a normal user, but too costly for a spammer
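
A minimal hashcash-style sketch in Python: the sender increments a counter until the SHA-1 hash of the stamp has enough leading zero bits, while the receiver verifies with a single hash. The stamp layout and the 20-bit difficulty are illustrative assumptions; real hashcash stamps use a specific versioned text format.

```python
import hashlib
from itertools import count

DIFFICULTY_BITS = 20  # ~2**20 hash attempts of work on average

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def mint_stamp(resource: str) -> str:
    """Sender: search for a counter that makes the stamp hash hard enough
    (compute intensive)."""
    for counter in count():
        stamp = f"{resource}:{counter}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= DIFFICULTY_BITS:
            return stamp

def verify_stamp(stamp: str) -> bool:
    """Receiver: one hash suffices (easy and quick to verify)."""
    return leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= DIFFICULTY_BITS

# stamp = mint_stamp("alice@example.com")  # cheap once, ruinous a million times
```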

  • B-Money

    • Introduced by Wei Dai in 1998
    • Creating money via solving computational puzzles such as hashcash
    • Each node maintains its own list of transactions
  • BitGold

    • Nick Szabo, 2005
    • Solving computational puzzles to mint digital currency
  • Cryptographic currency

    • 2005, Hal Finney
    • B-money + hashcash puzzles
  • There was no clear solution for resolving conflicts between nodes; these schemes still relied on a centralized trusted authority

8.5 Bitcoin

  • 2009
  • Achieved distributed consensus in a trustless network
  • Public key cryptography with hashcash as PoW
  • Provide secure, controlled, and decentralized method of minting digital currency
  • Blockchain = Electronic cash scheme + distributed systems

8.6 Blockchain

Tiers of Blockchain Technology

  • Blockchain 1.0
    • Cryptocurrencies
  • Blockchain 2.0
    • Financial services and contracts
  • Blockchain 3.0
    • Other Applications used in general-purpose industries such as government, health, media

What is Blockchain

  • Platform for peers to exchange values using transactions without central trusted arbitrator -> Decentralized consensus mechanism
  • Cryptographically secure
  • Append-only
  • Immutable
  • Updateable only via consensus among peers
  • Decentralized consensus mechanism
  • Distributed shared ledger. Transactions are ordered and grouped into blocks
  • Data structure. A linked list with hash pointers pointing to the previous block
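
A minimal sketch of the hash-pointer linked list, assuming SHA-256 over a canonical JSON encoding and a simplified block layout (the field names are illustrative): because each block stores the hash of its predecessor, altering any earlier block breaks every later link and is immediately detectable.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """Hash a block's canonical JSON encoding."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, transactions: list) -> None:
    """Link a new block to the chain via the previous block's hash."""
    prev = block_hash(chain[-1]) if chain else "0" * 64  # genesis convention
    chain.append({"prev_hash": prev, "transactions": transactions})

def verify_chain(chain: list) -> bool:
    """Recompute every hash pointer; any tampering breaks a link."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain = []
append_block(chain, ["A pays B 1 coin"])
append_block(chain, ["B pays C 0.5 coin"])
assert verify_chain(chain)
chain[0]["transactions"][0] = "A pays B 100 coins"  # tamper with history
assert not verify_chain(chain)                      # detected immediately
```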

Quick Question

  • How to carry out “private” communication in an open network?
  • How to identify (and prove) who you are in an open network?

Secure key

Public and Private Key

Protection Against Eavesdropping

Digital signature


Hash function

  • Takes a string of any length
  • Outputs a fixed-length hash code
  • A small change in the string gives a totally different hash code
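
A quick demonstration using SHA-256 (the choice of hash function here is ours; any cryptographic hash behaves the same way): removing a single character produces a completely unrelated digest.

```python
import hashlib

for msg in ("distributed systems", "distributed system"):
    print(hashlib.sha256(msg.encode()).hexdigest())
# The two 64-hex-digit outputs share no visible structure, despite the
# inputs differing by one character.
```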

Blockchain

The nonce is the number that blockchain miners are solving for.

Nonce

(random number)

  • Nonce, or a “number only used once,” refers to the number a blockchain miner must discover before solving for a block: the miner varies the nonce until the block’s hash meets the difficulty target.
  • Once a valid nonce is found, the miner is rewarded with cryptocurrency for the work.
  • Valid nonces are difficult to find, which weeds out miners with insufficient computing power.
  • Crypto mining is challenging, and one usually needs substantial computational power even to begin searching for a nonce.
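
Continuing the hash-pointer sketch above, mining is a brute-force search over the nonce field. The difficulty target of four leading hex zeros is an illustrative assumption; Bitcoin's real target is vastly harder and encoded differently.

```python
import hashlib
import json
from itertools import count

DIFFICULTY_PREFIX = "0000"  # illustrative target: four leading hex zeros

def mine(block: dict) -> dict:
    """Vary the nonce until the block hash meets the difficulty target."""
    for nonce in count():
        block["nonce"] = nonce
        digest = hashlib.sha256(
            json.dumps(block, sort_keys=True).encode()).hexdigest()
        if digest.startswith(DIFFICULTY_PREFIX):
            return block  # ~16**4 = 65536 attempts on average

mined = mine({"prev_hash": "0" * 64, "transactions": ["A pays B 1 coin"]})
print(mined["nonce"])  # hard to find, trivial for peers to re-check
```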

Generic Element of Blockchain

  • Addresses
    • Unique identifiers used in a transaction stored in blockchain
    • Contain sender and recipient information
    • Public key or derived from a public key
    • Can be reused (for the same user) or one-off
  • Transaction
    • Fundamental unit of a blockchain
    • Transfer of value from one address to another
  • Blocks
    • A block is composed of multiple transactions
    • It also contains other elements, such as the previous block hash, a timestamp and a nonce
  • Peer-to-peer network
    • Network topology for peer to peer communication
  • Scripting or programming language, Virtual machine
    • Predefined sets of commands for nodes to transfer tokens
  • Nodes
    • Node in a blockchain network
    • Provide various functions depending on its role
      • Propose and validate transaction
      • Mining to facilitate consensus and secure the blockchain
  • Smart contracts
    • Program runs on top of the blockchain
    • Business logic to be executed when certain conditions are met.
    • These actions could include releasing funds to the appropriate parties, registering a vehicle, sending notifications, or issuing a ticket

Blockchain: Features

  • Distributed consensus
    • Achieve a single version of truth (agreed by all parties) without the need of a central authority
  • Transaction verification
    • Only transactions that fulfill the predetermined set of rules can be included in a block
  • Platforms for smart contracts
    • Execute business logic on behalf of the user
    • Not all blockchains support the execution of smart contract
    • Desirable feature
  • Transferring value between peers
  • Generating cryptocurrency
    • Reward to the miners who validate the transactions and secure the blockchain
  • Smart property
    • Allows someone to claim ownership of property on the blockchain
  • Immutability
    • Transaction added onto the blockchain is immutable

Accumulate Blocks

  1. A node uses its private key to sign a transaction.
  2. The transaction is propagated to peers, which validate it against the pre-set criteria. (multiple peers are generally required)
  3. After validation, the transaction is added to a block, and the block propagates onto the network. (transaction confirmed)
  4. The next block links to the block created in step (3). The transaction is double confirmed when the newly added block is confirmed.
  5. The transaction is reconfirmed whenever a new block is created.
  6. After six confirmations in the bitcoin network, the transaction is final.

Benefit

  • Decentralization
  • Transparency and trust
    • Blockchains are shared
  • Immutability
    • It is extremely difficult to change confirmed data back
  • High availability
    • Involve thousands of nodes in a peer-to-peer network
    • Data is replicated and updated on each and every node
  • Highly secure
    • All transactions on a blockchain are cryptographically secured and provide integrity

Challenges and Limitations

  • Scalability
  • Adaptability
  • Regulation
  • Relatively immature technology
  • Privacy

Decentralization and Blockchain

  • Central
    • client-server system
    • Single authority controls the system
  • Distributed
    • Data and computation are distributed to multiple nodes in the network
    • A central authority controls all nodes and governs processing
  • Decentralized
    • No authority exists
    • Nodes are not dependent on a single master node
    • Decentralized consensus
  • Disintermediation
    • In blockchain, anyone can add a transaction into a block; no intermediary is involved
  • Through competition
    • Instead of eliminating the intermediary, the system provides free choice of intermediary
    • E.g. Smart contract in blockchain

Blockchain and Full Ecosystem Decentralization

  • Storage
    • Blockchain is not designed to store large amounts of data, e.g. images
    • It aims to store simple transactions
    • One can use distributed hash tables (DHTs), which originally come from peer-to-peer file-sharing software, e.g. BitTorrent
    • InterPlanetary File System (IPFS), by Juan Benet: a decentralized World Wide Web, using the Kademlia DHT and a Merkle DAG (directed acyclic graph) to provide storage and searching functionality.
    • Filecoin: pays incentives to nodes that store data, using the BitSwap mechanism
  • Communication
    • Internet is decentralized by design
    • But in practice, central authorities (e.g. ISPs, email service providers) are in control
    • Some apps support communication in a peer-to-peer fashion without the Internet
  • Computation
    • With blockchain technology, e.g. Ethereum, one can send cryptocurrency to anyone for a small fee.
    • E.g. smart contracts with embedded business logic can run on the network

8.7 Bitcoin(2)

Bitcoin Transaction

  • Bitcoin transaction tells the network that the owner of some bitcoin value has authorized the transfer of that value to another owner
  • New owner sends the bitcoin by creating another similar transaction
  • Each transaction contains one or more “inputs” and “outputs”
  • Inputs: bitcoin account to be debited
  • Outputs: bitcoin account to be credited
  • Inputs and outputs are expressed as bitcoin addresses
  • Unspent bitcoin is returned as an output to the original owner, at the same or a different bitcoin address
  • Transactions move value from transaction inputs to transaction outputs
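
A simplified sketch of this input/output bookkeeping (the structures and names are illustrative assumptions; real transactions carry scripts and signatures rather than bare addresses): a transaction consumes previously unspent outputs and creates new outputs, returning any unspent value to the sender.

```python
# Illustrative unspent-output set: (txid, output index) -> (address, amount).
utxos = {("tx0", 0): ("alice", 1.0)}

def spend(utxos, inputs, payments, change_address):
    """Consume the given inputs and create outputs, returning change."""
    total_in = sum(utxos.pop(ref)[1] for ref in inputs)  # debit the inputs
    total_out = sum(amount for _, amount in payments)
    change = total_in - total_out                        # back to the sender
    txid = f"tx{len(utxos) + 1}"    # stand-in for a real transaction hash
    for index, (address, amount) in enumerate(payments + [(change_address, change)]):
        utxos[(txid, index)] = (address, amount)         # credit the outputs
    return txid

# Alice pays Bob 0.5 and Carol 0.2 from a 1.0 output; 0.3 returns as change.
spend(utxos, [("tx0", 0)], [("bob", 0.5), ("carol", 0.2)], "alice")
print(utxos)
```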

Common Transaction Forms

Transaction input

  • Full-node client
    • Contains a copy of every unspent output from every transaction in the blockchain
    • Able to construct transaction inputs
    • Can quickly verify incoming transactions (checking that they have correct inputs)
    • Requires a lot of disk space
  • Lightweight client
    • May track only the user’s own unspent outputs
  • Wallet application without unspent transaction output
    • Needs to query the bitcoin network or ask a full node to retrieve unspent transaction outputs

Transaction Output

  • A script that creates a claim on the value; it can be redeemed only by presenting a solution to the script
  • E.g. a transaction output will contain a script that says something like “This output is payable to whoever can present a signature from the key corresponding to B’s public address”
  • Since only B’s wallet holds the corresponding private key, only B can produce the signature and redeem the output
  • The transaction fee is paid as an output to the miner for validating the transaction and including it in a block to be recorded on the blockchain

Transaction Confirmation

  • Bitcoin network is a peer to peer network
  • Node connects to several other nodes in bitcoin network
  • Nodes exchange transactions and blocks
  • New transaction is flooded to all clients
  • New transaction will be validated by multiple nodes and added to the blockchain
  • Direct connection between input node and output node is not required

Bitcoin Mining

  1. A new transaction is first added to the temporary pool of unverified transactions maintained by each node.
  2. A miner constructs a new block with the following elements:
    • Unverified transactions from the temporary pool
    • A reference to the previous block
    • A transaction that pays his own bitcoin address as the block reward
    • The transaction fees for including all the transactions in this block
  3. The miner proves the validity of his new block, which is computation intensive.
  4. Multiple miners are involved in steps (2) and (3); each includes a special transaction (the reward and fees) in his own block.
  5. The miner who first provides the proof of validity of his block wins the competition, and his block is added to the blockchain.

Nodes in Bitcoin Network

  • Full Nodes
    • Core clients
    • Functionality: Wallet, miner, full blockchain storage, network routing functions
  • SPV (Simplified Payment Verification) Nodes
    • Lightweight clients
    • Wallet and network routing
  • Pool protocol Servers
    • Nonstandard nodes
    • Using alternative protocols, e.g. stratum protocol
    • Mainly for mining and compute hashes

Bitcoin Wallets

  • Software that stores private and public keys and bitcoin addresses
  • Receiving bitcoins
  • Sending bitcoins
  • Wallets do not store any coins
  • In bitcoin network, coins do not exist
  • Only transaction information is stored on the blockchain

Wallet Types

  • Non-deterministic wallets
    • Contain randomly generated private keys (“Just a Bunch Of Keys” wallets)
    • Bitcoin core client generates some keys when first started or when required
    • Managing a large number of keys is difficult and error prone (risking theft and loss of coins), and regular backups are needed
  • Deterministic wallets
    • Keys are derived from a seed value via hash functions (see the sketch after this list)
    • Seed number is generated randomly
    • Seed number is represented by human-readable mnemonic code words
    • All keys can be recovered by mnemonic code
  • Hierarchical Deterministic Wallets
    • Do not generate keys directly; they only produce information that can be used to generate a sequence of private keys
    • Store keys in a tree structure derived from a seed
    • The seed generates the parent key (master key), which is used to generate child keys (which are used to generate grandchild keys, and so on)
    • Easily recoverable and maintainable
    • Highly portable
  • Brain wallets
    • The master private key is derived from the hash of memorized passwords
    • Effectively an HD wallet derived from a single memorized password
    • Subject to password guessing and brute-force attacks
  • Paper wallets
    • Required key material is printed on a paper
  • Hardware wallets
    • Use custom-built device or NFC-enabled phone to store keys
  • Online wallets
    • Store keys in cloud service provider
    • Users manage their keys via web interface
  • Mobile wallets
    • Wallets are installed on mobile device
    • Use camera to scan QR code and make payments
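
The sketch promised under deterministic wallets: deriving a sequence of private keys from one seed via a hash function. The derivation rule key_i = H(seed || i) is a deliberate simplification of what real schemes (e.g. BIP-32 hierarchical deterministic wallets) do, and mnemonic encoding of the seed is omitted.

```python
import hashlib

def derive_key(seed: bytes, index: int) -> bytes:
    """Derive the index-th private key from the seed (simplified)."""
    return hashlib.sha256(seed + index.to_bytes(4, "big")).digest()

# In a real wallet the seed is generated randomly and shown to the user
# as mnemonic code words; hashing a phrase stands in for that here.
seed = hashlib.sha256(b"mnemonic words would yield this seed").digest()
keys = [derive_key(seed, i) for i in range(3)]
# Anyone holding the seed (or its mnemonic) can regenerate every key,
# which is what makes deterministic wallets easy to back up and recover.
```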

8.8 Exercise

In-Class Exercise 1

Briefly describe how the public and private keys are used in blockchain and bitcoin.

Answers:

  • Public key will be used as the address of the sender or the recipient of a transaction.
  • The private key represents ownership of the bitcoin and is used to sign transactions (fund transfers).

In-Class Exercise 2

Consider the following Bitcoin transactions.

What is the value of X in the last transaction?

X = 0.3BTC (1.0 - 0.5 - 0.2)

Suppose the following addresses are stored in Peter’s Wallet:

3xADF341e

a4Y0a23bb

What is the final balance of Peter’s Bitcoin account after all these transactions?

Ans = 8.1BTC (7 + 0.9 + 0.2)
