From 54df2afaa61c6a03cbb4a33c9b90fa572b6d07b8 Mon Sep 17 00:00:00 2001 From: Jesse Morgan Date: Sat, 17 Dec 2016 21:28:53 -0800 Subject: Berkeley DB 4.8 with rust build script for linux. --- db-4.8.30/rep/mlease.html | 1197 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1197 insertions(+) create mode 100644 db-4.8.30/rep/mlease.html (limited to 'db-4.8.30/rep/mlease.html') diff --git a/db-4.8.30/rep/mlease.html b/db-4.8.30/rep/mlease.html new file mode 100644 index 0000000..85b0aca --- /dev/null +++ b/db-4.8.30/rep/mlease.html @@ -0,0 +1,1197 @@ + + + + + + Master Lease + + + +

Master Leases for Berkeley DB

Susan LoVerso
sue@sleepycat.com
Rev 1.1
2007 Feb 2

What are Master Leases?

A master lease is a mechanism whereby clients grant mastership rights to a site, and that master, by holding lease rights, can provide a guarantee of durability to a replication group for a given period of time. By granting a lease to a master, a client agrees not to participate in an election of a new master until that granted lease has expired. By holding a collection of granted leases, a master can serve authoritative read requests to applications. While it holds leases, a read operation on a master can guarantee several things to the application:

Authoritative reads: a guarantee that the data being read by the application is durable and can never be rolled back.
Freshness: a guarantee that the data being read by the application at the master is not stale.
Master viability: a guarantee that a current master with valid leases will not encounter a duplicate master situation.

Requirements

The requirements of DB to support this include:

Once leases are turned on, users can choose on each read whether to honor or ignore them.
We are providing read authority on the master only. A read on a client is equivalent to a read while ignoring leases.
We guarantee that data committed on a master that has been read by an application on the master will not be rolled back. Data read on a client, data read while ignoring leases, and data successfully updated/committed but not read may be rolled back.
Unless leases are ignored, a master will not return successfully from a read operation unless it holds a majority of leases.
Master leases will remove the possibility of a current/correct master being "shot down" by DUPMASTER. NOTE: Old/expired masters may, however, discover a later master and return DUPMASTER to the application.
Any send callback failure must result in premature lease expiration on the master.
Users who change the system clock during master leases void the guarantee and may get undefined behavior. We assume time always runs forward.
Clients are forbidden from participating in elections while they have an outstanding lease granted to another site.
Clients are forbidden from accepting a new master while they have an outstanding lease granted to another site.
Clients are forbidden from upgrading themselves to master while they have an outstanding lease granted to another site.
When asked for a lease grant explicitly by the master, the client cannot grant the lease to the master unless the LSN in the master's request has been processed by this client.

The requirements of the application using leases include:

Users must implement (Base API users on their own, RepMgr users via configuration) a majority (or larger) ACK policy.
The application must use the election mechanism to decide a master. It may not simply declare a site master.
The send callback must return an error if the majority ACK policy is not met for PERM records.
Users must set the number of sites in the group.
Using leases in a replication group is all-or-none. Therefore, if a site knows it is using leases, it can assume other sites are also.
All applications that care about read guarantees must forward or perform all reads on the master. Reading on the client means a read ignoring leases.
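The majority ACK requirement on the send callback can be illustrated with a minimal sketch. Everything named here (majority_acks_needed, perm_send_status, the SEND_* codes) is hypothetical and not part of DB. The sketch also assumes the master counts itself as one durable copy, so that nsites / 2 client acknowledgements complete a majority; an application with a different counting rule would adjust the threshold.

```c
#include <assert.h>

/* Hypothetical status codes; a real send callback returns 0 or an error. */
#define SEND_OK		0
#define SEND_FAILED	(-1)

/*
 * Client acknowledgements needed for a majority of an nsites-member
 * group, ASSUMING the master counts itself as one durable copy.
 */
static unsigned int
majority_acks_needed(unsigned int nsites)
{
	assert(nsites > 0);
	return (nsites / 2);
}

/* Result the send callback should report for a PERM record. */
static int
perm_send_status(unsigned int nsites, unsigned int acks_received)
{
	return (acks_received >= majority_acks_needed(nsites) ?
	    SEND_OK : SEND_FAILED);
}
```

With a five-site group, two client acknowledgements are enough under this assumption, and a single acknowledgement makes the callback report failure.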

There are some open questions remaining.

There is one major showstopper issue; see "Crashing - Potential problem" near the end of the document. We need a better solution than the one shown there (writing to disk every time a lease is granted). Perhaps just documenting that durability means it must be flushed to disk before success to avoid that situation?
What about db->join? Users can call join, but the calls on the join cursor to get the data would be subject to leases and therefore protected. Ok, this is not an open question.
What about other read-like operations? Clearly DB->get, DB->pget, DBC->get, DBC->pget need lease checks. However, other APIs use keys. DB->key_range provides an estimate only, so it shouldn't need lease checks. DB->stat provides exact counts in the bt_nkeys and bt_ndata fields. Are those fields considered authoritative, such that providing those values implies a durability guarantee, and should DB->stat therefore be subject to lease verification? DBC->count provides a count of the number of data items associated with a key. Is this authoritative information? This is similar to stat: should it be subject to lease verification?
Do we require master lease checks on write operations? I think lease checks are not needed on write operations. They don't add correctness and they add a lot of complexity (checking leases in put, del, and cursors, then what about rename, remove, etc).
Do master leases give an iron-clad guarantee of never rolling back a transaction? No, but it should mean that a committed transaction can never be read on a master unless the lease is valid. A committed transaction on a master that has never been presented to the application may get rolled back.
Do we need to quarantine or prevent reads on an ex-master until sync-up is done? No. A master that is simply downgraded to client or crashes and reboots is now a client. Reading from that client is the same as saying Ignore Leases.
What about adding and removing sites while leases are active? This is SR 14778. A consistent nsites value is required by master leases. It isn't clear to me what a master is supposed to do if the value of nsites gets smaller while leases are active. Perhaps it leaves its larger table intact and simply checks for a smaller number of granted leases?
Can users turn leases off? No. There is no planned API to turn leases off.
Clock skew will be a percentage. However, the smallest, 1%, is probably rather large for clock skew. Percentage was chosen for simplicity and similarity to other APIs. What granularity is appropriate here?

API Changes

The API changes that are visible to the user are fairly minimal. There are a few API calls they need to make to configure master leases and then there is the API call to turn them on. There is also a new flag to existing APIs that allows read operations to ignore leases and return data that may not be durable.

Lease Timeout
+

There is a new timeout the user must configure for leases called DB_REP_LEASE_TIMEOUT. This timeout will be new to the dbenv->rep_set_timeout method. The DB_REP_LEASE_TIMEOUT has no default and it is required that the user configure a timeout before they turn on leases (obviously, this timeout need not be set if leases will not be used). That timeout is the amount of time the lease is valid on the master and how long it is granted on the client. This timeout must be the same value on all sites (like log file size). The timeout used when refreshing leases is the DB_REP_ACK_TIMEOUT for RepMgr applications. For Base API applications, lease refreshes will use the same mechanism as PERM messages and they should have no additional burden. The refresh timeout is the amount of time a reader will wait to refresh leases before returning failure to the application from a read operation.

The lease timeout will be stored with its original value, and it will also be converted to a db_timespec using the DB_TIMEOUT_TO_TIMESPEC macro, with the clock skew accounted for, and stored in the shared rep structure:

db_timeout_t lease_timeout;
db_timespec lease_duration;

NOTE: By sending the lease refresh during DB operations, we are forcing/assuming that the operation's process has a replication transport function set. That is obviously the case for write operations, but would it be a burden for read processes (on a master)? I think mostly not, but if we need leases for DB->stat then we need to document it, as it is certainly possible for an application to have a separate or dedicated stat application or to attempt to use db_stat (which will not work if leases must be checked).

Leases should be checked after the local operation. If we were to check leases first, there would be a window in which we could get descheduled, then lose our lease, and then perform the operation. Do the operation, then check leases before returning to the user.

Using Leases

There is a new API that the user must call to tell the system to use the lease mechanism. The method must be called before the application calls dbenv->rep_start or dbenv->repmgr_start. This new method is:

    dbenv->rep_set_lease(DB_ENV *dbenv, u_int32_t clock_scale_factor, u_int32_t flags)

+

The clock_scale_factor parameter is interpreted as a percentage, greater than 100 (to transmit a floating point number as an integer to the API), that represents the maximum skew between any two sites' clocks. That is, a clock_scale_factor of 150 suggests that the greatest discrepancy between clocks is that one runs 50% faster than the others. Both the master and client sides compensate for possible clock skew. The master uses the value to compensate in case the replica has a slow clock and replicas compensate in case they have a fast clock. This scaling factor will need to be divided by 100 on all sites to truly represent the percentage for adjustments made to time values.

Assume the slowest replica's clock is a factor of clock_scale_factor slower than the fastest clock. Using that assumption, if the fastest clock goes from time t1 to t2 in X seconds, the slowest clock does it in (clock_scale_factor / 100) * X seconds.

The flags parameter is not currently used.

When the dbenv->rep_set_lease method is called, we will set a configuration flag indicating that leases are turned on:
#define REP_C_LEASE <value>.
We will also record the u_int32_t clock_skew value passed in. The rep_set_lease method will not allow calls after rep_start. If multiple calls are made prior to calling rep_start then later calls will overwrite the earlier clock skew value.
We need a new flag to prevent calling rep_set_lease after rep_start. The simplest solution would be to reject the call to rep_set_lease if REP_F_CLIENT or REP_F_MASTER is set. However, that does not work in the cases where a site cleanly closes its environment and then opens without running recovery. The replication state will still be set. The prevention will be implemented as:

#define REP_F_START_CALLED <some bit value>

In __rep_start, at the end:

if (ret == 0) {
	/* Record, under the system lock, that rep_start has been called. */
	REP_SYSTEM_LOCK(dbenv);
	F_SET(rep, REP_F_START_CALLED);
	REP_SYSTEM_UNLOCK(dbenv);
}

In __rep_env_refresh, if we are the last reference closing the env (we already check for that):

F_CLR(rep, REP_F_START_CALLED);

In order to avoid run-time floating point operations on db_timespec structures, when a site is declared as a client or master in rep_start we will pre-compute the lease duration based on the integer-based clock skew and the integer-based lease timeout. A master should set a replica's lease expiration to the start time of the sent message + (lease_timeout / clock_scale_factor) in case the replica has a slow clock. Replicas extend their leases to received message time + (lease_timeout * clock_scale_factor) in case this replica has a fast clock. Therefore, the computation will be as follows if the site is becoming a master:

db_timeout_t tmp;
tmp = (db_timeout_t)((double)rep->lease_timeout / ((double)rep->clock_skew / (double)100));
rep->lease_duration = DB_TIMEOUT_TO_TIMESPEC(&tmp);

+Similarly, on a client the computation is:
+

tmp = (db_timeout_t)((double)rep->lease_timeout * ((double)rep->clock_skew / (double)100));

When a site changes state, its lease duration will change based on whether it is becoming a master or client, and it will be recomputed from the original values. Note that these computations, coupled with the fact that the lease on the master is computed from the time the master sent the message, mean that leases on the master are computed more conservatively than on the clients.
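As a worked illustration of the two computations, here is a minimal self-contained sketch. The type lease_usec_t and both function names are hypothetical stand-ins (for db_timeout_t and for the inline computations above), and the timeout is assumed to be expressed in microseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for db_timeout_t: a duration in microseconds. */
typedef uint32_t lease_usec_t;

/*
 * Master side: shrink the configured timeout in case a replica's clock
 * runs slow. clock_skew is a percentage; values below 100 would invert
 * the adjustment, so they are rejected here.
 */
static lease_usec_t
master_lease_duration(lease_usec_t lease_timeout, uint32_t clock_skew)
{
	assert(clock_skew >= 100);
	return ((lease_usec_t)((double)lease_timeout /
	    ((double)clock_skew / 100.0)));
}

/*
 * Client side: stretch the configured timeout in case this replica's
 * clock runs fast. The caller must keep the product within 32 bits.
 */
static lease_usec_t
client_lease_duration(lease_usec_t lease_timeout, uint32_t clock_skew)
{
	assert(clock_skew >= 100);
	return ((lease_usec_t)((double)lease_timeout *
	    ((double)clock_skew / 100.0)));
}
```

For a 10-second timeout (10000000 microseconds) and a clock skew of 150, the master would treat the lease as valid for 6666666 microseconds after sending, while the client would honor its grant for 15000000 microseconds after receipt.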
+
The dbenv->rep_set_lease method must be called after dbenv->open, similar to dbenv->rep_set_config. The reason is so that we can check that this is a replication environment and we have access to the replication shared memory region.

Read Operations
+

Authoritative read operations on the master with leases enabled will abide by leases by default. We will provide a flag that allows an operation on a master to ignore leases. All read operations on a client imply ignoring leases. If an application wants authoritative reads, it must forward the read requests to the master, and it is the application's responsibility to provide the forwarding. The consensus was that forcing DB_IGNORE_LEASE on client read operations (with leases enabled, obviously) was too heavy-handed. Read operations on the client will ignore leases, but do no special flag checking.

The flag will be called DB_IGNORE_LEASE and it will be a flag that can be OR'd into the DB access method and cursor operation values. It will be similar to the DB_READ_UNCOMMITTED flag.
The methods that will adhere to leases are:

Db->get
Db->pget
Dbc->get
Dbc->pget

The code that will check leases for a client reading would look something like this, if we decide to become heavy-handed:

if (IS_REP_CLIENT(dbenv)) {
	[get to rep structure]
	if (FLD_ISSET(rep->config, REP_C_LEASE) && !LF_ISSET(DB_IGNORE_LEASE)) {
		db_err("Read operations must ignore leases or go to master");
		ret = EINVAL;
		goto err;
	}
}

On the master, the new code to abide by leases is more complex. After the call to perform the operation we will check the lease. In that checking code, the master will see if it has a valid lease. If so, then all is well. If not, it will try to refresh the leases. If that refresh attempt results in leases, all is well. If the refresh attempt does not get leases, then the master cannot respond to the read as an authority and we return an error. The new error is called DB_REP_LEASE_EXPIRED. The master lease check is placed after the internal call to read the data has succeeded:

if (IS_REP_MASTER(dbenv) && !LF_ISSET(DB_IGNORE_LEASE)) {
	[get to rep structure]
	if (FLD_ISSET(rep->config, REP_C_LEASE) &&
	    (ret = __rep_lease_check(dbenv)) != 0) {
		/*
		 * We don't hold the lease.
		 */
		goto err;
	}
}

See below for the details of __rep_lease_check.

Also note that if leases (or replication) are not configured, then DB_IGNORE_LEASE is a no-op. It is ignored (and won't error) if used when leases are not in effect. The reason is so that we can generically set that flag in utility programs like db_dump that walk the database with a cursor. Note that db_dump is the only utility that reads with a cursor.

Nsites and Elections

The call to dbenv->rep_set_nsites must be performed before the call to dbenv->rep_start or dbenv->repmgr_start. This document assumes either that SR 14778 gets resolved, or that the value of nsites is immutable. The master and all clients need to know how many sites and leases are in the group. Clients need to know for elections. The master needs to know for the size of the lease table and to know what value a majority of the group is. [Until 14778 is resolved, the master lease work must assume nsites is immutable and will therefore enforce that this is called before rep_start using the same mechanism as rep_set_lease.]

Elections and leases need to agree on the number of sites in the group. Therefore, when leases are in effect on clients, all calls to dbenv->rep_elect must set the nsites parameter to 0. The rep_elect code path will return EINVAL if REP_C_LEASE is set and nsites is non-0.

Lease Management

Message Changes

In order for clients to grant leases to the master, a new message type must be added for that purpose. This will be the REP_LEASE_GRANT message. Granting leases will be a result of applying a DB_REP_PERMANENT record, and therefore we do not need any additional message in order for a master to request a lease grant. The REP_LEASE_GRANT message will pass a structure as its message DBT:

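typedef struct __rep_lease_grant {
	db_timespec msg_time;		/* Master's timestamp, echoed back by the client. */
#ifdef DIAGNOSTIC
	db_timespec expire_time;	/* Client's expiration time, diagnostics only. */
#endif
} REP_GRANT_INFO;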

In the REP_LEASE_GRANT message, the client is actually giving the master several pieces of information. We only need the echoed msg_time in this structure because everything else is already sent. The client is really sending the master:

Its EID (parameter to rep_send_message and rep_process_message)
The PERM LSN this message acknowledged (sent in the control message)
Unique identifier echoed back to master (msg_time sent in message as above)

On the client, we always maintain the maximum PERM LSN already in lp->max_perm_lsn.

Local State Management

Each client must maintain a db_timespec timestamp containing the expiration of its granted lease. This field will be in the replication shared memory structure:

db_timespec grant_expire;

This timestamp already takes into account the clock skew. All new fields must be initialized when the region is created. Whenever we grant our master lease and want to send the REP_LEASE_GRANT message, this value will be updated. It will be used in the following way:

db_timespec mytime;
DB_LSN perm_lsn;
DBT lease_dbt;
REP_GRANT_INFO gi;

timespecclear(&mytime);
memset(&lease_dbt, 0, sizeof(lease_dbt));
memset(&gi, 0, sizeof(gi));
/* Candidate expiration: now plus the skew-adjusted lease duration. */
__os_gettime(dbenv, &mytime);
timespecadd(&mytime, &rep->lease_duration);
MUTEX_LOCK(rep->clientdb_mutex);
perm_lsn = lp->max_perm_lsn;
MUTEX_UNLOCK(rep->clientdb_mutex);
REP_SYSTEM_LOCK(dbenv);
/* Only ever extend the grant; never shorten one already given. */
if (timespeccmp(&mytime, &rep->grant_expire, >))
	rep->grant_expire = mytime;
/* Echo the master's timestamp back as the unique message ID. */
gi.msg_time = msg->msg_time;
#ifdef DIAGNOSTIC
gi.expire_time = rep->grant_expire;
#endif
lease_dbt.data = &gi;
lease_dbt.size = sizeof(gi);
REP_SYSTEM_UNLOCK(dbenv);
__rep_send_message(dbenv, eid, REP_LEASE_GRANT, &perm_lsn, &lease_dbt, 0, 0);

This updating of the lease grant will occur in the PERM code path when we have successfully applied the permanent record.

Maintaining Leases on the Master/Rep_start

The master maintains a lease table that it checks when fulfilling a read request that is subject to leases. This table is initialized when a site calls dbenv->rep_start(DB_MASTER) and the site is undergoing a role change (i.e. a master making additional calls to dbenv->rep_start(DB_MASTER) does not affect an already existing table).

When a non-master site becomes master, two things related to leases happen on the role change. First, a client cannot upgrade to master while it has an outstanding lease granted to another site. If a client attempts to do so, an error, EINVAL, will be returned. The only way this should happen is if the application simply declares a site master, instead of using elections. Elections will already wait for leases to expire before proceeding. (See below.)

Second, once we are proceeding with becoming a master, the site must allocate the table it will use to maintain lease information. This table will be sized based on nsites and it will be an array of the following structure:

typedef struct {
	int eid;			/* EID of client site. */
	db_timespec start_time;	/* Unique time ID client echoes back on grants. */
	db_timespec end_time;	/* Master's lease expiration time. */
	DB_LSN lease_lsn;	/* Durable LSN this lease applies to. */
	u_int32_t flags;	/* Unused for now?? */
} REP_LEASE_ENTRY;

Granting Leases

It is the burden of the application to make sure that all sites in the group are using leases, or none are. Therefore, when a client processes a PERM log record that arrived from the master, it will grant its lease automatically if that record is permanent (i.e. DB_REP_ISPERM is being returned) and leases are configured. A client will not send a lease grant when it is processing log records (even PERM ones) it receives from other clients that use client-to-client synchronization. The reason is that the master requires a unique time-of-msg ID (see below) that the client echoes back in its lease grant, and it will not have such an ID from another client.

The master stores a time-of-msg ID in each message and the client simply echoes it back to the master. In its lease table, the master keeps the base time-of-msg for each valid lease. When a REP_LEASE_GRANT message comes in, the master does a number of things:

Pulls the echoed timespec from the client message into msg_time.
Finds the entry in its lease table for the client's EID. It walks the table searching for the ID. EIDs of DB_EID_INVALID are illegal. Either the master will find the entry, or it will find an empty slot in the table (i.e. it is still populating the table with leases).
If this is a previously unknown site lease, the master initializes the entry by copying to the eid, start_time, and lease_lsn fields. The master also computes the end_time based on the adjusted rep->lease_duration.
If this is a lease from a previously known site, the master must perform timespeccmp(&msg_time, &table[i].start_time, >) and only update the end_time of the lease when this is a more recent message. If it is a more recent message, then we should update the lease_lsn to the LSN in the message.
Lease durations are computed taking the clock skew into account: clients compute them based on the current time and the master computes them based on the original sending time. For diagnostic purposes only, I also plan to send the client's expiration time. The client errs on the side of computing a larger lease expiration time and the master errs on the side of computing a smaller duration. Since both are taking the clock skew into account, the client's expiration time should never be earlier than the master's computed expiration time; if it is, their value for clock skew may not be correct.

Any log records (new or resent) that originate from the master and result in DB_REP_ISPERM get an ack.

Refreshing Leases

Leases get refreshed when a master receives a REP_LEASE_GRANT message from a client. There are three pieces to lease refreshment.

Lazy Lease Refreshing on Read
+

If the master discovers that leases are expired during the read operation, it attempts to refresh its collection of lease grants. It does this by calling a new function __rep_lease_refresh. This function is very similar to the already-existing function __rep_flush. Basically, to refresh the lease, the master simply needs to resend the last PERM record to the clients. The requirements state that when the application send function returns successfully from sending a PERM record, the majority of clients have that PERM LSN durable. We will have a new public DB error return called DB_REP_LEASE_EXPIRED that will be returned back to the caller if the master cannot assert its authority. The code will look something like this:

/*
 * Use lp->max_perm_lsn on the master (currently not used on the master)
 * to keep track of the last PERM record written through the logging system.
 * need to initialize lp->max_perm_lsn in rep_start on role_chg.
 */
call __rep_send_message on the last PERM record the master wrote, with DB_REP_PERMANENT
if failure
	expire leases
	return lease expired error to caller
else /* success */
	recheck lease table
	/*
	 * We need to recheck the lease table because the client
	 * lease grant messages may not be processed yet, or got
	 * lost, or racing with the application's ACK messages or
	 * whatever. 
	 */
	if we have a majority of valid leases
		return success
	else
		return lease expired error to caller

Ongoing Update Refreshment
+

Second is having the master indicate to the client it needs to send a lease grant in response to the current PERM log message. The problem is that acknowledgements must contain a master-supplied message timestamp that the client sends back to the master. We need to modify the structure of the log record messages when leases are configured so that when a PERM message is sent, the master sends, and the client expects, the message timestamp. There are three fairly straightforward and different implementations to consider.

1. Adding the timestamp to the REP_CONTROL structure. If this option is chosen, then the code trivially sends back the timestamp in the client's reply. There is no special processing done by either side with the message contents. So, on a PERM log record, the master will send a non-zero timestamp. On a normal log record the timestamp will be zero or some known invalid value. If the client sees a non-zero timestamp, it sends a REP_LEASE_GRANT with the lp->max_perm_lsn after applying that log record. If it is zero, then the client does nothing different. The advantage is ease of code. The disadvantage is that for mixed version systems, the client is now dealing with different sized control structures. We would have to retain the old control structure so that during a mixed version group the (upgraded) clients can use, expect and send old control structures to the master. This is unfortunate, so let's consider additional implementations that don't require modifying the control structure.
2. Adding a new REPCTL_LEASE flag to the list of flags for the control structure, but do not change the control structure fields. When a master wants to send a message that needs a lease ack, it sets the flag. Additionally, instead of simply sending a log record DBT as the rec parameter for replication, we would send a new structure that had the timestamp first and then the record (similar to the bulk transfer buffer). The advantage of this is that the control structure does not change. Disadvantages include more special-cased code in the normal code path where we have to check the flag. If the flag is set we have to extract the timestamp value and massage the incoming data to pass on the real log record to rep_apply. On bulk transfer, we would just add the timestamp into the buffer. On normal transfers, it would incur an additional data copy on the master side. That is unfortunate. Additionally, if this record needs to be stored in the temp db, we need some way to get it back again later or rep_apply would have to extract the timestamp out when it processed the record (either live or from the temp db).
3. Adding a different message type, such as REP_LOG_ACK. Similarly to REP_LOG_MORE this message would be a special-case version of a log record. We would extract out the timestamp and then handle as a normal log record. This implementation is rejected because it actually would require three new message types: REP_LOG_ACK, REP_LOG_ACK_MORE, REP_BULK_LOG_ACK. That is just too ugly to contemplate.

[Slight digression: it occurs to me while writing about #2 and #3 above that all of the *_MORE messages could really be implemented with a REPCTL_MORE flag instead of a separate message type. We should clean that up and simplify the messages, but not as part of master leases. Hmm, taking that thought process further, we really could get rid of the REP_BULK_* messages as well if we added a REPCTL_BULK flag. I think we should definitely do it for the *_MORE messages. I am not sure we should do it for bulk because the structure of the incoming data record is vastly different.]

Of these options, I believe that modifying the control structure is the best alternative. The handling of the old structure will be very isolated to code dealing with old versions and is far less complicated than injecting the timestamp into the log record DBT and doing a data copy. Actually, I will likely combine #1 and the flag from #2 above. I will have the REPCTL_LEASE flag that indicates a lease grant reply is expected and have the timestamp in the control structure. Also, I will probably add a spare field or two for future use in the REP_CONTROL structure.

Gap processing

No matter which implementation we choose for ongoing lease refreshment, gap processing must be considered. The code above assumes the timestamps will be placed on PERM records only. Normal log records will not have a timestamp, nor a flag or anything else like that. However, any log message can fill a gap on a client, and processing that normal log record can then return DB_REP_ISPERM because later records were also processed.

The current implementation should work fine in that case because when we store the message in the client temp db we store both the control DBT and the record DBT. Therefore, when a normal record fills a gap, the later PERM record, when retrieved, will look just like it did when it arrived. The client will have access to the LSN, the timestamp, etc. However, it does mean that sending the REP_LEASE_GRANT message must take place down in __rep_apply because that is the only place we have access to the contents of those stored records with the timestamps.

There are two logical choices to consider for granting the lease when processing an update. First, as we process a PERM message (either a live record or one read from the temp db after filling a gap), we send the REP_LEASE_GRANT message for each PERM record we successfully apply. Or, second, we keep track of the largest timestamp of all PERM records we've processed and at the end of the function, after we've applied all records, we send back a single lease grant with the max_perm_lsn and a new max_lease_timestamp value to the master. The first is easier to implement; the second results in possibly slightly fewer messages at the expense of more bookkeeping on the client.
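The bookkeeping for the second choice might look roughly like the sketch below. The types applied_rec_t and pending_grant_t and the function collect_grant are hypothetical and use plain integers in place of db_timespec and DB_LSN; the sketch only shows the "track the maximum, send once" idea, not how __rep_apply would actually be structured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified summary of one applied record; not a DB structure. */
typedef struct {
	int is_perm;		/* Non-zero if this was a PERM record. */
	int64_t msg_time;	/* Master-supplied timestamp (microseconds). */
	uint64_t lsn;		/* Flattened stand-in for a DB_LSN. */
} applied_rec_t;

/* What the client would send back after one pass over the records. */
typedef struct {
	int send_grant;		/* Zero if no PERM record was applied. */
	int64_t max_lease_timestamp;
	uint64_t max_perm_lsn;
} pending_grant_t;

/* Fold a batch of applied records into at most one lease grant. */
static pending_grant_t
collect_grant(const applied_rec_t *recs, size_t nrecs)
{
	pending_grant_t g = { 0, 0, 0 };
	size_t i;

	assert(recs != NULL || nrecs == 0);
	for (i = 0; i < nrecs; i++) {
		if (!recs[i].is_perm)
			continue;	/* Only PERM records carry a timestamp. */
		if (!g.send_grant || recs[i].msg_time > g.max_lease_timestamp)
			g.max_lease_timestamp = recs[i].msg_time;
		if (!g.send_grant || recs[i].lsn > g.max_perm_lsn)
			g.max_perm_lsn = recs[i].lsn;
		g.send_grant = 1;
	}
	return (g);
}
```

A batch containing PERM records stamped 10 and 12 would then produce one grant carrying timestamp 12, rather than two separate REP_LEASE_GRANT messages.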
+
A third, more complicated option would be to have the message timestamp on all records, but grants are only sent on the PERM messages. A reason to do this is that the later timestamp of a normal log record would be used as the timestamp sent in the reply and the master would get a more up to date timestamp value and a longer lease.

If we change the REP_CONTROL structure to include the timestamp, we potentially break or at least need to revisit the gap processing algorithm. That code assumes that the control and record elements for the same LSN look the same each and every time. The code stores the control DBT as the key and the rec DBT as the data. We use a specialized compare function to sort based on the LSN in the control DBT. With master leases, the same record transmitted multiple times by a master, or by a client, for the same LSN will be different because the timestamp field will not be the same. Therefore, the client will end up with duplicate entries in the temp database for the same LSN. Both solutions (adding the timestamp to REP_CONTROL and adding a REPCTL_LEASE flag) can yield duplicate entries. The flag would cause the same record from the master and client to be different as well.

Handling Incoming Lease Grants
+

The third piece of lease management is handling the incoming REP_LEASE_GRANT message on the master. When this message is received, the master must do the following:

REP_SYSTEM_LOCK
msg_timestamp = cntrl->timestamp;
client_lease = __rep_lease_entry(dbenv, client eid)
if (client_lease == NULL) {
	initial lease for this site, DB_ASSERT there is space in the table
	add this to the table if there is space
} else {
	compare msg_timestamp with client_lease->start_time
	if (msg_timestamp is more recent && msg_lsn >= lease LSN)
		update entry in table
}
REP_SYSTEM_UNLOCK
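As a rough C sketch of the table update above: the names `struct lease_entry`, `lease_find`, `lease_grant` and `EID_INVALID` are hypothetical, timestamps are assumed to be integers in some common unit, and the REP_SYSTEM_LOCK/UNLOCK pair is left out because the caller is assumed to hold it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EID_INVALID (-1)	/* Hypothetical marker for an unused slot. */

struct lsn { uint32_t file; uint32_t offset; };

struct lease_entry {
	int        eid;		/* Client environment ID. */
	uint64_t   start_time;	/* Master timestamp echoed by the client. */
	uint64_t   end_time;	/* start_time plus the lease timeout. */
	struct lsn lease_lsn;	/* Latest PERM LSN covered by the grant. */
};

static int
lsn_cmp(const struct lsn *a, const struct lsn *b)
{
	if (a->file != b->file)
		return (a->file < b->file ? -1 : 1);
	if (a->offset != b->offset)
		return (a->offset < b->offset ? -1 : 1);
	return (0);
}

/* Linear search for the slot owned by eid; NULL if there is none. */
static struct lease_entry *
lease_find(struct lease_entry *tab, size_t n, int eid)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tab[i].eid == eid)
			return (&tab[i]);
	return (NULL);
}

/* Returns 1 if the table changed, 0 if the grant was stale. */
static int
lease_grant(struct lease_entry *tab, size_t n, int eid,
    uint64_t msg_time, struct lsn msg_lsn, uint64_t timeout)
{
	struct lease_entry *e;

	if ((e = lease_find(tab, n, eid)) == NULL) {
		/* First grant from this site: claim a free slot. */
		e = lease_find(tab, n, EID_INVALID);
		assert(e != NULL);	/* Table is assumed sized for nsites. */
		e->eid = eid;
	} else if (msg_time <= e->start_time ||
	    lsn_cmp(&msg_lsn, &e->lease_lsn) < 0)
		return (0);	/* Older or reordered grant: ignore it. */
	e->start_time = msg_time;
	e->end_time = msg_time + timeout;
	e->lease_lsn = msg_lsn;
	return (1);
}
```

Ignoring a grant with an older timestamp or a smaller LSN keeps a delayed message from shortening a lease that a newer grant already extended.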

Expiring Leases

+Leases can expire in two ways. First they can expire naturally +due to the passage of time. When checking leases, if the current +time is later than the lease entry's end_time +then the lease is expired. Second, they can be forced with a +premature expiration when the application's transport function returns +an error. In the first case, there is nothing to do, in the +second case we need to manipulate the end_time +so that all future lease checks fail. Since the lease start_time +is guaranteed to not be in the future we will have a function __rep_lease_expire +that will:
+

REP_SYSTEM_LOCK
for each entry in the lease table
	entry->end_time = entry->start_time;
REP_SYSTEM_UNLOCK
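A small C sketch of the forced expiration, under the assumption that timestamps are integers and that the validity check is the same `end_time >= now` comparison used when checking leases. `struct lease_entry`, `lease_is_current` and `lease_expire_all` are hypothetical names, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct lease_entry {
	uint64_t start_time;	/* When the grant began (never in the future). */
	uint64_t end_time;	/* When the grant runs out. */
};

/* A lease counts as current while its end time has not passed. */
static int
lease_is_current(const struct lease_entry *e, uint64_t now)
{
	return (e->end_time >= now);
}

/* Force every lease to look expired by collapsing it to zero length. */
static void
lease_expire_all(struct lease_entry *tab, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		tab[i].end_time = tab[i].start_time;
}
```

Note that with a `>=` comparison this only fails the check when `start_time` is strictly earlier than the time of the check, which is what the sketch assumes.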

+Is there a potential race or problem with prematurely expiring +leases? Consider an application that enforces an ALL +acknowledgement policy for PERM records in its transport +callback. There are four clients and three send the PERM ack to +the application. The callback returns an error to the master DB +code. The DB code will now prematurely expire its leases. +However, at approximately the same time the three clients are also +sending their REP_LEASE_GRANT +messages to the master. There is a race between the master +processing those messages and the thread handling the callback failure +expiring the table. This is only an issue if the messages arrive +after the table has been expired.
+
+Let's assume all three clients send their grants after the master expires the table. If we accept those grants and then a read occurs, the read will succeed since the master has a majority of leases even though the callback failed earlier. Is that a problem? The lease code is using a majority and the application policy is using some other value. It feels like this should be okay since the data is held by leases on a majority. Should we consider having the lease checking threshold be the same as the permanent ack policy? That is difficult because Base API users implement whatever they want and DB does not know what it is.
+

Checking Leases

+When a read operation on the master completes, the last thing we need +to do is verify the master leases. We've already discussed +refreshing them when they are expired above. We need two things +for a lease to be valid. It must be within the timeframe of the +lease grant and the lease must be valid for the last PERM record +LSN. Here is the logic +for checking the validity of leases in __rep_lease_check:
+

#define MAX_REFRESH_TRIES	3
DB_LSN lease_lsn;
REP_LEASE_ENTRY *entry;
u_int32_t min_leases, valid_leases;
db_timespec cur_time;
int ret, tries;

	tries = 0;
retry:
	ret = 0;
	LOG_SYSTEM_LOCK
	lease_lsn = lp->lsn;
	LOG_SYSTEM_UNLOCK
	REP_SYSTEM_LOCK
	min_leases = rep->nsites / 2;
	__os_gettime(dbenv, &cur_time);
	for (entry = head of table, valid_leases = 0; entry != NULL && valid_leases < min_leases; entry++)
		if (timespec_cmp(&entry->end_time, &cur_time) >= 0 && log_compare(&entry->lsn, &lease_lsn) == 0)
			valid_leases++;
	REP_SYSTEM_UNLOCK
	if (valid_leases < min_leases) {
		ret =__rep_lease_refresh(dbenv, ...);
		/*
		 * If we are successful, we need to recheck the leases because 
		 * the lease grant messages may have raced with the PERM
		 * acknowledgement.  Give those messages a chance to arrive.
		 */
		if (ret == 0) {
			if (tries <= MAX_REFRESH_TRIES) {
				/*
				 * If we were successful sending, but not successful in racing the
				 * message thread, yield the processor so that message
				 * threads may have a chance to run.
				 */
				if (tries > 0)
					/* __os_sleep instead?? */
					__os_yield();
				tries++;
				goto retry;
			} else
				ret = DB_REP_LEASE_EXPIRED;
		}
	}
	return (ret);

+If the master has enough valid leases it returns success. If it does not have enough, it attempts to refresh them. This attempt may fail if sending the PERM record does not receive sufficient acks. If we do receive sufficient acknowledgements we may still find that scheduling of message threads means the master hasn't yet processed the incoming REP_LEASE_GRANT messages. We will retry a couple of times (possibly parameterized) if the master discovers that situation.
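The counting step in the middle of that logic, on its own, might look like the following C sketch. It assumes integer timestamps and a hypothetical `struct lease_entry` with one slot per client. The master counts itself, so `nsites / 2` client leases give it a majority. The LSN test uses equality, mirroring the pseudocode above. Refresh and retry are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct lsn { uint32_t file; uint32_t offset; };

struct lease_entry {
	int        eid;		/* Client environment ID. */
	uint64_t   end_time;	/* When this client's grant runs out. */
	struct lsn lease_lsn;	/* PERM LSN the grant covers. */
};

static int
lsn_cmp(const struct lsn *a, const struct lsn *b)
{
	if (a->file != b->file)
		return (a->file < b->file ? -1 : 1);
	if (a->offset != b->offset)
		return (a->offset < b->offset ? -1 : 1);
	return (0);
}

/*
 * Returns 1 if enough client leases are unexpired and cover the last
 * PERM LSN, 0 otherwise.  The scan stops as soon as the threshold is met.
 */
static int
leases_valid(const struct lease_entry *tab, size_t n, uint32_t nsites,
    uint64_t now, struct lsn last_perm)
{
	uint32_t min_leases, valid;
	size_t i;

	min_leases = nsites / 2;
	valid = 0;
	for (i = 0; i < n && valid < min_leases; i++)
		if (tab[i].end_time >= now &&
		    lsn_cmp(&tab[i].lease_lsn, &last_perm) == 0)
			valid++;
	return (valid >= min_leases);
}
```

For a group of 5 sites (master plus 4 clients), two current client leases at the last PERM LSN are enough. A lease that is unexpired but behind on LSN does not count.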
+

Elections

+When a client grants a lease to a master, it gives up the right to +participate in an election until that grant expires. If we are +the master and dbenv->rep_elect +is called, it should return, no matter what, like it does today. +If we are a client and rep_elect +is called special processing takes place when leases are in +effect. First, the easy case is if the lease granted by this +client has already expired, then the client goes directly into the +election as normal. If a valid lease grant is outstanding to a +master, this site cannot participate in an election until that grant +expires. We have at least two options when a site calls the dbenv->rep_elect +API while +leases are in effect.
+

The simplest coding solution for DB would be simply to refuse to +participate in the election if this site has a current lease granted to +a master. We would detect this situation and return EINVAL. +This is correct behavior and trivial to implement. The +disadvantage of this solution is that the application would then be +responsible for repeatedly attempting an election until the lease grant +expired.
+
The more satisfying solution is for DB to wait the remaining time +for the grant. If this client hears from the master during that +time the election does not take place and the call to rep_elect +returns with the +information for the current/old master.

Election Code Changes

+The code changes to support leases in the election code are fairly +isolated. First if leases are configured, we must verify the nsites +parameter is set to 0. +Second, in __rep_elect_init +we must not overwrite the value of rep->nsites +for leases because it is controlled by the dbenv->rep_set_nsites +API. +These changes are small and easy to understand.
+
+The more complicated code will be the client code when it has an outstanding lease granted. The client will wait for the current lease grant to expire before proceeding with the election. The client will only do so if it does not hear from the master for the remainder of the lease grant time. If the client hears from the master, it returns and does not begin participating in the election. A new election phase, REP_EPHASE0, will exist so that the call to __rep_wait can detect if a master responds. The client, while waiting for the lease grant to expire, will send a REP_MASTER_REQ message so that the master will respond with a REP_NEWMASTER message and thus allow the client to know the master exists. However, when the master replies to the client, it is also desirable for the master to have the client update its lease grant.
+
+Recall that the REP_NEWMASTER +message does not result in a lease grant from the client. The +client responds when it processes a PERM record that has the REPCTL_LEASE +flag set in the message +with its lease grant up to the given LSN. Therefore, we want the +client's REP_MASTER_REQ to +yield both the discovery of the existing master and have the master +refresh its leases. The client will also use the REPCTL_LEASE +flag in its REP_MASTER_REQ message to the +master. This flag will serve as the indicator to the master that +it needs to deal with leases and both send the REP_NEWMASTER +message and refresh +the lease.
+The code will work as follows:
+

if (leases_configured && (my_grant_still_valid || lease_never_granted)) {
	if (lease_never_granted)
		wait_time = lease_timeout
	else
		wait_time = grant_expiration - current_time
	F_SET(REP_F_EPHASE0);
	__rep_send_message(..., REP_MASTER_REQ, ... REPCTL_LEASE);
	ret = __rep_wait(..., REP_F_EPHASE0);
	if (we found a master)
		return
} /* if we don't return, fall out and proceed with election */
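The wait-time decision in the code above can be isolated as a small C function. This is a sketch only: `election_wait` and its parameters are hypothetical, times are assumed to be integers in one common unit, and a return of 0 means the client proceeds straight to the election.

```c
#include <assert.h>
#include <stdint.h>

/*
 * How long a client should wait in the hypothetical "phase 0" before
 * joining an election.  Returns 0 when no wait is needed.
 */
static uint64_t
election_wait(int leases_configured, int lease_ever_granted,
    uint64_t grant_expiration, uint64_t now, uint64_t lease_timeout)
{
	if (!leases_configured)
		return (0);
	/* Never granted: treat it as a full outstanding lease. */
	if (!lease_ever_granted)
		return (lease_timeout);
	/* Grant still valid: wait only for the remainder. */
	if (grant_expiration > now)
		return (grant_expiration - now);
	/* Grant already expired: go directly into the election. */
	return (0);
}
```

Checking `grant_expiration > now` before subtracting avoids unsigned wraparound when the grant has already expired.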

+On the master side, the code handling the REP_MASTER_REQ will +do:
+

if (I am master) {
	...
	__rep_send_message(REP_NEWMASTER...)
	if (F_ISSET(rp, REPCTL_LEASE))
		__rep_lease_refresh(...)
}

+Other minor implementation details are that __rep_elect_done +must also clear +the REP_F_EPHASE0 flag. +We also, obviously, need to define REP_F_EPHASE0 +in the list of replication flags. Note that the client's call to __rep_wait +will return upon +receiving the REP_NEWMASTER +message. The client will independently refresh its lease when it +receives the log record from the master's call to refresh the lease.
+
+Again, similar to what I suggested above, the code could simply assume +global leases are configured, and instead of having the REPCTL_LEASE +flag at all, the master +assumes that it needs to refresh leases because it has them configured, +not because it is specified in the REP_MASTER_REQ +message it is processing. Right now I don't think every possible +REP_MASTER_REQ message should result in a lease grant request.
+

Elections and Quiescent Systems

+It is possible that a master is slow or the client is close to its expiration time, or that the master is quiescent and all leases are currently expired, but nothing much is going on anyway, yet some client calls __rep_elect at that time. In the code above, we will not send the REP_MASTER_REQ because the lease is not valid. The client will simply proceed directly to sending the REP_VOTE1 message, throwing all other clients into an election. The master is still master and should stay that way. Currently in response to a vote message, a master will broadcast out a REP_NEWMASTER to assert its mastership. That causes the election to complete. However, if desired the master may want to proactively refresh its leases. This situation indicates to me that the master should choose to refresh leases based on configuration, not a flag sent from the client. I believe anytime the master asserts its mastership via sending a REP_NEWMASTER message that I need to add code to proactively refresh leases at that time.
+

Other Implementation Details

Role Changes
+

+When a site changes its role via a call to rep_start in either +direction, we +must take action when leases are configured. There are three +types of role changes that all need changes to deal with leases:
+

A master downgrading to a +client. When a master downgrades to a client, it can do so +immediately after it has proactively expired all existing leases it +holds. This situation is similar to an error from the send +callback, and it effectively cancels all outstanding leases held on +this site. Note that if this master expires its leases, it does +not have any effect on when the clients' lease grants expire on the +client side. The clients must still wait their full expected +grant time.
+
A client upgrading to master. +If a client is upgrading to a master but it has an outstanding lease +granted to another site, the code will return an EINVAL +error. This situation +only arises if the application simply declares this site master. +If a site wins an election then the election itself should have waited +long enough for the granted lease to expire and this state should not +arise then.
A client finding a new master. +When a client discovers a new and different master, via a REP_NEWMASTER +message then the +client cannot accept that new master until its current lease grant +expires. This situation should only occur when a site declares +itself master without an election and that site's lease grant expires +before this client's grant expires. However, it is possible +for this situation to arise +with elections also. If we have 5 sites holding an election and 4 +of those sites have leases expire at about the same time T, and this +site's lease expires at time T+N and the election timeout is < N, +then those 4 sites may hold an election and elect a master without this +site's participation. A client in this situation must call __rep_wait +with the time remaining +on its lease. If the lease is expired after waiting the remaining +time, then the client can accept this new master. If the lease +was refreshed during the waiting period then the client does not accept +this new master and returns.
+

DUPMASTER

+A duplicate master situation can occur if an old master becomes +disconnected from the rest of the group, that group elects a new master +and then the partition is resolved. The requirement for master +leases is that this situation will not cause the newly elected, +rightful master to receive the DB_REP_DUPMASTER +return. It is okay for the old master to get that return +value. When a dual master situation exists, the following will +happen:
+

On the current master and all +current clients - If the current master receives an update +message or other conflicting message from the old master then that +message will be ignored because the generation number is out of date.
On the old master - If +the old master receives an update message from the current master, or +any other message with a later generation from any site, the new +generation number will trigger this site to return DB_REP_DUPMASTER. +However, +instead of broadcasting out the REP_DUPMASTER +message to shoot down others as well, this site, if leases are +configured, will call __rep_lease_check +and if they are expired, return the error. It should be +impossible for us to receive a later generation message and still hold +a majority of master leases. Something is seriously wrong and we +will DB_ASSERT this situation +cannot happen.
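The two cases above reduce to a generation comparison, which a C sketch might express as follows. The names `enum action` and `master_on_message` are hypothetical, `leases_valid` stands for the result of a lease check, and the equal-generation case is simply handed back to normal processing, which is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

enum action {
	ACT_CONTINUE,	/* Same generation: normal handling, not modelled. */
	ACT_IGNORE,	/* Message from an out-of-date master. */
	ACT_DUPMASTER	/* This site is the old master. */
};

/* Decide what a site that believes it is master does with a message. */
static enum action
master_on_message(uint32_t my_gen, uint32_t msg_gen, int leases_valid)
{
	if (msg_gen < my_gen)
		return (ACT_IGNORE);
	if (msg_gen > my_gen) {
		/*
		 * A later generation exists, so a majority must have let
		 * their grants to us expire before electing it.
		 */
		assert(!leases_valid);
		return (ACT_DUPMASTER);
	}
	return (ACT_CONTINUE);
}
```

The assertion plays the role of the DB_ASSERT mentioned above: holding a majority of leases while seeing a later generation should be impossible.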
+

Client to Client Synchronization

+One question to ask is how lease grants interact with client-to-client +synchronization. The only answer is that they do not. A client +that is sending log records to another client cannot request the +receiving client refresh its lease with the master. That client +does not have a timestamp it can use for the master and clock skew +makes it meaningless between machines. Therefore, sites that use +client-to-client synchronization will likely see more lease refreshment +during the read path and leases will be refreshed during live updates +only. Of course, if a client supplies log records that fill a +gap, and the later log records stored came from the master in a live +update then the client will respond as per the discussion on Gap +Processing above.
+

Interaction Matrix

+If leases are granted (by a client) or held (by a master) what should +the following APIs and messages do?
+
+Other:
+log_archive: Leases do not affect log_archive. OK.
+dbenv->close: OK.
+crash during lease grant and restart: Potential +problem here. See discussion below.
+
+Rep Base API method:
+rep_elect: Already discussed above. Must wait for lease to expire.
+rep_flush: Master only, OK - this will be the basis for refreshing +leases.
+rep_get_*: Not affected by leases.
+rep_process_message: Generally OK. We'll discuss each message +below.
+rep_set_config: OK.
+rep_set_limit: OK
+rep_set_nsites: Must be called before rep_start +and nsites is immutable until +14778 is resolved.
+rep_set_priority: OK
+rep_set_timeout: OK. Used to set lease timeout.
+rep_set_transport: OK.
+rep_start(MASTER): Role changes are discussed above. Make sure +duplicate rep_start calls are no-ops for leases.
+rep_start(CLIENT): Role changes are discussed above. Make sure +duplicate calls are no-ops for leases.
+rep_stat: OK.
+rep_sync: Should not be able to happen. Client cannot accept new +master with outstanding lease grant. Add DB_ASSERT here.
+
+REP_ALIVE: OK.
+REP_ALIVE_REQ: OK.
+REP_ALL_REQ: OK.
+REP_BULK_LOG: OK. Clients check to send ACK.
+REP_BULK_PAGE: Should never process one with lease granted. Add +DB_ASSERT.
+REP_DUPMASTER: Should never happen, this is what leases are supposed to +prevent. See above.
+REP_LOG: OK. Clients check to send ACK.
+REP_LOG_MORE: OK. Clients check to send ACK.
+REP_LOG_REQ: OK.
+REP_MASTER_REQ: OK.
+REP_NEWCLIENT: OK.
+REP_NEWFILE: OK. Clients check to send ACK.
+REP_NEWMASTER: See above.
+REP_NEWSITE: OK.
+REP_PAGE: OK. Should never process one with lease granted. +Add DB_ASSERT.
+REP_PAGE_FAIL: OK. Should never process one with lease +granted. Add DB_ASSERT.
+REP_PAGE_MORE: OK. Should never process one with lease +granted. Add DB_ASSERT.
+REP_PAGE_REQ: OK.
+REP_REREQUEST: OK.
+REP_UPDATE: OK. Should never process one with lease +granted. Add DB_ASSERT.
+REP_UPDATE_REQ: OK. This is a master-only message.
+REP_VERIFY: OK. Should never process one with lease +granted. Add DB_ASSERT.
+REP_VERIFY_FAIL: OK. Should never process one with lease +granted. Add DB_ASSERT.
+REP_VERIFY_REQ: OK.
+REP_VOTE1: OK. See Election discussion above. It is +possible to receive one with a lease granted. Client cannot send +one with an outstanding lease however.
+REP_VOTE2: OK. See Election discussion above. It is +possible to receive one with a lease granted.
+
+If the following method or message processing is in progress and a +client wants to grant a lease, what should it do? Let's examine +what this means. The client wanting to grant a lease simply means +it is responding to the receipt of a REP_LOG +(or its variants) message and applying a log record. Therefore, +we need to consider a thread processing a log message racing with these +other actions.
+
+Other:
+log_archive: OK.
+dbenv->close: User error. User should not be closing the env +while other threads are using that handle. Should have no effect +if a 2nd dbenv handle to same env is closed.
+
+Rep Base API method:
+rep_elect: See Election discussion above. rep_elect +should wait and may grant +lease while election is in progress.
+rep_flush: Should not be called on client.
+rep_get_*: OK.
+rep_process_message: Generally OK. See handling each message +below.
+rep_set_config: OK.
+rep_set_limit: OK.
+rep_set_nsites: Must be called before rep_start +until 14778 is resolved.
+rep_set_priority: OK.
+rep_set_timeout: OK.
+rep_set_transport: OK.
+rep_start(MASTER): OK, can't happen - already protect racing rep_start +and rep_process_message.
+rep_start(CLIENT): OK, can't happen - already protect racing rep_start +and rep_process_message.
+rep_stat: OK.
+rep_sync: Shouldn't happen because client cannot grant leases during +sync-up. Incoming log message ignored.
+
+REP_ALIVE: OK.
+REP_ALIVE_REQ: OK.
+REP_ALL_REQ: OK.
+REP_BULK_LOG: OK.
+REP_BULK_PAGE: OK. Incoming log message ignored during internal +init.
+REP_DUPMASTER: Shouldn't happen. See DUPMASTER discussion above.
+REP_LOG: OK.
+REP_LOG_MORE: OK.
+REP_LOG_REQ: OK.
+REP_MASTER_REQ: OK.
+REP_NEWCLIENT: OK.
+REP_NEWFILE: OK.
+REP_NEWMASTER: See above. If a client accepts a new master +because its lease grant expired, then that master sends a message +requesting the lease grant, this client will not process the log record +if it is in sync-up recovery, or it may after the master switch is +complete and the client doesn't need sync-up recovery. Basically, +just uses existing log record processing/newmaster infrastructure.
+REP_NEWSITE: OK.
+REP_PAGE: OK. Receiving a log record during internal init PAGE +phase should ignore log record.
+REP_PAGE_FAIL: OK.
+REP_PAGE_MORE: OK.
+REP_PAGE_REQ: OK.
+REP_REREQUEST: OK.
+REP_UPDATE: OK. Receiving a log record during internal init +should ignore log record.
+REP_UPDATE_REQ: OK - master-only message.
+REP_VERIFY: OK. Receiving a log record during verify phase +ignores log record.
+REP_VERIFY_FAIL: OK.
+REP_VERIFY_REQ: OK.
+REP_VOTE1: OK. This client is processing someone else's vote when +the lease request comes in. That is fine. We protect our +own election and lease interaction in __rep_elect.
+REP_VOTE2: OK.
+

Crashing - Potential Problem
+

+It appears there is one area where we could have a problem. I +believe that crashes can cause us to break our guarantee on durability, +authoritative reads and inability to elect duplicate masters. +Consider this scenario:
+

A master and 4 clients are all up and running.
The master commits a txn and all 4 clients refresh their lease +grants at time T.
All 4 clients have the txn and log records in the cache. +None are flushing to disk.
All 4 clients have responded to the PERM messages as well as +refreshed their lease with the master.
All 4 clients hit the same application coding error and crash +(machine/OS stays up).
Master authoritatively reads data in txn from step 2.
All 4 clients restart the application and run recovery, thus the txn from step 2 is lost on all clients because it isn't in any logs.
+
A network partition happens and the master is alone on its side.
All 4 clients are on the other side and elect a new master.
Partition resolves itself and we have duplicate masters, where +the former master still holds all valid lease grants.
+

+Therefore, we have broken both guarantees. In step 6 the data is +really not durable and we've given it to the user. One can argue +that if this is an issue the application better be syncing somewhere if +they really want durability. However, worse than that is that we +have a legitimate DUPMASTER situation in step 10 where both masters +hold valid leases. The reason is that all lease knowledge is in +the shared memory and that is lost when the app restarts and runs +recovery.
+
+How can we solve this? The obvious solution is (ugh, yet another) +durable BDB-owned file with some information in it, such as the current +lease expiration time so that rebooting after a crash leaves the +knowledge that the lease was granted. However, writing and +syncing every lease grant on every client out to disk is far too +expensive.
+
+A second possible solution is to have clients wait a full lease timeout +before entering an election the first time. This solution solves the +DUPMASTER issue, but not the non-authoritative read. This +solution naturally falls out of elections and leases really. If a +client has never granted a lease, it should be considered as having to +wait a full lease timeout before entering an election. +Applications already know that leases impact elections and this does +not seem so bad as it is only on the first election.
+
+Is it sufficient to document that the authoritative read is only as +authoritative as the durability guarantees they make on the sites that +indicate it is permanent? Yes, I believe this is sufficient. If +the application says it is permanent and it really isn't, then the +application is at fault. Believing the application when it +indicates with the PERM response that it is permanent avoids the +authoritative problem.
+

Upgrade/Mixed Versions

+Clearly leases cannot be used with mixed version sites since masters +running older releases will not have any knowledge of lease +support. What considerations are needed in the lease code for +mixed versions?
+
+First if the REP_CONTROL +structure changes, we need to maintain and use an old version of the +structure for talking to older clients and masters. The +implementation of this would be similar to the way we manage for old REP_VOTE_INFO +structures. +Second any new messages need translation table entries added. +Third, if we are assuming global leases then clearly any mixed versions +cannot have leases configured, and leases cannot be used in mixed +version groups. Maintaining two versions of the control structure +is not necessary if we choose a different style of implementation and +don't change the control structure.
+
+However, then how could an old application both run continuously, upgrade to the new release and take advantage of leases without taking down the entire application? I believe it is possible for clients to be configured for leases but be subject to the master regarding leases, yet the master code can assume that if it has leases configured, all client sites do as well. In several places above I suggested that a client could make a choice based on either a new REPCTL_LEASE flag or simply having leases turned on locally. If we choose to use the flag, then we can support leases with mixed versions. The upgraded clients can configure leases and they simply will not be granted until the old master is upgraded and sends a PERM message with the flag indicating it wants a lease grant. The client will not grant a lease until such time. The clients, while having the leases configured, will not grant a lease until told to do so and will simply have an expired lease. Then, when the old master finally upgrades, it too can configure leases and suddenly all sites are using them. I believe this should work just fine and I will need to make sure a client's granting of leases is only in response to the master asking for a grant. If the master never asks, then the client has them configured, but doesn't grant them.
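The flag-driven choice can be sketched in C as a single predicate. The bit values `CTL_PERM` and `CTL_LEASE` and the function `should_grant` are hypothetical and stand in for the PERM and REPCTL_LEASE control flags. The real flag values are not assumed here.

```c
#include <assert.h>
#include <stdint.h>

#define CTL_PERM	0x1	/* Hypothetical bit: record is a PERM record. */
#define CTL_LEASE	0x2	/* Hypothetical bit: master asks for a grant. */

/*
 * A client grants a lease only when it has leases configured AND the
 * master asked for one on a PERM record.  An old master never sets the
 * lease bit, so an upgraded client never grants to it.
 */
static int
should_grant(int leases_configured, uint32_t flags)
{
	return (leases_configured &&
	    (flags & CTL_PERM) != 0 && (flags & CTL_LEASE) != 0);
}
```

With this shape, upgrading the clients first is harmless: they keep an expired lease until the upgraded master starts setting the flag.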
+

Testing

+Clearly any user-facing API changes will need the equivalent reflection +in the Tcl API for testing, under CONFIG_TEST.
+
+I am sure the list of tests will grow but off the top of my head:
+Basic test: have N sites all configure leases, run some, read on +master, etc.
+Refresh test: Perform update on master, sleep until past expiration, +read on master and make sure leases are refreshed/read successful
+Error test: Test error conditions (reading on client with leases but no +ignore flag, calling after rep_start, etc)
+Read test: Test reading on both client and master both with and without +the IGNORE flag. Test that data read with the ignore flag can be +rolled back.
+Dupmaster test: Force a DUPMASTER situation and verify that the newer +master cannot get DUPMASTER error.
+Election test: Call election while grant is outstanding and master +exists.
+Call election while grant is outstanding and master does not exist.
+Call election after expiration on quiescent system with master existing.
+Run with a group where some members have leases configured and others do not to make sure we get errors instead of dumping core.
+
+