From 54df2afaa61c6a03cbb4a33c9b90fa572b6d07b8 Mon Sep 17 00:00:00 2001 From: Jesse Morgan Date: Sat, 17 Dec 2016 21:28:53 -0800 Subject: Berkeley DB 4.8 with rust build script for linux. --- db-4.8.30/rep/mlease.html | 1197 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1197 insertions(+) create mode 100644 db-4.8.30/rep/mlease.html (limited to 'db-4.8.30/rep/mlease.html') diff --git a/db-4.8.30/rep/mlease.html b/db-4.8.30/rep/mlease.html new file mode 100644 index 0000000..85b0aca --- /dev/null +++ b/db-4.8.30/rep/mlease.html @@ -0,0 +1,1197 @@ + + + + + + Master Lease + + +
+

Master Leases for Berkeley DB

+
+
Susan LoVerso
+sue@sleepycat.com
+Rev 1.1
+2007 Feb 2
+
+


+

+

What are Master Leases?

+A master lease is a mechanism whereby clients grant master-ship rights +to a site, and that master, by holding lease rights, can provide a +guarantee of durability to a replication group for a given period of +time.  By granting a lease to a master, +a client will not participate in an election to elect a new +master until that granted master lease has expired.  By holding a +collection of granted leases, a master will be able to serve +authoritative read requests from applications.  By holding leases, a +read operation on a master can guarantee several things to the +application:
+
    +
  1. Authoritative reads: a guarantee that the data being read by the +application is durable and can never be rolled back.
    +
  2. Freshness: a guarantee that the data being read by the +application at the master is +not stale.
    +
  3. Master viability: a guarantee that a current master with valid +leases will not encounter a duplicate master situation.
    +
+

Requirements

+The requirements of DB to support this include:
+ +The requirements of the +application using leases include:
+ +

There are some open questions +remaining.

+ +

API Changes

+The API changes that are visible +to the user are fairly minimal.  +There are a few API calls they need to make to configure master leases +and then there is the API call to turn them on.  There is also a +new flag to existing APIs to allow read operations to ignore leases and +return data that +may not be durable.
+

Lease Timeout
+

+There is a new timeout the user +must configure for leases called DB_REP_LEASE_TIMEOUT.  +This timeout will be new to +the dbenv->rep_set_timeout method. The DB_REP_LEASE_TIMEOUT +has no default and it is required that the user configure a timeout +before they turn on leases (obviously, this timeout need not be set if +leases will not be used).  That timeout is the amount of time +the lease is valid on the master and how long it is granted +on the client.  This timeout must be the same +value on all sites (like log file size).  The timeout used when +refreshing leases is the DB_REP_ACK_TIMEOUT +for RepMgr applications.  For Base API applications, lease +refreshes will use the same mechanism as PERM messages and they +should +have no additional burden.  This timeout is used for lease +refreshment and is the amount of time a reader will wait to refresh +leases before returning failure to the application from a read +operation.
+
+This timeout will be both stored +with its original value, and also +converted to a db_timespec +using the DB_TIMEOUT_TO_TIMESPEC +macro and have the clock skew accounted for and stored in the shared +rep structure:
+
db_timeout_t lease_timeout;
db_timespec lease_duration;
+NOTE:  By sending the lease refresh during DB operations, we are +forcing/assuming that the operation's process has a replication +transport function set.  That is obviously the case for write +operations, but would it be a burden for read processes (on a +master)?  I think mostly not, but if we need leases for +DB->stat then we need to +document it as it is certainly possible for an application to have a +separate or dedicated stat +application or attempt to use db_stat +(which will not work if leases must be checked).
+
+Leases should be checked after the local operation so that we avoid +the window that would exist if we were to check leases first, get +descheduled, then lose our lease and then perform the operation.  +Do the operation, then check leases before returning to the user.
+

Using Leases

+There is a new API that the user must call to tell the system to use +the lease mechanism.  The method must be called before the +application calls dbenv->rep_start +or dbenv->repmgr_start. +This new +method is:
+
+
    dbenv->rep_set_lease(DB_ENV *dbenv, u_int32_t clock_scale_factor, u_int32_t flags)
+
+The clock_scale_factor +parameter is interpreted as a percentage, greater than 100 (to transmit +a floating point number as an integer to the API) that represents the +maximum skew between any two sites' clocks.  That is, a clock_scale_factor of 150 suggests +that the greatest discrepancy between clocks is that one runs 50% +faster than the others.  Both the +master and client sides +compensate for possible clock skew.  The master uses the value to +compensate in case the replica has a slow clock and replicas compensate +in case they have a fast clock.  This scaling factor will need to +be divided by 100 on all sites to truly represent the percentage for +adjustments made to time values.
+
+Assume the slowest replica's clock is a factor of clock_scale_factor +slower than the +fastest clock.  Using that assumption, if the fastest clock goes +from time t1 to t2 in X +seconds, the slowest clock does it in (clock_scale_factor / 100) +* X seconds.
+
+The flags parameter is not +currently used.
+
+When the dbenv->rep_set_lease +method is called, we will set a configuration flag indicating that +leases are turned on:
+#define REP_C_LEASE <value>.  +We will also record the u_int32_t +clock_skew value passed in.  The rep_set_lease method +will not allow +calls after rep_start.  If +multiple calls are made prior to calling rep_start then later +calls will +overwrite the earlier clock skew value. 
+
+We need a new flag to prevent calling rep_set_lease +after rep_start.  The +simplest solution would be to reject the call to +rep_set_lease  +if +REP_F_CLIENT +or REP_F_MASTER is set.  +However that does not work in the cases where a site cleanly closes its +environment and then opens without running recovery.  The +replication state will still be set.  The prevention will be +implemented as:
+
#define REP_F_START_CALLED <some bit value>
+In __rep_start, at the end:
+
if (ret == 0) {
	REP_SYSTEM_LOCK(dbenv);
	F_SET(rep, REP_F_START_CALLED);
	REP_SYSTEM_UNLOCK(dbenv);
}
+In __rep_env_refresh, if we +are the last reference closing the env (we already check for that):
+
F_CLR(rep, REP_F_START_CALLED);
+In order to avoid run-time floating point operations +on db_timespec structures, +when a site is declared as a client or master in rep_start we +will pre-compute the +lease duration based on the integer-based clock skew and the +integer-based lease timeout.  A master should set a replica's +lease expiration to the start time of +the sent message + +(lease_timeout / clock_scale_factor) in case the replica has a +slow clock.  Replicas extend their leases to received message +time + (lease_timeout * +clock_scale_factor) in case this replica has a fast clock.  +Therefore, the computation will be as follows if the site is becoming a +master:
+
db_timeout_t tmp;
tmp = (db_timeout_t)((double)rep->lease_timeout / ((double)rep->clock_skew / (double)100));
rep->lease_duration = DB_TIMEOUT_TO_TIMESPEC(&tmp);
+Similarly, on a client the computation is:
+
tmp = (db_timeout_t)((double)rep->lease_timeout * ((double)rep->clock_skew / (double)100));
+When a site changes state, its lease duration will change based on +whether it is becoming a master or client and it will be recomputed +from the original values.  Note that these computations, coupled +with the fact that the lease on the master is computed based on the +master's time that it sent the message means that leases on the master +are more conservatively computed than on the clients.
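As a worked example of the two computations above, here is a minimal sketch that mirrors the same arithmetic on plain integers. The type `timeout_us_t` and both function names are hypothetical stand-ins chosen for illustration (they are not DB identifiers), and timeouts are assumed to be in microseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for db_timeout_t: a duration in microseconds. */
typedef uint32_t timeout_us_t;

/* Master side: shrink the lease in case the replica has a slow clock. */
static timeout_us_t
master_lease_duration(timeout_us_t lease_timeout, uint32_t clock_scale_factor)
{
	/* The scale factor is a percentage and must be at least 100. */
	assert(clock_scale_factor >= 100);
	return (timeout_us_t)((double)lease_timeout /
	    ((double)clock_scale_factor / 100.0));
}

/* Client side: stretch the grant in case this replica has a fast clock. */
static timeout_us_t
client_lease_duration(timeout_us_t lease_timeout, uint32_t clock_scale_factor)
{
	assert(clock_scale_factor >= 100);
	return (timeout_us_t)((double)lease_timeout *
	    ((double)clock_scale_factor / 100.0));
}
```

With a one second lease timeout and a clock_scale_factor of 150, the master would treat the lease as valid for 666666 microseconds while the client would honor its grant for 1500000 microseconds, so the master's view always expires first.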
+
+The dbenv->rep_set_lease +method must be called after dbenv->open, +similar to dbenv->rep_set_config.  +The reason is so that we can check that this is a replication +environment and we have access to the replication shared memory region.
+

Read Operations
+

+Authoritative read operations on the master with leases enabled will +abide by leases by default.  We will provide a flag that allows an +operation on a master to ignore leases.  All read operations +on a client imply +ignoring leases. If an application wants authoritative reads +they must forward the read requests to the master and it is the +application's responsibility to provide the forwarding. +The consensus was that forcing DB_IGNORE_LEASE +on client read operations (with leases enabled, obviously) was too +heavy handed.  Read operations on the client will ignore leases, +but do no special flag checking.
+
+The flag will be called DB_IGNORE_LEASE +and it will be a flag that can be OR'd into the DB access method and +cursor operation values.  It will be similar to the DB_READ_UNCOMMITTED +flag. +
+The methods that will +adhere to leases are:
+ +The code that will check leases for a client reading would look +something +like this, if we decide to become heavy-handed:
+
if (IS_REP_CLIENT(dbenv)) {
[get to rep structure]
if (FLD_ISSET(rep->config, REP_C_LEASE) && !LF_ISSET(DB_IGNORE_LEASE)) {
db_err("Read operations must ignore leases or go to master");
ret = EINVAL;
goto err;
}
}
+On the master, the new code to abide by leases is more complex.  +After the call to perform the operation we will check the lease.  +In that checking code, the master will see if it has a valid +lease.  If so, then all is well.  If not, it will try to +refresh the leases.  If that refresh attempt results in leases, +all is well.  If the refresh attempt does not get leases, then the +master cannot respond to the read as an authority and we return an +error.  The new error is called DB_REP_LEASE_EXPIRED.  +The location of the master lease check is down after the internal call +to read the data is successful:
+
if (IS_REP_MASTER(dbenv) && !LF_ISSET(DB_IGNORE_LEASE)) {
[get to rep structure]
if (FLD_ISSET(rep->config, REP_C_LEASE) &&
(ret = __rep_lease_check(dbenv)) != 0) {
/*
* We don't hold the lease.
*/
goto err;
}
}
+See below for the details of __rep_lease_check.
+
+Also note that if leases (or replication) are not configured, then DB_IGNORE_LEASE is a no-op.  It +is ignored (and won't error) if used when leases are not in +effect.  The reason is so that we can generically set that flag in +utility programs like db_dump +that walk the database with a cursor.  Note that db_dump is the only utility that +reads with a cursor.
+

Nsites +and Elections

+The call to dbenv->rep_set_nsites +must be performed before the call to dbenv->rep_start +or dbenv->repmgr_start.  +This document assumes either that SR +14778 gets resolved, or assumes that the value of nsites is +immutable.  The +master and all clients need to know how many sites and leases are in +the group.  Clients need to know for elections.  The master +needs to know for the size of the lease table and to know what value a +majority of the group is. [Until +14778 is resolved, the master lease work must assume nsites is +immutable and will +therefore enforce that this is called before rep_start using +the same mechanism +as rep_set_lease.]
+
+Elections and leases need to agree on the number of sites in the +group.  Therefore, when leases are in effect on clients, all calls +to dbenv->rep_elect must +set the nsites parameter to +0.  The rep_elect code +path will return EINVAL if REP_C_LEASE is set and nsites +is non-0. +

Lease Management

+

Message Changes

+In order for clients to grant leases to the master a new message type +must be added for that purpose.  This will be the REP_LEASE_GRANT +message.  +Granting leases will be a result of applying a DB_REP_PERMANENT +record and therefore we +do not need any additional message in order for a master to request a +lease grant.  The REP_LEASE_GRANT +message will pass a structure as its message DBT:
+
typedef struct __rep_lease_grant {
db_timespec msg_time;
#ifdef DIAGNOSTIC
db_timespec expire_time;
#endif
} REP_GRANT_INFO;
+In the REP_LEASE_GRANT +message, the client is actually giving the master several pieces of +information.  We only need the echoed msg_time in this +structure because +everything else is already sent.  The client is really sending the +master:
+ +On the client, we always maintain the maximum PERM LSN already in lp->max_perm_lsn.  +

Local State Management

+Each client must maintain a db_timespec +timestamp containing the expiration of its granted lease.  This +field will be in the replication shared memory structure:
+
db_timespec grant_expire;
+This timestamp already takes into account the clock skew.  All +new fields must be initialized when the region is created. Whenever we +grant our master lease and want to send the REP_LEASE_GRANT +message, this value +will be updated.  It will be used in the following way: +
db_timespec mytime;
DB_LSN perm_lsn;
DBT lease_dbt;
REP_GRANT_INFO gi;


timespecclear(&mytime);
memset(&lease_dbt, 0, sizeof(lease_dbt));
memset(&gi, 0, sizeof(gi));
__os_gettime(dbenv, &mytime);
timespecadd(&mytime, &rep->lease_duration);
MUTEX_LOCK(rep->clientdb_mutex);
perm_lsn = lp->max_perm_lsn;
MUTEX_UNLOCK(rep->clientdb_mutex);
REP_SYSTEM_LOCK(dbenv);
if (timespeccmp(&mytime, &rep->grant_expire, >))
	rep->grant_expire = mytime;
gi.msg_time = msg->msg_time;
#ifdef DIAGNOSTIC
gi.expire_time = rep->grant_expire;
#endif
lease_dbt.data = &gi;
lease_dbt.size = sizeof(gi);
REP_SYSTEM_UNLOCK(dbenv);
__rep_send_message(dbenv, eid, REP_LEASE_GRANT, &perm_lsn, &lease_dbt, 0, 0);
+This updating of the lease grant will occur in the PERM code +path when we have +successfully applied the permanent record.
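The key property of the update above is that a grant expiration only ever moves forward. The following is a minimal standalone sketch of that rule using `struct timespec`; the helper names `ts_later` and `extend_grant` are hypothetical and are not part of DB.

```c
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Return nonzero if *a is strictly later than *b. */
static int
ts_later(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return (a->tv_sec > b->tv_sec);
	return (a->tv_nsec > b->tv_nsec);
}

/*
 * Hypothetical helper: compute now + lease_duration and store it in
 * *grant_expire only if that moves the expiration forward.
 * Returns 1 if the grant was extended, 0 if it was left alone.
 */
static int
extend_grant(struct timespec *grant_expire, const struct timespec *now,
    const struct timespec *lease_duration)
{
	struct timespec candidate;

	/* Both inputs are assumed to be normalized. */
	assert(now->tv_nsec < NSEC_PER_SEC);
	assert(lease_duration->tv_nsec < NSEC_PER_SEC);

	candidate.tv_sec = now->tv_sec + lease_duration->tv_sec;
	candidate.tv_nsec = now->tv_nsec + lease_duration->tv_nsec;
	if (candidate.tv_nsec >= NSEC_PER_SEC) {
		candidate.tv_sec++;
		candidate.tv_nsec -= NSEC_PER_SEC;
	}
	if (!ts_later(&candidate, grant_expire))
		return (0);
	*grant_expire = candidate;
	return (1);
}
```

In this sketch the caller is assumed to hold whatever lock protects the shared grant timestamp, as the REP_SYSTEM_LOCK does in the fragment above.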
+

Maintaining Leases on the +Master/Rep_start

+The master maintains a lease table that it checks when fulfilling a +read request that is subject to leases.  This table is initialized +when a site calls +dbenv->rep_start(DB_MASTER) and the site is undergoing a role +change (i.e. a master making additional calls to dbenv->rep_start(DB_MASTER) +does +not affect an already existing table).
+
+When a non-master site becomes master, it must do two things related to +leases on a role change.  First, a client cannot upgrade to master +while it has an outstanding lease granted to another site.  If a +client attempts to do so, an error, EINVAL, +will be returned.  The only way this should happen is if the +application simply declares a site master, instead of using +elections.  Elections will already wait for leases to expire +before proceeding. (See below.) +
+
+Second, once we are proceeding with becoming a master, the site must +allocate the table it will use to maintain lease information.  +This table will be sized based on nsites +and it will be an array of the following structure:
+
typedef struct {
int eid; /* EID of client site. */
db_timespec start_time; /* Unique time ID client echoes back on grants. */
db_timespec end_time; /* Master's lease expiration time. */
DB_LSN lease_lsn; /* Durable LSN this lease applies to. */
u_int32_t flags; /* Unused for now?? */
} REP_LEASE_ENTRY;
+

Granting Leases

+It is the burden of the application to make sure that all sites in the +group +are using leases, or none are.  Therefore, when a client processes +a PERM +log record that arrived from the master, it will grant its lease +automatically if that record is permanent (i.e. DB_REP_ISPERM +is being returned), +and leases are configured.  A client will not send a +lease grant when it is processing log records (even PERM +ones) it receives from other clients that use client-to-client +synchronization.  The reason is that the master requires a unique +time-of-msg ID (see below) that the client echoes back in its lease +grant and it will not have such an ID from another client.
+
+The master stores a time-of-msg ID in each message and the client +simply echoes it back to the master.  In its lease table, the master +keeps the base +time-of-msg for a valid lease.  When a REP_LEASE_GRANT +message comes in, +the master does a number of things:
+
    +
  1. Pulls the echoed timespec from the client message, into msg_time.
    +
  2. Finds the entry in its lease table for the client's EID.  It +walks the table searching for the ID.  EIDs of DB_EID_INVALID are +illegal.  Either the master will find the entry, or it will find +an empty slot in the table (i.e. it is still populating the table with +leases).
    +
  3. If this is a previously unknown site lease, the master +initializes the entry by copying to the eid, start_time, and + lease_lsn fields.  The master +also computes the end_time +based on the adjusted rep->lease_duration.
    +
  4. If this is a lease from a previously known site, the master must +perform timespeccmp(&msg_time, +&table[i].start_time, >) and only update the end_time +of the lease when this is +a more recent message.  If it is a more recent message, then we +should update +the lease_lsn to the LSN in +the message.
    +
  5. Lease durations are computed taking the clock skew into +account: clients compute them based on the current time and the master +computes them based on the original sending time.  For diagnostic purposes +only, I also plan to send the client's expiration time.  The +client errs on the side of computing a larger lease expiration time and +the master errs on the side of computing a smaller duration.  +Since both are taking the clock skew +into account, the client's ending expiration time should never be +smaller than +the master's computed expiration time; if it is, their value for clock skew may +not be correct.
    +
+Any log records (new or resent) that originate from the master and +result in DB_REP_ISPERM get an +ack.
+
+

Refreshing Leases

+Leases get refreshed when a master receives a REP_LEASE_GRANT +message from a client. There are three pieces to lease +refreshment. 
+

Lazy Lease Refreshing on Read
+

+If the master discovers that leases are +expired during the read operation, it attempts to refresh its +collection of lease grants.  It does this by calling a new +function __rep_lease_refresh.  +This function is very similar to the already-existing function __rep_flush.  +Basically, to +refresh the lease, the master simply needs to resend the last PERM +record to the clients.  The requirements state that when the +application send function returns successfully from sending a PERM +record, the majority of clients have that PERM LSN durable.  We +will have a new public DB error return called DB_REP_LEASE_EXPIRED +that will be +returned back to the caller if the master cannot assert its +authority.  The code will look something like this:
+
/*
* Use lp->max_perm_lsn on the master (currently not used on the master)
* to keep track of the last PERM record written through the logging system.
* need to initialize lp->max_perm_lsn in rep_start on role_chg.
*/
call __rep_send_message on the last PERM record the master wrote, with DB_REP_PERMANENT
if failure
expire leases
return lease expired error to caller
else /* success */
recheck lease table
/*
* We need to recheck the lease table because the client
* lease grant messages may not be processed yet, or got
* lost, or racing with the application's ACK messages or
* whatever.
*/
if we have a majority of valid leases
return success
else
return lease expired error to caller
+

Ongoing Update Refreshment
+

+Second is having the master indicate to +the client it needs to send a lease grant in response to the current +PERM log message.  The problem is +that acknowledgements must contain a master-supplied message timestamp +that the client sends back to the master.  We need to modify the +structure of the  log record messages when leases are configured +so +that when a PERM message is sent, the master sends, and the client +expects, the message timestamp.  There are three fairly +straightforward and different implementations to consider.
+
    +
  1. Adding the timestamp to the REP_CONTROL +structure.  If this option is chosen, then the code trivially +sends back the timestamp in the client's reply.  There is no +special processing done by either side with the message contents.  +So, on a PERM log record, the master will send a non-zero +timestamp.  On a normal log record the timestamp will be zero or +some known invalid value.  If the client sees a non-zero +timestamp, it sends a REP_LEASE_GRANT +with the lp->max_perm_lsn +after applying that log record.  If it is zero, then the client +does nothing different.  The advantage is ease of code.  The +disadvantage is that for mixed version systems, the client is now +dealing with different sized control structures.  We would have to +retain the old control structure so that during a mixed version group +the (upgraded) clients can use, expect and send old control structures +to the master.  This is unfortunate, so let's consider additional +implementations that don't require modifying the control structure.
    +
  2. Adding a new REPCTL_LEASE +flag to the list of flags for the control structure, but do not change +the control structure fields.  When a master wants to send a +message that needs a lease ack, it sets the flag.  Additionally, +instead of simply sending a log record DBT as the rec parameter +for replication, we +would send a new structure that had the timestamp first and then the +record (similar to the bulk transfer buffer).  The advantage of +this is that the control structure does not change.  Disadvantages +include more special-cased code in the normal code path where we have +to check the flag.  If the flag is set we have to extract the +timestamp value and massage the incoming data to pass on the real log +record to rep_apply.  On +bulk transfer, we would just add the timestamp into the buffer.  +On normal transfers, it would incur an additional data copy on the +master side.  That is unfortunate.  Additionally, if this +record needs to be stored in the temp db, we need some way to get it +back again later or rep_apply +would have to extract the timestamp out when it processed the record +(either live or from the temp db).
    +
  3. Adding a different message type, such as REP_LOG_ACK.  +Similarly to REP_LOG_MORE this message would be a +special-case version of a log record.  We would extract out the +timestamp and then handle as a normal log record.  This +implementation is rejected because it actually would require three new +message types: REP_LOG_ACK, +REP_LOG_ACK_MORE, REP_BULK_LOG_ACK.  That is just too ugly +to contemplate.
    +
+[Slight digression: it occurs +to me while writing about #2 and #3 above, that our implementation of +all of the *_MORE messages could really be implemented with a REPCTL_MORE +flag instead of a +separate message type.  We should clean that up and simplify the +messages but not part of master leases. Hmm, taking that thought +process further, we really could get rid of the REP_BULK_* +messages as well if we +added a REPCTL_BULK +flag.  I think we should definitely do it for the *_MORE +messages.  I am not sure we should do it for bulk because the +structure of the incoming data record is vastly different.]
+
+Of these options, I believe that modifying the control structure is the +best alternative.  The handling of the old structure will be very +isolated to code dealing with old versions and is far less complicated +than injecting the timestamp into the log record DBT and doing a data +copy.  Actually, I will likely combine #1 and the flag from #2 +above.  I will have the REPCTL_LEASE +flag that indicates a lease grant reply is expected and have the +timestamp in the control structure.  +Also I will probably add in a spare field or two for future use in the REP_CONTROL +structure.
+

Gap processing

+No matter which implementation we choose for ongoing lease refreshment, +gap processing must be considered.  The code above assumes the +timestamps will be placed on PERM records only.  Normal log +records will not have a timestamp, nor a flag or anything else like +that.  However, any log message can fill a gap on a client and +result in the processing of that normal log record to return DB_REP_ISPERM +because later records +were also processed.
+
+The current implementation should work fine in that case because when +we store the message in the client temp db we store both the control +DBT and the record DBT.  Therefore, when a normal record fills a +gap, the later PERM record, when retrieved will look just like it did +when it arrived.  The client will have access to the LSN, and the +timestamp, etc.  However, it does mean that sending the REP_LEASE_GRANT +message must take +place down in __rep_apply +because that is the only place we have access to the contents of those +stored records with the timestamps.
+
+There are two logical choices to consider for granting the lease when +processing an update.  As we process (either a live record or one +read from the temp db after filling a gap) a PERM message, we send the REP_LEASE_GRANT +message for each +PERM record we successfully apply.  Or, second, we keep track of +the largest timestamp of all PERM records we've processed and at the +end of the function after we've applied all records, we send back a +single lease grant with the max_perm_lsn +and a new max_lease_timestamp +value to the master.  The first is easier to implement, the second +results in possibly slightly fewer messages at the expense of more +bookkeeping on the client.
+
+A third, more complicated option would be to have the message timestamp +on all records, but grants are only sent on the PERM messages.  A +reason to do this is that the later timestamp of a normal log record +would be used as the timestamp sent in the reply and the master would +get a more up to date timestamp value and a longer lease. 
+
+If we change the REP_CONTROL +structure to include the timestamp, we potentially break or at least +need to revisit the gap processing algorithm.  That code assumes +that the control and record elements for the same LSN look the same +each and every time.  The code stores the control DBT as the key and the rec DBT as the data.  We use a +specialized compare function to sort based on the LSN in the control +DBT.  With master leases, the same record transmitted by a master +multiple times or client for the same LSN will be different because the +timestamp field will not be the same.  Therefore, the client will +end up with duplicate entries in the temp database for the same +LSN.  Both solutions (adding the timestamp to REP_CONTROL and adding a REPCTL_LEASE flag) can yield +duplicate entries.  The flag would cause the same record from the +master and client to be different as well.
+

Handling Incoming Lease Grants
+

+The third piece of lease management is handling the incoming REP_LEASE_GRANT +message on the +master.  When this message is received, the master must do the +following:
+
REP_SYSTEM_LOCK
msg_timestamp = cntrl->timestamp;
client_lease = __rep_lease_entry(dbenv, client eid)
if (client_lease == NULL) {
	initial lease for this site, DB_ASSERT there is space in the table
	add this to the table if there is space
} else {
	compare msg_timestamp with client_lease->start_time
	if (msg_timestamp is more recent && msg_lsn >= lease LSN)
		update entry in table
}
REP_SYSTEM_UNLOCK
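The table update described above can be sketched as a small standalone function. This is an illustration only: the types `lease_time`, `lease_lsn` and `lease_entry`, the `LEASE_EID_EMPTY` marker and the function `lease_grant_apply` are hypothetical names, locking is omitted, and the table is assumed to fill from the front with no slots ever freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEASE_EID_EMPTY (-1)	/* Hypothetical marker for an unused slot. */

typedef struct { int64_t sec; int64_t nsec; } lease_time;
typedef struct { uint32_t file; uint32_t offset; } lease_lsn;

typedef struct {
	int eid;		/* Client site, or LEASE_EID_EMPTY. */
	lease_time start_time;	/* Timestamp the client echoed back. */
	lease_time end_time;	/* Master's view of the expiration. */
	lease_lsn lsn;		/* LSN the lease applies to. */
} lease_entry;

static int
time_cmp(lease_time a, lease_time b)
{
	if (a.sec != b.sec)
		return (a.sec < b.sec ? -1 : 1);
	if (a.nsec != b.nsec)
		return (a.nsec < b.nsec ? -1 : 1);
	return (0);
}

static int
lsn_cmp(lease_lsn a, lease_lsn b)
{
	if (a.file != b.file)
		return (a.file < b.file ? -1 : 1);
	if (a.offset != b.offset)
		return (a.offset < b.offset ? -1 : 1);
	return (0);
}

/*
 * Record a grant from client eid.  Returns 1 if the table changed,
 * 0 if the grant was stale, -1 if no slot was available.
 */
static int
lease_grant_apply(lease_entry *table, size_t nslots, int eid,
    lease_time msg_time, lease_lsn msg_lsn, lease_time duration)
{
	lease_entry *e;
	size_t i;

	assert(eid != LEASE_EID_EMPTY);
	for (i = 0; i < nslots; i++) {
		e = &table[i];
		if (e->eid == LEASE_EID_EMPTY)
			e->eid = eid;	/* First grant from this site. */
		else if (e->eid != eid)
			continue;	/* Some other site's entry. */
		else if (time_cmp(msg_time, e->start_time) <= 0 ||
		    lsn_cmp(msg_lsn, e->lsn) < 0)
			return (0);	/* Older message: keep the entry. */
		/* New or more recent grant: refresh the entry. */
		e->start_time = msg_time;
		e->end_time.sec = msg_time.sec + duration.sec;
		e->end_time.nsec = msg_time.nsec + duration.nsec;
		if (e->end_time.nsec >= 1000000000) {
			e->end_time.sec++;
			e->end_time.nsec -= 1000000000;
		}
		e->lsn = msg_lsn;
		return (1);
	}
	return (-1);
}
```

Note that the end time is derived from the echoed message time rather than from the time the grant arrives, which is what makes the master's view of the lease the conservative one.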
+

Expiring Leases

+Leases can expire in two ways.  First they can expire naturally +due to the passage of time.  When checking leases, if the current +time is later than the lease entry's end_time +then the lease is expired.  Second, they can be forced with a +premature expiration when the application's transport function returns +an error.  In the first case, there is nothing to do, in the +second case we need to manipulate the end_time +so that all future lease checks fail.  Since the lease start_time +is guaranteed to not be in the future we will have a function __rep_lease_expire +that will:
+
REP_SYSTEM_LOCK
for each entry in the lease table
entry->end_time = entry->start_time;
REP_SYSTEM_UNLOCK
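A minimal sketch of forced expiration follows, assuming simplified stand-in types. The names `lease_time`, `lease_entry`, `lease_unexpired` and `lease_expire_all` are hypothetical, and the system lock is left out.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { long sec; long nsec; } lease_time;
typedef struct { lease_time start_time; lease_time end_time; } lease_entry;

/* A lease is unexpired while the current time has not passed end_time. */
static int
lease_unexpired(const lease_entry *e, lease_time now)
{
	if (e->end_time.sec != now.sec)
		return (e->end_time.sec > now.sec);
	return (e->end_time.nsec >= now.nsec);
}

/*
 * Force every lease to look expired by pulling end_time back to
 * start_time, which is never in the future.
 */
static void
lease_expire_all(lease_entry *table, size_t n)
{
	size_t i;

	assert(table != NULL || n == 0);
	for (i = 0; i < n; i++)
		table[i].end_time = table[i].start_time;
}
```

Because the entries keep their start_time, a later grant carrying a more recent timestamp can still revive an entry, which is the situation the race discussion below is concerned with.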
+Is there a potential race or problem with prematurely expiring +leases?  Consider an application that enforces an ALL +acknowledgement policy for PERM records in its transport +callback.  There are four clients and three send the PERM ack to +the application.  The callback returns an error to the master DB +code.  The DB code will now prematurely expire its leases.  +However, at approximately the same time the three clients are also +sending their REP_LEASE_GRANT +messages to the master.  There is a race between the master +processing those messages and the thread handling the callback failure +expiring the table.  This is only an issue if the messages arrive +after the table has been expired.
+
+Let's assume all three clients send their grants after the master +expires the table.  If we accept those grants and then a read +occurs the read will succeed since the master has a majority of leases +even though the callback failed earlier.  Is that a problem?  +The lease code is using a majority and the application policy is using +some other value.  It feels like this should be okay since +the data is held by leases on a majority.  Should we consider +having the lease checking threshold be the same as the permanent ack +policy?  That is difficult because Base API users implement +whatever they want and DB does not know what it is.
+

Checking Leases

+When a read operation on the master completes, the last thing we need +to do is verify the master leases.  We've already discussed +refreshing them when they are expired above.  We need two things +for a lease to be valid.  It must be within the timeframe of the +lease grant and the lease must be valid for the last PERM record +LSN.  Here is the logic +for checking the validity of leases in __rep_lease_check:
+
#define MAX_REFRESH_TRIES	3
DB_LSN lease_lsn;
REP_LEASE_ENTRY *entry;
u_int32_t min_leases, valid_leases;
db_timespec cur_time;
int ret, tries;

tries = 0;
retry:
ret = 0;
LOG_SYSTEM_LOCK
lease_lsn = lp->lsn
LOG_SYSTEM_UNLOCK
REP_SYSTEM_LOCK
min_leases = rep->nsites / 2;
__os_gettime(dbenv, &cur_time);
for (entry = head of table, valid_leases = 0; entry != NULL && valid_leases < min_leases; entry++)
if (timespeccmp(&entry->end_time, &cur_time, >=) && log_compare(&entry->lease_lsn, &lease_lsn) == 0)
valid_leases++;
REP_SYSTEM_UNLOCK
if (valid_leases < min_leases) {
ret =__rep_lease_refresh(dbenv, ...);
/*
* If we are successful, we need to recheck the leases because
* the lease grant messages may have raced with the PERM
* acknowledgement. Give those messages a chance to arrive.
*/
if (ret == 0) {
if (tries <= MAX_REFRESH_TRIES) {
/*
* If we were successful sending, but not successful in racing the
* message thread, yield the processor so that message
* threads may have a chance to run.
*/
if (tries > 0)
/* __os_sleep instead?? */
__os_yield()
tries++;
goto retry;
} else
ret = DB_REP_LEASE_EXPIRED;
}
}
return (ret);
+If the master has enough valid leases it returns success.  If it +does not have enough, it attempts to refresh them.  This attempt +may fail if sending the PERM record does not receive sufficient +acks.  If we do receive sufficient acknowledgements we may still +find that scheduling of message threads means the master hasn't yet +processed the incoming REP_LEASE_GRANT +messages.  We will retry a couple of times (possibly +parameterized) if the master discovers that situation. 
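The counting step at the heart of the check can be isolated as a sketch like the one below. The types and the names `lease_is_valid` and `have_enough_leases` are hypothetical, and the refresh and retry logic is deliberately left out. The threshold of nsites / 2 follows the logic above; reading it as "the master counts itself toward the majority" is an assumption of this sketch, not something the text states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int64_t sec; int64_t nsec; } lease_time;
typedef struct { uint32_t file; uint32_t offset; } lease_lsn;
typedef struct { lease_time end_time; lease_lsn lsn; } lease_entry;

/* Valid means: not yet expired and granted for exactly the wanted LSN. */
static int
lease_is_valid(const lease_entry *e, lease_time now, lease_lsn want)
{
	int unexpired;

	if (e->end_time.sec != now.sec)
		unexpired = e->end_time.sec > now.sec;
	else
		unexpired = e->end_time.nsec >= now.nsec;
	return (unexpired &&
	    e->lsn.file == want.file && e->lsn.offset == want.offset);
}

/* Return 1 if at least nsites / 2 table entries hold a valid lease. */
static int
have_enough_leases(const lease_entry *table, size_t nentries,
    uint32_t nsites, lease_time now, lease_lsn want)
{
	uint32_t min_leases, valid;
	size_t i;

	assert(nsites > 0);
	min_leases = nsites / 2;
	/* Stop scanning as soon as the threshold is reached. */
	for (i = 0, valid = 0; i < nentries && valid < min_leases; i++)
		if (lease_is_valid(&table[i], now, want))
			valid++;
	return (valid >= min_leases);
}
```

For a five site group this requires two valid client leases, so one expired lease and one lease for an older LSN out of four clients still leaves the master able to answer authoritatively.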
+

Elections

+When a client grants a lease to a master, it gives up the right to +participate in an election until that grant expires.  If we are +the master and dbenv->rep_elect +is called, it should return, no matter what, like it does today.  +If we are a client and rep_elect +is called special processing takes place when leases are in +effect.  First, the easy case is if the lease granted by this +client has already expired, then the client goes directly into the +election as normal.  If a valid lease grant is outstanding to a +master, this site cannot participate in an election until that grant +expires.  We have at least two options when a site calls the dbenv->rep_elect +API while +leases are in effect.
+
  1. The simplest coding solution for DB would be simply to refuse to participate in the election if this site has a current lease granted to a master. We would detect this situation and return EINVAL. This is correct behavior and trivial to implement. The disadvantage of this solution is that the application would then be responsible for repeatedly attempting an election until the lease grant expired.
  2. The more satisfying solution is for DB to wait the remaining time for the grant. If this client hears from the master during that time the election does not take place and the call to rep_elect returns with the information for the current/old master.
+

Election Code Changes

+The code changes to support leases in the election code are fairly isolated. First, if leases are configured, we must verify the nsites parameter is set to 0. Second, in __rep_elect_init we must not overwrite the value of rep->nsites for leases because it is controlled by the dbenv->rep_set_nsites API. These changes are small and easy to understand.
+
+The more complicated code will be the client code when it has an outstanding lease granted. The client will wait for the current lease grant to expire before proceeding with the election. The client will only do so if it does not hear from the master for the remainder of the lease grant time. If the client hears from the master, it returns and does not begin participating in the election. A new election phase, REP_EPHASE0, will exist so that the call to __rep_wait can detect if a master responds. The client, while waiting for the lease grant to expire, will send a REP_MASTER_REQ message so that the master will respond with a REP_NEWMASTER message and thus allow the client to know the master exists. However, it is also desirable that when the master replies to the client, it also prompts the client to update its lease grant.
+
+Recall that the REP_NEWMASTER message does not result in a lease grant from the client. The client responds with its lease grant, up to the given LSN, when it processes a PERM record that has the REPCTL_LEASE flag set in the message. Therefore, we want the client's REP_MASTER_REQ to yield both the discovery of the existing master and a refresh of the master's leases. The client will also use the REPCTL_LEASE flag in its REP_MASTER_REQ message to the master. This flag will serve as the indicator to the master that it needs to deal with leases and both send the REP_NEWMASTER message and refresh the lease.
+The code will work as follows:
+
if (leases_configured && (my_grant_still_valid || lease_never_granted)) {
    if (lease_never_granted)
        wait_time = lease_timeout;
    else
        wait_time = grant_expiration - current_time;
    F_SET(REP_F_EPHASE0);
    __rep_send_message(..., REP_MASTER_REQ, ... REPCTL_LEASE);
    ret = __rep_wait(..., REP_F_EPHASE0);
    if (we found a master)
        return;
} /* if we don't return, fall out and proceed with election */
+On the master side, the code handling the REP_MASTER_REQ will do:
+
if (I am master) {
    ...
    __rep_send_message(REP_NEWMASTER...);
    if (F_ISSET(rp, REPCTL_LEASE))
        __rep_lease_refresh(...);
}
+Other minor implementation details are that __rep_elect_done must also clear the REP_F_EPHASE0 flag. We also, obviously, need to define REP_F_EPHASE0 in the list of replication flags. Note that the client's call to __rep_wait will return upon receiving the REP_NEWMASTER message. The client will independently refresh its lease when it receives the log record from the master's call to refresh the lease.
+
+Again, similar to what I suggested above, the code could simply assume global leases are configured and, instead of having the REPCTL_LEASE flag at all, the master assumes that it needs to refresh leases because it has them configured, not because it is specified in the REP_MASTER_REQ message it is processing. Right now I don't think every possible REP_MASTER_REQ message should result in a lease grant request.
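The client-side wait computation from the election pseudocode can be written as a small pure function, which makes the three cases easy to check in isolation. This is a sketch only: `struct client_lease` and `pre_election_wait` are hypothetical names, and times are plain integers in an arbitrary unit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of a client's lease state; names are illustrative. */
struct client_lease {
    bool configured;          /* leases are configured on this site */
    bool ever_granted;        /* this client has granted at least one lease */
    long grant_expiration;    /* when the current grant expires */
    long lease_timeout;       /* configured lease timeout */
};

/*
 * Return how long the client waits for a master before the election.
 * A result of 0 means it proceeds directly to the election.
 */
static long
pre_election_wait(const struct client_lease *cl, long now)
{
    assert(cl != NULL);
    if (!cl->configured)
        return (0);
    if (!cl->ever_granted)
        return (cl->lease_timeout);             /* never granted: full timeout */
    if (cl->grant_expiration > now)
        return (cl->grant_expiration - now);    /* remaining grant time */
    return (0);                                 /* grant already expired */
}
```

For example, a client with a 500-unit timeout that has never granted a lease waits the full 500, while one whose grant expires at 1200 waits 200 when asked at time 1000.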
+

Elections and Quiescent Systems

+It is possible that a master is slow or the client is close to its expiration time, or that the master is quiescent and all leases are currently expired, but nothing much is going on anyway, yet some client calls __rep_elect at that time. In the code above, we will not send the REP_MASTER_REQ because the lease is not valid. The client will simply proceed directly to sending the REP_VOTE1 message, throwing all other clients into an election. The master is still master and should stay that way. Currently, in response to a vote message, a master will broadcast a REP_NEWMASTER to assert its mastership. That causes the election to complete. However, if desired, the master may want to proactively refresh its leases. This situation indicates to me that the master should choose to refresh leases based on configuration, not a flag sent from the client. I believe that any time the master asserts its mastership by sending a REP_NEWMASTER message, I need to add code to proactively refresh leases at that time.
+

Other Implementation Details

+

Role Changes
+

+When a site changes its role via a call to rep_start in either direction, we must take action when leases are configured. There are three types of role changes that all need changes to deal with leases:
+
  1. A master downgrading to a client. When a master downgrades to a client, it can do so immediately after it has proactively expired all existing leases it holds. This situation is similar to an error from the send callback, and it effectively cancels all outstanding leases held on this site. Note that if this master expires its leases, it does not have any effect on when the clients' lease grants expire on the client side. The clients must still wait their full expected grant time.
  2. A client upgrading to master. If a client is upgrading to a master but it has an outstanding lease granted to another site, the code will return an EINVAL error. This situation only arises if the application simply declares this site master. If a site wins an election then the election itself should have waited long enough for the granted lease to expire and this state should not arise then.
  3. A client finding a new master. When a client discovers a new and different master, via a REP_NEWMASTER message, then the client cannot accept that new master until its current lease grant expires. This situation should only occur when a site declares itself master without an election and that site's lease grant expires before this client's grant expires. However, it is possible for this situation to arise with elections also. If we have 5 sites holding an election and 4 of those sites have leases expire at about the same time T, and this site's lease expires at time T+N and the election timeout is less than N, then those 4 sites may hold an election and elect a master without this site's participation. A client in this situation must call __rep_wait with the time remaining on its lease. If the lease is expired after waiting the remaining time, then the client can accept this new master. If the lease was refreshed during the waiting period then the client does not accept this new master and returns.
+
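The three role-change rules above reduce to a few small decisions, sketched here in C. All names are hypothetical, times are plain integers, the choice to treat a grant as expired once the clock reaches its expiration is an assumption, and marking a held lease expired by zeroing its end time assumes the clock is always positive.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* A client's grant is outstanding until the clock reaches its expiration. */
static bool
grant_outstanding(long grant_expiration, long now)
{
    return (grant_expiration > now);
}

/* Master to client: proactively expire every lease held; returns the count. */
static size_t
expire_held_leases(long *end_times, size_t n)
{
    size_t i;

    assert(end_times != NULL || n == 0);
    for (i = 0; i < n; i++)
        end_times[i] = 0;    /* 0 stands for "already expired" in this sketch */
    return (n);
}

/* Client to master by declaration: refuse while a grant is outstanding. */
static int
check_upgrade_to_master(long grant_expiration, long now)
{
    return (grant_outstanding(grant_expiration, now) ? EINVAL : 0);
}

/*
 * Client finding a different master: called after the client has waited
 * out the time that remained on its grant.  If the grant was refreshed
 * during the wait, the old master is still alive and the new master is
 * not accepted.
 */
static bool
accept_new_master_after_wait(long grant_expiration_now, long now)
{
    return (!grant_outstanding(grant_expiration_now, now));
}
```

Note that expiring the master's held leases changes nothing on the clients: each client's own grant expiration still governs `check_upgrade_to_master` and `accept_new_master_after_wait`.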

DUPMASTER

+A duplicate master situation can occur if an old master becomes disconnected from the rest of the group, that group elects a new master and then the partition is resolved. The requirement for master leases is that this situation will not cause the newly elected, rightful master to receive the DB_REP_DUPMASTER return. It is okay for the old master to get that return value. When a dual master situation exists, the following will happen:
+ +

Client to Client Synchronization

+One question to ask is how lease grants interact with client-to-client synchronization. The only answer is that they do not. A client that is sending log records to another client cannot request that the receiving client refresh its lease with the master. That client does not have a timestamp it can use for the master, and clock skew makes it meaningless between machines. Therefore, sites that use client-to-client synchronization will likely see more lease refreshment during the read path, and leases will be refreshed during live updates only. Of course, if a client supplies log records that fill a gap, and the later log records stored came from the master in a live update, then the client will respond as per the discussion on Gap Processing above.
+

Interaction Matrix

+If leases are granted (by a client) or held (by a master) what should the following APIs and messages do?
+
+Other:
+log_archive: Leases do not affect log_archive.  OK.
+dbenv->close: OK.
+crash during lease grant and restart: Potential problem here. See discussion below.
+
+Rep Base API method:
+rep_elect: Already discussed above. Must wait for lease to expire.
+rep_flush: Master only, OK - this will be the basis for refreshing leases.
+rep_get_*: Not affected by leases.
+rep_process_message: Generally OK. We'll discuss each message below.
+rep_set_config: OK.
+rep_set_limit: OK.
+rep_set_nsites: Must be called before rep_start and nsites is immutable until 14778 is resolved.
+rep_set_priority: OK.
+rep_set_timeout: OK. Used to set lease timeout.
+rep_set_transport: OK.
+rep_start(MASTER): Role changes are discussed above. Make sure duplicate rep_start calls are no-ops for leases.
+rep_start(CLIENT): Role changes are discussed above. Make sure duplicate calls are no-ops for leases.
+rep_stat: OK.
+rep_sync: Should not be able to happen. Client cannot accept new master with outstanding lease grant. Add DB_ASSERT here.
+
+REP_ALIVE: OK.
+REP_ALIVE_REQ: OK.
+REP_ALL_REQ: OK.
+REP_BULK_LOG: OK.  Clients check to send ACK.
+REP_BULK_PAGE: Should never process one with lease granted. Add DB_ASSERT.
+REP_DUPMASTER: Should never happen, this is what leases are supposed to prevent. See above.
+REP_LOG: OK.  Clients check to send ACK.
+REP_LOG_MORE: OK.  Clients check to send ACK.
+REP_LOG_REQ: OK.
+REP_MASTER_REQ: OK.
+REP_NEWCLIENT: OK.
+REP_NEWFILE: OK.  Clients check to send ACK.
+REP_NEWMASTER: See above.
+REP_NEWSITE: OK.
+REP_PAGE: OK. Should never process one with lease granted. Add DB_ASSERT.
+REP_PAGE_FAIL: OK. Should never process one with lease granted. Add DB_ASSERT.
+REP_PAGE_MORE: OK. Should never process one with lease granted. Add DB_ASSERT.
+REP_PAGE_REQ: OK.
+REP_REREQUEST: OK.
+REP_UPDATE: OK. Should never process one with lease granted. Add DB_ASSERT.
+REP_UPDATE_REQ: OK. This is a master-only message.
+REP_VERIFY: OK. Should never process one with lease granted. Add DB_ASSERT.
+REP_VERIFY_FAIL: OK. Should never process one with lease granted. Add DB_ASSERT.
+REP_VERIFY_REQ: OK.
+REP_VOTE1: OK. See Election discussion above. It is possible to receive one with a lease granted. Client cannot send one with an outstanding lease however.
+REP_VOTE2: OK. See Election discussion above. It is possible to receive one with a lease granted.
+
+If the following method or message processing is in progress and a client wants to grant a lease, what should it do? Let's examine what this means. The client wanting to grant a lease simply means it is responding to the receipt of a REP_LOG (or its variants) message and applying a log record. Therefore, we need to consider a thread processing a log message racing with these other actions.
+
+Other:
+log_archive: OK. 
+dbenv->close: User error. User should not be closing the env while other threads are using that handle. Should have no effect if a 2nd dbenv handle to same env is closed.
+
+Rep Base API method:
+rep_elect: See Election discussion above. rep_elect should wait and may grant lease while election is in progress.
+rep_flush: Should not be called on client.
+rep_get_*: OK.
+rep_process_message: Generally OK. See handling each message below.
+rep_set_config: OK.
+rep_set_limit: OK.
+rep_set_nsites: Must be called before rep_start until 14778 is resolved.
+rep_set_priority: OK.
+rep_set_timeout: OK.
+rep_set_transport: OK.
+rep_start(MASTER): OK, can't happen - already protect racing rep_start and rep_process_message.
+rep_start(CLIENT): OK, can't happen - already protect racing rep_start and rep_process_message.
+rep_stat: OK.
+rep_sync: Shouldn't happen because client cannot grant leases during sync-up. Incoming log message ignored.
+
+REP_ALIVE: OK.
+REP_ALIVE_REQ: OK.
+REP_ALL_REQ: OK.
+REP_BULK_LOG: OK.
+REP_BULK_PAGE: OK. Incoming log message ignored during internal init.
+REP_DUPMASTER: Shouldn't happen.  See DUPMASTER discussion above.
+REP_LOG: OK.
+REP_LOG_MORE: OK.
+REP_LOG_REQ: OK.
+REP_MASTER_REQ: OK.
+REP_NEWCLIENT: OK.
+REP_NEWFILE: OK.
+REP_NEWMASTER: See above. If a client accepts a new master because its lease grant expired, and that master then sends a message requesting the lease grant, this client will not process the log record if it is in sync-up recovery, or it may process it after the master switch is complete if the client doesn't need sync-up recovery. Basically, this just uses the existing log record processing/newmaster infrastructure.
+REP_NEWSITE: OK.
+REP_PAGE: OK. Receiving a log record during internal init PAGE phase should ignore log record.
+REP_PAGE_FAIL: OK.
+REP_PAGE_MORE: OK.
+REP_PAGE_REQ: OK.
+REP_REREQUEST: OK.
+REP_UPDATE: OK. Receiving a log record during internal init should ignore log record.
+REP_UPDATE_REQ: OK - master-only message.
+REP_VERIFY: OK. Receiving a log record during verify phase ignores log record.
+REP_VERIFY_FAIL: OK.
+REP_VERIFY_REQ: OK.
+REP_VOTE1: OK. This client is processing someone else's vote when the lease request comes in. That is fine. We protect our own election and lease interaction in __rep_elect.
+REP_VOTE2: OK.
+

Crashing - Potential Problem
+

+It appears there is one area where we could have a problem. I believe that crashes can cause us to break our guarantee on durability, authoritative reads and inability to elect duplicate masters. Consider this scenario:
+
  1. A master and 4 clients are all up and running.
  2. The master commits a txn and all 4 clients refresh their lease grants at time T.
  3. All 4 clients have the txn and log records in the cache. None are flushing to disk.
  4. All 4 clients have responded to the PERM messages as well as refreshed their lease with the master.
  5. All 4 clients hit the same application coding error and crash (machine/OS stays up).
  6. Master authoritatively reads data in txn from step 2.
  7. All 4 clients restart the application and run recovery, thus the txn from step 2 is lost on all clients because it isn't in any logs.
  8. A network partition happens and the master is alone on its side.
  9. All 4 clients are on the other side and elect a new master.
  10. Partition resolves itself and we have duplicate masters, where the former master still holds all valid lease grants.
+Therefore, we have broken both guarantees. In step 6 the data is really not durable and we've given it to the user. One can argue that if this is an issue the application better be syncing somewhere if they really want durability. However, worse than that is that we have a legitimate DUPMASTER situation in step 10 where both masters hold valid leases. The reason is that all lease knowledge is in the shared memory and that is lost when the app restarts and runs recovery.
+
+How can we solve this? The obvious solution is (ugh, yet another) durable BDB-owned file with some information in it, such as the current lease expiration time so that rebooting after a crash leaves the knowledge that the lease was granted. However, writing and syncing every lease grant on every client out to disk is far too expensive.
+
+A second possible solution is to have clients wait a full lease timeout before entering an election the first time. This solution solves the DUPMASTER issue, but not the non-authoritative read. It falls naturally out of how elections and leases already interact. If a client has never granted a lease, it should be considered as having to wait a full lease timeout before entering an election. Applications already know that leases impact elections and this does not seem so bad as it is only on the first election.
+
+Is it sufficient to document that the authoritative read is only as authoritative as the durability guarantees the application makes on the sites that indicate a record is permanent? Yes, I believe this is sufficient. If the application says it is permanent and it really isn't, then the application is at fault. Believing the application when it indicates with the PERM response that it is permanent avoids the authoritative read problem.
+

Upgrade/Mixed Versions

+Clearly leases cannot be used with mixed version sites since masters running older releases will not have any knowledge of lease support. What considerations are needed in the lease code for mixed versions?
+
+First, if the REP_CONTROL structure changes, we need to maintain and use an old version of the structure for talking to older clients and masters. The implementation of this would be similar to the way we manage for old REP_VOTE_INFO structures. Second, any new messages need translation table entries added. Third, if we are assuming global leases then clearly any mixed versions cannot have leases configured, and leases cannot be used in mixed version groups. Maintaining two versions of the control structure is not necessary if we choose a different style of implementation and don't change the control structure.
+
+However, then how could an old application both run continuously, upgrade to the new release and take advantage of leases without taking down the entire application? I believe it is possible for clients to be configured for leases but be subject to the master regarding leases, yet the master code can assume that if it has leases configured, all client sites do as well. In several places above I suggested that a client could make a choice based on either a new REPCTL_LEASE flag or simply having leases turned on locally. If we choose to use the flag, then we can support leases with mixed versions. The upgraded clients can configure leases, and those leases simply will not be granted until the old master is upgraded and sends a PERM message with the flag indicating it wants a lease grant. Until then, the clients, while having leases configured, will simply have an expired lease. Then, when the old master finally upgrades, it too can configure leases and suddenly all sites are using them. I believe this should work just fine and I will need to make sure a client's granting of leases is only in response to the master asking for a grant. If the master never asks, then the client has them configured, but doesn't grant them.
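The rule that a client grants a lease only when the master asks for one can be sketched as a single guard on the incoming record's control flags. The flag macros below are stand-ins with made-up values (the text does not give the real ones), `struct grant_state` and `maybe_grant_lease` are hypothetical, and computing the expiration from the local clock is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag bits for this sketch only; the values are invented. */
#define SK_REPCTL_PERM     0x01u
#define SK_REPCTL_LEASE    0x02u

/* Hypothetical per-client grant state. */
struct grant_state {
    bool configured;          /* leases are configured locally */
    long grant_expiration;    /* stays 0 until the first grant */
};

/*
 * Look at the control flags of an incoming log record.  Grant (or
 * refresh) the lease only if the record is a PERM record and the
 * master set the lease flag.  Returns true if a lease was granted.
 */
static bool
maybe_grant_lease(struct grant_state *gs, unsigned flags, long now,
    long lease_timeout)
{
    assert(gs != NULL);
    if (!gs->configured)
        return (false);
    if ((flags & SK_REPCTL_PERM) == 0 || (flags & SK_REPCTL_LEASE) == 0)
        return (false);    /* the master did not ask for a grant */
    gs->grant_expiration = now + lease_timeout;
    return (true);
}
```

An old master never sets the lease flag, so an upgraded client with leases configured keeps `grant_expiration` at 0 and behaves as if its lease had expired, which matches the upgrade path described above.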
+

Testing

+Clearly any user-facing API changes will need the equivalent reflection in the Tcl API for testing, under CONFIG_TEST.
+
+I am sure the list of tests will grow but off the top of my head:
+Basic test: have N sites all configure leases, run some, read on master, etc.
+Refresh test: Perform update on master, sleep until past expiration, read on master and make sure leases are refreshed/read successful.
+Error test: Test error conditions (reading on client with leases but no ignore flag, calling after rep_start, etc).
+Read test: Test reading on both client and master both with and without the IGNORE flag. Test that data read with the ignore flag can be rolled back.
+Dupmaster test: Force a DUPMASTER situation and verify that the newer master cannot get DUPMASTER error.
+Election test: Call election while grant is outstanding and master exists.
+Call election while grant is outstanding and master does not exist.
+Call election after expiration on quiescent system with master existing.
+Run with a group where some members have leases configured and others do not to make sure we get errors instead of dumping core.
+
+
+