Slide 26 of 66
jaez

Could someone clarify what "one invalidation per lock release" means exactly in this case? Is it referring to the atomic increment on l->next_ticket or l->now_serving++ in unlock, or neither? Thanks.

czh

Lock release refers to unlock(), if I'm not mistaken.

shaan0924

With this ticket implementation, is there now a sufficient provision for fairness? It seems like now processors can only hold one ticket when waiting to acquire a lock, and then once they are done they have to go to the "back of the line". Is this enough?

csmith23

Could you switch the != to a < comparison to implement a low-level semaphore?

ardenma

@jaez it's my understanding that "one invalidation per lock release" means that we only have to send one cache invalidation message each time we release the lock. Unfortunately there's no nice diagram for this case, but I think the operations would go something like this: processor 0 atomically increments l->next_ticket to get my_ticket == 0, sees that it matches l->now_serving, progresses past the lock() function to do some work, and then calls unlock(). While processor 0 is doing work, the other processors could also have reached lock(), in which case they would all have taken tickets and read the value of l->now_serving into their caches (now_serving == 0 currently) using BusRd, and will just continually check the cached value of now_serving. Then once processor 0 calls unlock(), it has to BusRdX now_serving so that it can update the value to 1, which requires one invalidation message sent across the bus to all the other caches; each waiting processor then re-reads now_serving once, so O(P) total interconnect traffic on release. Then processor 1 acquires the lock and so on.

I think the key difference between this and something like the test-and-test-and-set (tts for short) lock is that when the lock is released in tts, the processor holding the lock again has to BusRdX the lock variable to set its value to 0 (one invalidation message, O(P) interconnect traffic), but now every other processor will call test_and_set(), which uses a BusRdX, meaning one invalidation message per call to test_and_set(). That's O(P) traffic per processor, giving O(P^2) total traffic per unlock (if every processor has the lock cached).

Hopefully I got this mostly right, but if anyone sees a mistake in what I said please point it out.

tennis_player

@csmith23 I think the logic would be a little different. A semaphore allows a maximum number of threads to hold it at once. In this case, we are incrementing the next_ticket member an unbounded number of times (ignoring integer overflow). A semaphore implementation would have to increment and decrement its counter to maintain a specific count of threads.

Here’s another way to think about it. Say you want your semaphore counter to be at most 10. Then, imagine 20 threads call Lock() at the same time. If we wait while l->now_serving < my_ticket, we would still need to wait for all previous threads to call unlock(). Instead, we want to let 10 threads run at once and unblock a new thread as soon as one is finished.

Kecleon

To make another analogy, the spinning case would be as if customers line up to check whether any table is empty, and each gets one chance to check and take a table.

beste

I am still confused as to why the ticket system is better than the original test-and-set. Is it because after we "take a ticket" we don't leave the lock function until our ticket gets serviced?

amagibaba

From my understanding,

In Ticket Lock: If a processor wants to lock(), it simply reads lock.next_ticket into its cache, atomically increments it, and then doesn't care what happens to lock.next_ticket thereafter. This requires 1 PrWr/BusRdX (Invalid to Modified) transaction. Whichever processors happen to have had lock.next_ticket in their caches will have to invalidate it (Modified/Shared to Invalid). This is O(P) in terms of total instructions ACROSS processors, but to the bus, it's just 1 BusRdX call: O(1). To each competing processor, it's also just an invalidation: O(1).

If a processor unlocks(), it simply reads lock.now_serving into its cache, atomically increments it, and then flushes it back to memory. This requires 1 PrWr/BusRdX (Invalid to Modified), followed by 1 BusWB (Modified to Shared / Invalid). While doing the BusRdX, it causes all other processors to invalidate their lock.now_serving (Shared to Invalid). This is O(P) in terms of total instructions ACROSS processors, but to the bus, it's still just 1 BusRdX and 1 BusWB: O(1). To each competing processor, it's still just an invalidation: O(1).

If we're talking about latency, we're concerned about the delay for ANY ONE processor / bus. As seen, everything is O(1). Where does O(P) come from?

Similarly confused for the Test Test-And-Set system.

amagibaba

Sorry, in my lock() paragraph above, the processor increasing lock.next_ticket also has to flush lock.next_ticket back out to memory, so it has to also do a BusWB. But that makes it identical (in terms of transactions) to the unlock() scenario. Still O(1).

gsamp

@shaan0924, it is more fair in the sense that every processor will be at the front of the line at some point, and thus will be given priority to progress. In the previous implementations, since all processors were competing on equal footing for the lock, the same processor could acquire the lock repeatedly, starving other processors. This may or may not be enough to be considered "fair", depending on the definition of fairness or the fairness requirements of your application. If for some reason some processors should be given more or less priority, for example, this implementation would not suffice.
