#### **Transactional Memory**



Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

#### Moore's Law



#### Moore's Law (in practice)





#### Nearly Extinct: the Uniprocessor





#### Endangered: The Shared Memory Multiprocessor (SMP)





#### The New Boss: The Multicore Processor (CMP)

# All on the same chip



Sun T2000 Niagara



#### **Traditional Scaling Process**



#### **Ideal Scaling Process**









#### Speedup

$$\text{Speedup} = \frac{\text{1-thread execution time}}{n\text{-thread execution time}}$$

$$\text{Speedup} = \frac{1}{1 - p + \dfrac{p}{n}}$$

where $p$ is the parallel fraction and $n$ is the number of processors.









# Bad synchronization ruins everything

Amdahl's Law

You buy a 10-core machine ...

Your application is:

60% concurrent

40% sequential





$$\text{Speedup} = \frac{1}{1 - 0.6 + \dfrac{0.6}{10}} \approx 2.17$$



You buy a 10-core machine ...

Your application is:

80% concurrent

20% sequential





How close to a 10-fold speedup?

$$\text{Speedup} = \frac{1}{1 - 0.8 + \dfrac{0.8}{10}} \approx 3.57$$



You buy a 10-core machine ...

Your application is:

90% concurrent

10% sequential



$$\text{Speedup} = \frac{1}{1 - 0.9 + \dfrac{0.9}{10}} \approx 5.26$$



You buy a 10-core machine ...

Your application is:

99% concurrent

1% sequential

$$\text{Speedup} = \frac{1}{1 - 0.99 + \dfrac{0.99}{10}} \approx 9.17$$





This course is about the parts that are hard to make concurrent ... but still have a big influence on speedup!
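The 10-core examples above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `amdahl_speedup` is a hypothetical helper, not something from the slides.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's Law: speedup on n processors when a fraction p is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


# The four 10-core examples: 60%, 80%, 90% and 99% concurrent.
for p in (0.6, 0.8, 0.9, 0.99):
    print(f"{p:.0%} concurrent: speedup = {amdahl_speedup(p, 10):.2f}")
```

Even at 99% concurrent, the remaining 1% of sequential work keeps the speedup near 9 rather than 10.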





#### **Coarse-Grained Locking**




#### **Fine-Grained Locking**





#### Locks are not Robust



#### Locking Relies on Conventions

Relation between ...

Lock data and object data

Exists only in programmer's mind

#### Actual comment from Linux Kernel

(hat tip: Bradley Kuszmaul)

    /*
     * When a locked buffer is visible to the I/O layer
     * BH_Launder is set. This means before unlocking
     * we must clear BH_Launder,mb() on alpha and then
     * clear BH_Lock, so no reader can see BH_Launder set
     * on an unlocked buffer and then risk to deadlock.
     */



#### Simple Problems are hard



# Locks Not Composable










# Monitor Wait and Signal






#### Wait and Signal do not Compose



# The Transactional Manifesto

Much modern programming practice inadequate for multicore world



Agenda

Replace locking with a transactional API

**Design languages and libraries** 




## Road Map

**Transactional Memory** 

Hardware Transactional Memory

Hybrid Transactional Memory

**Software Transactional Memory** 

Research Questions



## Road Map

**Transactional Memory** 

Hardware Transactional Memory

Hybrid Transactional Memory

Software Transactional Memory

Research Questions



### Transactions

Block of code ...

- Atomic: appears to happen instantaneously
- Serializable: all appear to happen in one-at-a-time order
- Commit: takes effect (atomically)
- Abort: has no effect (typically restarted)



### **Atomic Blocks**

```
// Each atomic block appears to take effect all at once.
atomic {
  x.remove(3);
  y.add(3);
}
atomic {
  y = null;
}
```















#### A Double-Ended Queue

```
public void LeftEnq(item x) {
  atomic {                     // the whole update commits or aborts as one step
    Qnode q = new Qnode(x);
    q.left = left;
    left.right = q;
    left = q;
  }
}
```






#### **Enclose in atomic block**



# Warning







### Composition?














### **Conditional Waiting**





#### **Composable Conditional Waiting**





# Road Map

**Transactional Memory** 

Hardware Transactional Memory

Hybrid Transactional Memory

Software Transactional Memory

Research Questions





### Standard Cache Coherence




























### Modify Cached Data









### Invalidate
































### **Transaction Commit**

At Commit point ...

No cache conflicts? We win.

Mark transactional cache entries ...

Was: read-only, Now: valid

Was: modified, Now: dirty (will be written back)

That's (almost) everything!
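The commit step can be pictured as retagging cache entries. The following toy model is only a sketch: the `State` enum, the `commit` function, and the `conflict_seen` flag are hypothetical names, and real hardware does this inside the cache controller, not in software.

```python
from enum import Enum


class State(Enum):
    TX_READ_ONLY = "read-only"  # read inside the transaction
    TX_MODIFIED = "modified"    # written inside the transaction
    VALID = "valid"
    DIRTY = "dirty"             # will be written back to memory


def commit(cache: dict, conflict_seen: bool) -> bool:
    """Retag transactional entries if no cache conflict was observed."""
    if conflict_seen:
        return False  # abort handling is outside this sketch
    for addr, state in cache.items():
        if state is State.TX_READ_ONLY:
            cache[addr] = State.VALID
        elif state is State.TX_MODIFIED:
            cache[addr] = State.DIRTY
    return True


cache = {"x": State.TX_READ_ONLY, "y": State.TX_MODIFIED}
committed = commit(cache, conflict_seen=False)
```

No data moves at commit time in this model; only the tags change, which is why commit is cheap.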



# Road Map

**Transactional Memory** 

Hardware Transactional Memory

Hybrid Transactional Memory

Software Transactional Memory

Research Questions



#### Hardware Transactional Memory (HTM)

IBM's Blue Gene/Q & System Z & Power8

Intel's Haswell TSX extensions












If you see this, you are inside a transaction









you could retry the transaction, or take an alternative path



























#### Too Slow





#### Just Not in the Mood



### Hybrid Transactional Memory





# **Non-Speculative Fallback**

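```
// RTM intrinsics (_xbegin, _xend, _xabort) come from <immintrin.h>
if (_xbegin() == _XBEGIN_STARTED) {
   read lock state                   // puts the lock in the read set
   if (lock taken) _xabort(0xff);    // _xabort needs an 8-bit abort code
   work;
   _xend();                          // commit
} else {
   // transaction aborted: non-speculative fallback
   lock->lock();
   work;
   lock->unlock();
}
```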
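The control flow of the fallback pattern can be emulated in Python, which has no hardware transactions. In this sketch `try_speculative` is a hypothetical stand-in for a hardware transaction: it either runs the work and reports a commit, or reports an abort without running it.

```python
import threading


def run_elided(work, try_speculative, lock: threading.Lock) -> str:
    """Try `work` speculatively once; fall back to the lock on abort."""
    # Check the lock first, as the transaction does by reading the lock state.
    if not lock.locked() and try_speculative(work):
        return "speculative"
    with lock:  # non-speculative fallback path
        work()
    return "locked"


def always_commit(work) -> bool:
    work()  # stand-in: the speculative run succeeds
    return True


def always_abort(work) -> bool:
    return False  # stand-in: the speculative run has no effect


counter = {"n": 0}


def work():
    counter["n"] += 1


lock = threading.Lock()
first_path = run_elided(work, always_commit, lock)
second_path = run_elided(work, always_abort, lock)
```

Either way the work runs exactly once, which is the property the fallback path has to preserve.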












`<HLE acquire prefix> lock();`

`do work;`

`<HLE release prefix> unlock()`











# **Conventional Locks**


























# Removing a Node



(Figure: linked list a → b → c → d.)



















#### no locks acquired







### How Far to Teleport?

Too short?

**Missed opportunity** 

Too far?

Transaction aborts, work lost



### Adaptive Teleportation

**On Success:** 

limit = limit + 1

**On Failure:** 

limit = limit / 2


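```
Node* teleport(Node* start, T v) {
  int retries = RETRY_THRESHOLD;
  while (--retries) {
    int distance = 0;
    if (_xbegin() == _XBEGIN_STARTED) {
      // traverse up to teleportLimit nodes
      // move lock
      _xend();
      teleportLimit++;                     // on success: limit = limit + 1
      return pred;
    } else {
      teleportLimit = teleportLimit / 2;   // on failure: limit = limit / 2
    }
  }
  // retries exhausted; the fallback path is not shown on this slide
}
```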
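The limit update is additive increase on success and halving on failure. A minimal sketch of just that rule follows; `adapt_limit` is a hypothetical helper, and the floor of 1 is an assumption added here so the limit never reaches zero.

```python
def adapt_limit(limit: int, committed: bool) -> int:
    """Add 1 after a commit, halve after an abort (assumed floor of 1)."""
    return limit + 1 if committed else max(1, limit // 2)


limit = 8
history = []
for committed in (True, True, False, True, False, False):
    limit = adapt_limit(limit, committed)
    history.append(limit)
# history is [9, 10, 5, 6, 3, 1]
```

Growing slowly and shrinking quickly keeps the teleport distance close to what the hardware can currently commit.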













































# Lock-Based STMs

**STMs come in different forms:** 







# Lock-Based STM

But, didn't you just say that locks are evil?

For applications, yes!

For run-time systems written by experts, maybe not ....



# Lock-Based STMs

Each transaction keeps

**Read Set: locations and values read** 

Write Set: locations and values written

**Changes installed at commit** 

**Conflicts detected at commit**
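A toy version of this bookkeeping might look as follows. It is a sketch under stated assumptions, not the design of any particular STM: the `Transaction` and `Conflict` names are hypothetical, a single global commit lock stands in for per-location locks, and validation compares values rather than version numbers.

```python
import threading


class Conflict(Exception):
    """Raised when commit-time validation fails."""


class Transaction:
    _commit_lock = threading.Lock()  # assumption: one global commit lock

    def __init__(self, memory: dict):
        self.memory = memory
        self.read_set = {}   # location -> value first read
        self.write_set = {}  # location -> value to install at commit

    def read(self, loc):
        if loc in self.write_set:  # see our own pending write
            return self.write_set[loc]
        value = self.memory[loc]
        self.read_set.setdefault(loc, value)
        return value

    def write(self, loc, value):
        self.write_set[loc] = value  # shared memory is untouched until commit

    def commit(self):
        with Transaction._commit_lock:
            # Conflict detection: every value read must still be current.
            for loc, seen in self.read_set.items():
                if self.memory[loc] != seen:
                    raise Conflict(loc)
            self.memory.update(self.write_set)  # install changes


memory = {"x": 1, "y": 0}
t1, t2 = Transaction(memory), Transaction(memory)
t1.write("y", t1.read("x") + 1)
t2.write("x", t2.read("x") + 10)
t2.commit()  # x becomes 11
try:
    t1.commit()  # t1 read x == 1, which is now stale
    t1_committed = True
except Conflict:
    t1_committed = False
```

Value comparison can miss a location that changed and then changed back, which is one reason to validate against version numbers instead.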






















































# Version Clock



(Figure: a transaction and objects a to e labeled with version stamps such as 11:00, 10:30, and 09:00.)

Version numbers are not really timestamps, but it is useful to pretend.

### Transactions
















## Road Map

**Transactional Memory** 

Hardware Transactional Memory

Hybrid Transactional Memory

Software Transactional Memory

Research Questions



## **TM Design Issues**

- Implementation choices
- Language design issues










## Granularity

- Object
  - managed languages, Java, C#, ...
  - Easy to control interactions between transactional & non-transactional threads
- Word
  - C, C++, ...
  - Hard to control interactions between transactional & non-transactional threads



## Direct/Deferred Update

- Deferred
  - modify private copies & install on commit
  - Commit requires work
  - Consistency easier
- Direct
  - Modify in place, roll back on abort
  - Makes commit efficient
  - Consistency harder
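The two update policies can be contrasted with a small sketch. The class names `DeferredTx` and `DirectTx` are hypothetical, and conflict detection is left out entirely so that only the update mechanics show.

```python
class DeferredTx:
    """Deferred update: buffer writes privately, install them on commit."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.buffer = {}

    def write(self, loc, value):
        self.buffer[loc] = value

    def commit(self):
        self.memory.update(self.buffer)  # commit does the work

    def abort(self):
        self.buffer.clear()  # nothing to undo


class DirectTx:
    """Direct update: write in place, keep an undo log for abort."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.undo_log = []

    def write(self, loc, value):
        self.undo_log.append((loc, self.memory[loc]))
        self.memory[loc] = value

    def commit(self):
        self.undo_log.clear()  # commit is cheap

    def abort(self):
        for loc, old in reversed(self.undo_log):
            self.memory[loc] = old  # roll back in reverse order
        self.undo_log.clear()


mem = {"a": 1}
deferred = DeferredTx(mem)
deferred.write("a", 2)
seen_before_commit = mem["a"]  # still 1: the write is private
deferred.commit()

direct = DirectTx(mem)
direct.write("a", 3)
direct.write("a", 4)
direct.abort()  # restores a == 2
```

With direct update, other threads can observe uncommitted values in `mem` until the rollback finishes, which is why consistency is harder there.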



## **Conflict Detection**

- Eager
  - Detect before conflict arises
  - "Contention manager" module resolves
- Lazy
  - Detect on commit/abort
- Mixed
  - Eager write/write, lazy read/write ...



## **Conflict Detection**

- Eager detection may abort transactions that could have committed.
- Lazy detection discards more computation.



# Contention Management & Scheduling

- How to resolve conflicts?
- Who moves forward and who rolls back?
- Lots of empirical work but formal work in infancy





## **Contention Manager Strategies**

- Exponential backoff
- Priority to
  - Oldest?
  - Most work?
  - Non-waiting?
- None dominates
- But needed anyway
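As an illustration of the first strategy, a randomized exponential backoff delay could be computed as below. The helper name `backoff_delay` and the default `base` and `cap` values are assumptions chosen for the sketch, not values from the slides.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1e-6, cap: float = 1e-3,
                  rng: Optional[random.Random] = None) -> float:
    """Random delay in seconds; the upper bound doubles per failed attempt."""
    if rng is None:
        rng = random.Random()
    limit = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, limit)


rng = random.Random(0)  # seeded only to make the demo repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(5)]
```

A transaction that loses a conflict would sleep for the returned delay before retrying; the randomization keeps two losers from colliding again in lockstep.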



Judgment of Solomon



## I/O & System Calls?

- Some I/O revocable
  - Provide transaction-safe libraries
  - Undoable file system/DB calls
- Some not
  - Opening cash drawer
  - Firing missile





## I/O & System Calls

- One solution: make transaction irrevocable
  - If transaction tries I/O, switch to irrevocable mode.
- There can be only one ...
  - Requires serial execution
- No explicit aborts
  - In irrevocable transactions





## Exceptions



    int i = 0;
    try {
      atomic {
        i++;
        node = new Node();  // allocation may throw
      }
    } catch (Exception e) {
      print(i);
    }











## **Unhandled Exceptions**

- Aborts transaction
  - Preserves invariants
  - Safer
- Commits transaction
  - Like locking semantics



- What if the exception object refers to values modified in the transaction?
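The two policies give different answers for the `print(i)` example. The sketch below emulates both with a context manager; `atomic` here is a hypothetical helper that snapshots a dictionary, not a real transactional construct, and the `RuntimeError` stands in for a failing allocation.

```python
from contextlib import contextmanager


@contextmanager
def atomic(state: dict, abort_on_exception: bool):
    snapshot = dict(state)  # saved so an abort can roll back
    try:
        yield state
    except Exception:
        if abort_on_exception:
            state.clear()
            state.update(snapshot)  # abort: undo the block's effects
        raise  # either way, the exception propagates


def run(abort_on_exception: bool) -> int:
    state = {"i": 0}
    try:
        with atomic(state, abort_on_exception):
            state["i"] += 1
            raise RuntimeError("allocation failed")  # stands in for new Node()
    except Exception:
        return state["i"]  # the value print(i) would show


printed_if_abort = run(abort_on_exception=True)    # 0
printed_if_commit = run(abort_on_exception=False)  # 1
```

Under abort semantics the increment is rolled back, so the handler sees the old value; under commit semantics it sees the incremented one, as it would with a lock.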



## **Nested Transactions**

    atomic void foo() {
      bar();    // bar's transaction is nested inside foo's
    }

    atomic void bar() {
      ...
    }





## **Nested Transactions**

- Needed for modularity
  - Who knew that cosine() contained a transaction?
- Flat nesting
  - If child aborts, so does parent
- First-class nesting
  - If child aborts, partial rollback of child only
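The difference between the two nesting models can be sketched with an undo log and checkpoints. This is an illustration under assumptions: the `NestedTx` class and its method names are hypothetical, and it models only rollback, not conflict detection.

```python
class NestedTx:
    """Direct-update transaction with an undo log and child checkpoints."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.undo_log = []     # (location, old value) pairs
        self.checkpoints = []  # undo-log length at each child begin

    def write(self, loc, value):
        self.undo_log.append((loc, self.memory[loc]))
        self.memory[loc] = value

    def begin_child(self):
        self.checkpoints.append(len(self.undo_log))

    def abort_child(self, flat: bool):
        mark = self.checkpoints.pop()
        if flat:
            mark = 0  # flat nesting: the parent aborts too
            self.checkpoints.clear()
        while len(self.undo_log) > mark:
            loc, old = self.undo_log.pop()
            self.memory[loc] = old


def demo(flat: bool) -> dict:
    memory = {"parent": 0, "child": 0}
    tx = NestedTx(memory)
    tx.write("parent", 1)
    tx.begin_child()
    tx.write("child", 1)
    tx.abort_child(flat)
    return memory


flat_result = demo(flat=True)      # both writes undone
partial_result = demo(flat=False)  # only the child's write undone
```

With partial rollback the parent keeps its own work and can retry or skip the child.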



# Locks

- *Hardware Transactions and Locks, Together at Last* (Computer Science Department, Brown University; authors include Maurice Herlihy)
- *Using Hardware Transactional Memory to Correct and Simplify a Readers-Writer Lock Algorithm* (Yossi Lev, Mark Moir, Dave Dice, Victor Luchangco)

### Locks and transactions complement one another

# Memory

(Paper excerpt on memory reclamation with transactional memory; text not recoverable.)



- *Exploiting Hardware Transactional Memory in Main-Memory Databases*
- *Using Restricted Transactional Memory to Build a Scalable In-Memory Database*
- *Improving In-Memory Database Index Performance with Intel Transactional Synchronization Extensions*

### TM restructures in-memory databases



#### GPUs, etc.

- *Hardware Transactional Memory for GPU Architectures* (Wilson W. L. Fung, Inderpreet Singh, Andrew Brownsword, Tor M. Aamodt; University of British Columbia)
- *Software Transactional Memory for GPU Architectures* (Yunlong Xu, Rui Wang, Nilanjan Goswami, Tao Li, et al.)
- *Improvements in Hardware Transactional Memory for GPU Architectures* (Alejandro Villegas, Angeles Navarro, Rafael Asenjo, Oscar Plata; University of Malaga)

Challenges highlighted in the excerpts:

- Scalable conflict detection: 1000s of concurrent transactions; is a 1000 x 1000 parallel address-set comparison too expensive?
- Control flow divergence: transaction aborts may cause a warp to diverge

### GPUs and accelerators need synchronization



- *Hardware Support for Local Memory Transactions on GPU Architectures*
- *OS Support for Virtualizing Hardware Transactional Memory*

### TM can simplify operating system kernels, device drivers, security ...

## **Data Structures**







## Gartner Hype Cycle





## Transactions are Here to Stay

### Transactional Language Constructs for C++

| Authors:         | Hans Boehm, HP, hans.boehm@hp.com                         |
|------------------|-----------------------------------------------------------|
|                  | Justin Gottschlich, Intel, justin.e.gottschlich@intel.com |
|                  | Victor Luchangco, Oracle, victor.luchangco@oracle.com     |
|                  | Maged Michael, IBM, maged.michael@acm.org                 |
|                  | Mark Moir, Oracle, mark.moir@oracle.com                   |
|                  | Clark Nelson, Intel, clark.nelson@intel.com               |
|                  | Torvald Riegel, Red Hat, triegel@redhat.com               |
|                  | Tatiana Shpeisman, Intel, tatiana.shpeisman@intel.com     |
|                  | Michael Wong, IBM, michaelw@ca.ibm.com                    |
| Document number: | N3341=12-0031                                             |
| Date:            | 2012-01-11                                                |
| Project:         | Programming Language C++, Evolution Working Group         |
| Reply-to:        | Michael Wong, IBM, michaelw@ca.ibm.com                    |



#### Introduction


> Intel<sup>®</sup> Architecture Instruction Set Extensions Programming Reference



## Thank you!






