Skip to main content

Go Data Race Pop Quiz Analyzed

· 5 min read

Context

I recently wrote an article on data races. Today, I came across Dave Cheney's post, Wednesday pop quiz: spot the race, and I wanted to apply what I learned to analyze the example given.

The Program with a Data Race

package main

import (
"fmt"
"time"
)

type RPC struct {
result int
done chan struct{}
}

func (rpc *RPC) compute() {
time.Sleep(time.Second) // strenuous computation intensifies
rpc.result = 42
close(rpc.done)
}

func (RPC) version() int {
return 1 // never going to need to change this
}

func main() {
rpc := &RPC{done: make(chan struct{})}

go rpc.compute() // kick off computation in the background
version := rpc.version() // grab some other information while we're waiting
<-rpc.done // wait for computation to finish
result := rpc.result

fmt.Printf("RPC computation complete, result: %d, version: %d\n", result, version)
}

Investigation

Since the program is short, we can quickly identify that the suspicious part is the modification of the result field and the invocation of the version method. Let's dive deeper.

Write Operation

The compute method waits for a second before writing to the result field. It then closes the done channel to signal completion.

Read Operation

The statement result := rpc.result is a read operation, fetching the result value from the struct.

The version Method

The version method is interesting because it does not read or write to the struct. However, since it is a method with a value receiver, a copy of the struct is created when the method is called—this is crucial.

Goroutine and Channel

The go rpc.compute() statement launches the compute function in a separate goroutine, while <-rpc.done ensures the main thread waits until the done channel is closed.

Quick Answer Revealed

The data race occurs due to a conflict between the write operation (rpc.result = 42 in compute) and the read operation (struct copying when calling version). These two operations run in different goroutines and may interfere with each other.

  • The solution is simple: func (*RPC) version() int { (turn it into a pointer receiver).

The reason why result := rpc.result does not cause a conflict is that it occurs only after <-rpc.done, ensuring that the write in compute completes before the read in the main goroutine.

We can confirm the conflicting data by inspecting the logs.

Running go test data-race/race_test.go -race in my repo produces the following output:

==================
WARNING: DATA RACE
Write at 0x00c000028250 by goroutine 7:
command-line-arguments.(*RPC).compute()
/Users/yong/Documents/GitHub/learn-go/data-race/race_test.go:16 +0x44
...

Previous read at 0x00c000028250 by goroutine 6:
command-line-arguments.TestRace()
/Users/yong/Documents/GitHub/learn-go/data-race/race_test.go:28 +0x120
...

Goroutine 7 (running) created at:
command-line-arguments.TestRace()
/Users/yong/Documents/GitHub/learn-go/data-race/race_test.go:27 +0x114
...

Goroutine 6 (running) created at:
...
main.main()
_testmain.go:45 +0x110
==================
RPC computation complete, result: 42, version: 1
--- FAIL: TestRace (1.00s)
testing.go:1490: race detected during execution of test

As explained in my previous article, we can now interpret this log easily:

  • There is a write-read conflict.
  • The write occurs in compute(), modifying result.
  • The read occurs in race_test.go:28, the line invoking version().

What does GPT say?

I tested GPT by asking it why the data race occurred. It incorrectly pointed to result := rpc.result as the issue and suggested adding a mutex:

package race

import (
"fmt"
"sync"
"testing"
"time"
)

type RPCWithMutex struct {
result int
done chan struct{}
mu sync.Mutex
}

func (rpc *RPCWithMutex) compute() {
time.Sleep(time.Second) // strenuous computation intensifies
rpc.result = 42
close(rpc.done)
}

func (RPCWithMutex) version() int {
return 1 // never going to need to change this
}

func TestRaceRPCWithMutex(t *testing.T) {
rpc := &RPCWithMutex{done: make(chan struct{})}

go rpc.compute() // kick off computation in the background
version := rpc.version() // grab some other information while we're waiting
<-rpc.done // wait for computation to finish

rpc.mu.Lock()
result := rpc.result
rpc.mu.Unlock()

fmt.Printf("RPC computation complete, result: %d, version: %d\n", result, version)
}

While this approach is incorrect (it still results in the same race condition), an IDE may now issue a warning:

version passes lock by value: github.com/tlylt/learn-go/data-race.RPCWithMutex contains sync.Mutexcopylocksdefault

This warning suggests that version creates a copy of the struct, which includes the mutex—a bad practice.

Without Waiting on Done

If we want to trigger a data race between compute and result := rpc.result, we can remove the wait, allowing the read and write operations to potentially conflict:

package race

import (
"fmt"
"testing"
"time"
)

type RPCNoWait struct {
result int
done chan struct{}
}

func (rpc *RPCNoWait) compute() {
time.Sleep(time.Second) // strenuous computation intensifies
rpc.result = 42
close(rpc.done)
}

func (RPCNoWait) version() int {
return 1 // never going to need to change this
}

func TestRaceRPCNoWait(t *testing.T) {
rpc := &RPCNoWait{done: make(chan struct{})}

go rpc.compute() // kick off computation in the background
// version := rpc.version() // grab some other information while we're waiting
// <-rpc.done // wait for computation to finish

result := rpc.result

fmt.Printf("RPC computation complete, result: %d\n", result)
}

This produces a different log indicating a read-write conflict.

RPC computation complete, result: 0
PASS
==================
WARNING: DATA RACE
Write at 0x00c000118210 by goroutine 7:
command-line-arguments.(*RPCNoWait).compute()
/Users/yong/Documents/GitHub/learn-go/data-race/race3_test.go:16 +0x44
...

Previous read at 0x00c000118210 by goroutine 6:
command-line-arguments.TestRaceRPCNoWait()
/Users/yong/Documents/GitHub/learn-go/data-race/race3_test.go:31 +0x114
...
Goroutine 7 (running) created at:
command-line-arguments.TestRaceRPCNoWait()
/Users/yong/Documents/GitHub/learn-go/data-race/race3_test.go:27 +0x10c
...
Goroutine 6 (finished) created at:
testing.(*T).Run()
...
main.main()
_testmain.go:45 +0x110
==================
Found 1 data race(s)

This confirms that removing <-rpc.done allows a race condition to occur between compute and the direct read of rpc.result.

References

Code Isn't Working - Checklist

· One min read

The context: you've made a change, but the code isn't behaving as expected. Perhaps you fixed a bug, yet the buggy behavior persists.

I tend to go through this list in order, but it doesn't have to be strictly followed. For example, taking a break (the last item) is often the most effective first step.

  • Review code changes: are there any syntax errors, parameter differences, or test setup issues?
  • Examine logs: are any indicative logs missing or unexpected?
  • Rerun impacted tests on the same or another machine: could this be a machine-dependent issue?
  • Compare logs from successful and failed runs: what differences stand out?
  • Check surrounding code: are any preconditions broken? Does execution order matter?
  • Take a break, sleep on it, or discuss it.

That's it.

Go WARNING DATA RACE despite atomic operations

· 6 min read

The sync/atomic Package

The sync/atomic package provides utilities for atomic operations, such as integer addition. It is sometimes used to guard mutable states when accessed by multiple goroutines. However, atomic operations, when used without care, do not prevent data races. It can be detected using the -race flag when running tests. Let's illustrate this with an example.

Example: The Book Struct

package book

type Book struct {
Title string
CurrentPage int32
}

func newBook(title string) *Book {
return &Book{Title: title}
}

// OnPage returns the current page of the book
func (b *Book) OnPage() int32 {
return b.CurrentPage
}

// TurnPage will increment the CurrentPage by 1
func (b *Book) TurnPage() {
b.CurrentPage++
}

// SkipFirstPage will increment the CurrentPage by 2 if the CurrentPage is 0
func (b *Book) SkipFirstPage() {
if b.CurrentPage == 0 {
b.CurrentPage+=2
}
}

The Book struct contains CurrentPage, which tracks the current page. The OnPage method reads the state, while TurnPage and SkipFirstPage modify it.

When accessed concurrently by multiple goroutines, a data race can occur.

Example: Failing Test

import (
"fmt"
"sync"
"testing"
)

func TestRaceConditionBookNoAtomicNoLock(t *testing.T) {
book := newBook("Atomic Habits")

wg := sync.WaitGroup{}
wg.Add(1)

go func() {
defer wg.Done()
book.TurnPage()
}()

fmt.Printf("Current page: %d\n", book.OnPage())
wg.Wait()

if book.OnPage() != 1 {
t.Error("Expected page to be 1, Actual page is ", book.OnPage())
}
}

Running the test with go test -race will produce a race condition warning:

--- FAIL: TestRaceConditionBookNoAtomicNoLock (0.00s)
testing.go:1490: race detected during execution of test
FAIL
exit status 1

Understanding the Data Race Warning

Let's understand the key parts of a data race warning:

  • WARNING: DATA RACE
    • There is a conflicting read-write operation (can also be write-read)
  • Conflicting read/write operations:
    • Write at 0x... by goroutine X
    • Previous read at 0x... by goroutine Y
      • 0x... is the troubled memory location
      • Location of the troubled code is typically revealed here
  • Stack traces of both goroutines
  • Name of the failed test due to data race

The full log:

==================
WARNING: DATA RACE
Write at 0x00c000010670 by goroutine 8:
github.com/tlylt/learn-go/book.(*Book).TurnPage()
/Users/yong/Documents/GitHub/learn-go/book/book.go:37 +0x98
github.com/tlylt/learn-go/book.TestRaceConditionBookNoAtomicNoLock.func1()
/Users/yong/Documents/GitHub/learn-go/book/book_test.go:27 +0x80

Previous read at 0x00c000010670 by goroutine 7:
github.com/tlylt/learn-go/book.(*Book).OnPage()
/Users/yong/Documents/GitHub/learn-go/book/book.go:21 +0x17c
github.com/tlylt/learn-go/book.TestRaceConditionBookNoAtomicNoLock()
/Users/yong/Documents/GitHub/learn-go/book/book_test.go:30 +0x188
testing.tRunner()
/usr/local/go/src/testing/testing.go:1792 +0x180
testing.(*T).Run.gowrap1()
/usr/local/go/src/testing/testing.go:1851 +0x40

Goroutine 8 (running) created at:
github.com/tlylt/learn-go/book.TestRaceConditionBookNoAtomicNoLock()
/Users/yong/Documents/GitHub/learn-go/book/book_test.go:25 +0x16c
testing.tRunner()
/usr/local/go/src/testing/testing.go:1792 +0x180
testing.(*T).Run.gowrap1()
/usr/local/go/src/testing/testing.go:1851 +0x40

Goroutine 7 (running) created at:
testing.(*T).Run()
/usr/local/go/src/testing/testing.go:1851 +0x684
testing.runTests.func1()
/usr/local/go/src/testing/testing.go:2279 +0x7c
testing.tRunner()
/usr/local/go/src/testing/testing.go:1792 +0x180
testing.runTests()
/usr/local/go/src/testing/testing.go:2277 +0x77c
testing.(*M).Run()
/usr/local/go/src/testing/testing.go:2142 +0xb68
main.main()
_testmain.go:51 +0x110
==================
--- FAIL: TestRaceConditionBookNoAtomicNoLock (0.00s)
testing.go:1490: race detected during execution of test
FAIL
exit status 1

Let's interpret the test failure.

The example test is failing because the CurrentPage state at 0x00c000010670 is being concurrently operated by the write from TurnPage() and the read from OnPage().

Fixing the Data Race with Atomic Operations

Atomic operations ensure safe concurrent access:

func (b *Book) OnPageAtomic() int32 {
return atomic.LoadInt32(&b.CurrentPage)
}

func (b *Book) TurnPageAtomic() {
atomic.AddInt32(&b.CurrentPage, 1)
}

Issues with Conditional Operations

When operations conditionally modify a mutable state, they can become a source of errors if atomic operations are not used consistently. For example, consider a function designed to skip the first page only if the book hasn't been read. Here is an incorrect implementation:

func (b *Book) SkipFirstPage() {
if b.CurrentPage == 0 {
atomic.AddInt32(&b.CurrentPage, 2)
}
}

The conditional check on b.CurrentPage is not atomic, leading to a race condition.

The test:

func TestRaceConditionBookWithAtomic(t *testing.T) {
book := newBook("Atomic Habits")

wg := sync.WaitGroup{}
wg.Add(1)

go func() {
defer wg.Done()
book.SkipFirstPage()
fmt.Printf("After SkipFirstPage: page: %d\n", book.OnPageAtomic())
}()

book.TurnPageAtomic()
fmt.Printf("After TurnPageAtomic: page: %d\n", book.OnPageAtomic())
fmt.Println(book.OnPageAtomic())

wg.Wait()
fmt.Printf("End of test: page: %d\n", book.OnPageAtomic())
if book.OnPageAtomic() != 1 {
t.Error("Expected page to be 1, Actual page is ", book.OnPageAtomic())
}
}

A race condition warning will be shown. In summary, a conflict occurred:

  • Read at 0x00c00019a040 by goroutine 11: github.com/tlylt/learn-go/book.(*Book).SkipFirstPage()
  • Previous write at 0x00c00019a040 by goroutine 10: sync/atomic.AddInt32()
  • The SkipFirstPage method has a if b.CurrentPage == 0 { that read the mutable stage without an atomic guard.

To fix it, we can also load it atomically.

func (b *Book) SkipFirstPageAtomic() {
if atomic.LoadInt32(&b.CurrentPage) == 0 {
atomic.AddInt32(&b.CurrentPage, 2)
}
}

Note that while the above does fix the data race, it is not necessarily logically sound. This is because the read and update are done as two atomic operations, meaning the state could still be mutated after the load and right before the addition. If so, the result may become unexpected. To make both operations atomic, a mutex/lock could be used to ensure that no other goroutines can access the state while executing the function.

Flakiness Despite Atomic Operations

Even with atomic operations, test behavior can be unpredictable due to scheduling differences. Adding a delay introduces flakiness:

  • time.Sleep(1 * time.Microsecond) is added
func TestRaceConditionBookWithAtomic(t *testing.T) {
book := newBook("Atomic Habits")

wg := sync.WaitGroup{}
wg.Add(1)

go func() {
defer wg.Done()
book.SkipFirstPage()
fmt.Printf("After SkipFirstPage: page: %d\n", book.OnPageAtomic())
}()

time.Sleep(1 * time.Microsecond)
book.TurnPageAtomic()
fmt.Printf("After TurnPageAtomic: page: %d\n", book.OnPageAtomic())
fmt.Println(book.OnPageAtomic())

wg.Wait()
fmt.Printf("End of test: page: %d\n", book.OnPageAtomic())
if book.OnPageAtomic() != 1 {
t.Error("Expected page to be 1, Actual page is ", book.OnPageAtomic())
}
}

With time.sleep before TurnPageAtomic, it is now difficult to determine which operation (SkipFirstPage or TurnPageAtomic) will execute first. Additionally, since SkipFirstPage will only mutate the page number if it is 0, the result can be either 1 or 3. The test may randomly pass or fail because the execution order is not guaranteed.

Using Locks Instead of Atomic Operations

Locks can also prevent data races:

package book

import (
"sync"
)

type Book struct {
Title string
CurrentPage int32
lock sync.RWMutex
}

func (b *Book) OnPageLock() int32 {
b.lock.RLock()
defer b.lock.RUnlock()
return b.CurrentPage
}

func (b *Book) TurnPageLock() {
b.lock.Lock()
defer b.lock.Unlock()
b.CurrentPage++
}

func (b *Book) SkipFirstPageLock() {
b.lock.Lock()
defer b.lock.Unlock()
if b.CurrentPage == 0 {
b.CurrentPage += 2
}
}

However, using a read lock incorrectly can still introduce data races:

func (b *Book) SkipFirstPageLock() {
b.lock.RLock()
defer b.lock.RUnlock()
if b.CurrentPage == 0 {
b.CurrentPage += 2
}
}

Here, a read lock is used, but the function modifies CurrentPage, leading to a data race.

Summary

  1. Use atomic operations carefully, ensuring all reads and writes are guarded.
  2. Even with atomic operations, test flakiness can occur due to goroutine scheduling.
  3. Be cautious with locks; incorrect usage can still introduce race conditions.

Reference

The code discussed is available in my repository: