This page contains a step by step explanation of a simple but fast C++ implementation of the segmented sieve of Eratosthenes that generates the primes below n using operations and space. It runs significantly faster than a traditional sieve of Eratosthenes implementation due to its more efficient CPU cache usage i.e. it uses the CPU's L1 data cache size as its sieve array size. This ensures that less than 1 percent of the memory accesses will be cache misses. This implementation generates the primes below 10^9 in just 0.8 seconds (single-threaded) on an Intel Core i7-6700 3.4 GHz CPU.
We set the size of the sieve array named segment_size to be the same size as the CPU's L1 data cache size per core e.g. 32768 bytes. The segment size must not be smaller than the square root of the sieving limit otherwise the run-time complexity of the algorithm deteriorates.
We sieve one segment in each iteration of the outer sieving loop. At the beginning of each iteration we reset the sieve array and set the lower and upper bounds of the current segment. Hence the current segment corresponds to the interval [low, high].
Before we can start sieving the current segment [low, high] we need to generate the sieving primes ≤ sqrt(high). We do this using a simple sieve of Eratosthenes implementation. When we find a new sieving prime we store it in the primes array and we also store the index of the first multiple that needs to be crossed off (in the segmented sieve) in the multiples array.
After having generated the sieving primes we can proceed with the actual prime sieving. We start crossing off the multiples of the first sieving prime 3 until we reach a multiple of 3 ≥ segment_size, when this happens we calculate the index of that multiple in the next segment using (multiple - segment_size) and store it in the multiples array. We then cross off the multiples of the other sieving primes using the same procedure.
Once we have crossed off the multiples of all sieving primes in the current segment we iterate over the sieve array and count (or print) the primes.
Download segmented_sieve.cpp, open a terminal and run:
The segmented sieve of Eratosthenes implementation presented before is both simple and fast. But what if we wanted to further optimize it so that it runs even faster? Well the next step is to change our implementation so that it uses a bit array instead of a byte array for sieving.
The idea is to map numbers to bits in the sieve array e.g. we define that each bit in the sieve array corresponds to an odd number. Hence bit0 = 1, bit1 = 3, bit2 = 5, bit3 = 7 and so forth. This also means that each byte of the sieve array corresponds to a interval, as each byte has 8 bits byte0 = [0, 15], byte1 = [16, 31], byte2 = [32, 47] and so forth.
This loop is very similar to the segmented byte sieving loop. But here when we cross off a multiple we unset the corresponding bit instead of setting the corresponding byte to false. We divide the multiple by 16 (right shift by 4) to find the corresponding byte and we use the unset_bit lookup table to mask off the bit corresponding to the multiple in the sieve array.
Each 1 bit that remains in our sieve array after the sieving step corresponds to a prime. Instead of iterating over each bit and checking whether it is set or not we will use a precomputed popcnt lookup table that tells us how many bits are set in each byte, this is much faster.
We also need to unset the bits in the sieve array that correspond to numbers > sieving limit. These are the most significant bits in the last byte of the sieve array.
Download segmented_bit_sieve.cpp, open a terminal and run:
The performance of the segmented sieve of Eratosthenes mainly depends on CPU caching. If the sieve array fits into the CPU's L1 or L2 cache sieving will be extremely fast, if however the sieve array is larger than the CPU's cache size the sieving performance will deteriorate by up to 2 orders of magnitude. As the size of the sieve array is set to the square root of the sieving limit the sieve array will not fit into the CPU's cache if the sieving limit is sufficiently large e.g. 10^10.
By using a bit array for sieving the sieve array uses less memory and hence it will fit into the CPU's caches even for larger sieving limits. Actually the bit sieving array will fit into the CPU's L1 cache up to ~ 10^11.4 and into the CPU's L2 cache up to ~ 10^13.8.
Segmented byte sieve
Intel Core i7-6700, 3.4GHz, 32 KiB L1 cache
Segmented bit sieve
Intel Core i7-6700, 3.4GHz, 32 KiB L1 cache