parallel processing – Hackaday
https://hackaday.com

Import GPU: Python Programming with CUDA
https://hackaday.com/2025/02/25/import-gpu-python-programming-with-cuda/ (Wed, 26 Feb 2025)

Every few years or so, a development in computing results in a sea change and a need for specialized workers to take advantage of the new technology. Whether that’s COBOL in the 60s and 70s, HTML in the 90s, or SQL in the past decade or so, there’s always something new to learn in the computing world. The introduction of graphics processing units (GPUs) for general-purpose computing is perhaps the most important recent development in the field, and if you want to pick up some new Python skills to take advantage of it, take a look at this introduction to CUDA, the platform that lets developers use Nvidia GPUs for general-purpose computing.

Of course CUDA is a proprietary platform and requires one of Nvidia’s supported graphics cards to run, but assuming that barrier to entry is met, it’s not too much more effort to use it for non-graphics tasks. The guide takes a closer look at the open-source library PyTorch, which allows a Python developer to quickly get up to speed with the features of CUDA that make it so appealing to researchers and developers in artificial intelligence, machine learning, big data, and other frontiers of computer science. The guide describes how threads are created, how they move through the GPU and cooperate with other threads, how memory is managed on both the CPU and the GPU, how CUDA kernels are written, and how everything else involved is handled, largely through the lens of Python.

Getting started with something like this is almost a requirement to stay relevant in the fast-paced realm of computer science, as machine learning has taken center stage in almost everything related to computers these days. It’s worth noting that, strictly speaking, an Nvidia GPU is not required for GPU programming like this; AMD has a GPU computing platform called ROCm, but despite being open source it still trails Nvidia in adoption and arguably in performance as well. Some other learning tools for GPU programming we’ve seen in the past include this puzzle-based tool which illustrates some of the specific problems GPUs excel at.

Turing Pi 2: The Low Power Cluster
https://hackaday.com/2022/06/16/turing-pi-2-the-low-power-cluster/ (Thu, 16 Jun 2022)

We’re not in the habit of recommending Kickstarter projects here at Hackaday, but when prototype hardware shows up on our desk, we just can’t help but play with it and write it up for the readers. And that is exactly where we find ourselves with the Turing Pi 2. You may be familiar with the original Turing Pi, the carrier board that runs seven Raspberry Pi Compute Modules at once. That one supports Compute Module versions 1 and 3, but a new design was clearly needed for the Compute Module 4. Not content with just supporting the CM4, the developers at Turing Machines have designed a 4-slot carrier board based on the NVIDIA Jetson pinout. The entire line of Jetson devices is supported, and a simple adapter makes the CM4 work. There’s even a brand new module planned around the RK3588, which should be quite impressive.

One of the design decisions of the TP2 is to use the mini-ITX form-factor and 24-pin ATX power connection, giving us the option to install the TP2 in a small computer case. There’s even a custom rack-mountable case being planned by the folks over at My Electronics. So if you want 4 or 8 Raspberry Pis in a rack mount, this one’s for you.

The Appeal — And the Risks

“Wait, wait,” I hear you say, “there are plenty of ways to rack-mount Raspberry Pis!” Certainly. The form factor options are handy, but the real magic is the rest of the board. Individually controlled power for all four boards from a single ATX power supply makes for a very clean solution. Need to reboot a hung Pi remotely? There’s a Baseboard Management Controller (BMC) that provides full power control over the network. That’s the real killer feature: the BMC is going to run open-source firmware, and will power some very clever functions. Want UART access to troubleshoot a boot problem? It’s available from all four nodes through the BMC. Need to push a new image to a CM4? The BMC will include image-flashing functions. Built into the board is a Gigabit network switch linking the Pis, the BMC, and two external Ethernet ports, all supporting VLANs.

On the other hand, not much of the BMC wizardry is actually implemented yet on the review units. This is the project’s biggest promise and the place it could go awry. Putting together a stable firmware with all the bells and whistles in the three months before the scheduled ship date may be a bit optimistic. I’m expecting a working firmware at launch, with updates to refine the experience in the months that follow.

Then there’s the expanded IO. The board comes with a pair of Mini PCIe ports, four USB 3.0 ports, and a pair of SATA ports. This works via the PCIe lanes exposed by the various compute modules. Nodes 1 and 2 are connected to the Mini PCIe ports, node 3 to the SATA ports, and node 4 to the USB 3.0 ports. On top of that, a switchable USB 2.0 port can be dynamically assigned to any of the nodes. Oh, and there’s an HDMI output from node 1, which opens up even more options, like running an 8 GB Pi CM4 as a desktop machine. A late option added to the Kickstarter bolts four NVMe ports to the bottom of the board, one per slot, though not every compute module has the PCIe lanes to support them.

Now keep in mind that I’m testing a pre-production unit (more on that later), and not all of the above is actually working yet. Quite a few changes are slated for the production boards versus my unit, and the BMC firmware on this board is absolutely minimal. There are also the supply-chain issues we’ve continued to cover here on Hackaday, but the TP2 has the advantage of being designed during the shortage, so it should be able to avoid hard-to-source parts.

Use-Case

Now let’s talk about what this *doesn’t* do. This may seem obvious, but the Turing Pi 2 doesn’t give you a single ARM machine with 16+ processing cores. There isn’t enough magic onboard to make the devices act like a unified multi-processor computer. I’m not sure there’s enough magic anywhere to really pull that off. However, what you do get is four easily-managed machines that are perfect for running light-weight services or Docker images.

Looking for a platform for learning Docker and Kubernetes? Or a place to host GitLab, Nextcloud, and a file server? Maybe you want to run Nginx as a front-end proxy, with several devices running services behind it? The homelab-in-a-box nature of the TP2 makes it a useful choice for all of the above. And even though you can’t reasonably do all of that on a single Raspberry Pi, a programmable cluster of four of them does the job quite nicely. The VLAN support means that you can add virtual NICs to your nodes and create an internal network. With the two physical Ethernet ports, you could even use your TP2 as your primary router, on top of everything else it can do.

Real-World Testing

So what’s the actual state of the project? I have my pre-production board currently booting a Raspberry Pi CM4, a Pine64 SOQuartz module, an NVIDIA Jetson Nano, and a Jetson TX2 NX. The Jetson Xavier NX had a quirk requiring a minor board modification, but it runs like a champ now that’s done. There are the normal warts of a pre-production board, like extra DIP switches all over the place, and a few quirks, like Ethernet only coming up at 100 Mbit for some devices. These are known issues, and a good example of why you do a test run of rev 0 boards. The final product should have all the kinks worked out.

I’ve been monitoring power draw, and the most I’ve managed to pull is a mere 30 watts. That suggests a real-world use case: an off-grid compute cluster. The Mini PCIe ports should allow for an LTE modem (or you can use Starlink if you’re *way* off grid). Add a couple of cameras and install the Zoneminder Docker images, and you have a low-power video monitoring solution. Add an RTL-SDR dongle and the rtl_433 software listening to a solar-powered weather station, and you can track the weather at your remote location, too. Just for fun, I ran a Janus Docker image on one of the Raspberry Pi CM4s on my TP2. Janus is the WebRTC server we’ve integrated into Zoneminder, and I was able to live stream 12 security cameras at 1080p using only around 25% of the available processor power, or a load of 1 on a four-core Pi. It’s a testament to how lightweight Janus is, but also a great example of something useful you could do with a TP2.

What’s Next

The Kickstarter is over, with better than two million dollars raised, but don’t sweat it, because you will soon be able to purchase a Turing Pi 2. Ordering will be handled through the Turing Pi website itself; stay tuned for the details. It will be a few months until the final revision of the board is finished and shipped, hopefully with some killer firmware and everything working exactly as advertised. Then finally there’s the alluring RK1 compute board, with up to 32 GB of RAM and eight cores of Arm goodness from the RK3588. That’s a little further out, and may be a second Kickstarter campaign. I asked about mainline support for the RK1, and was told that this is a primary goal, but they’re not exactly sure on the timing. There is quite a bit of excitement around this particular chip, so look forward to the community working together to get all the needed bits in place for mainline support.

There may be an unexpected consequence of the Turing Pi 2 and RK1 using the NVIDIA Jetson SO-DIMM connector. Imagine a handheld device built on the Antmicro open-source Jetson baseboard that works with multiple compute modules. I mentioned the Pine64 SOQuartz: that’s not an officially supported board on the TP2, but because Pine64 built it to the CM4 specifications, it clicks right into the adapter card and works like a champ. There’s an interesting possibility that one or two of these compute module interfaces will gain enough critical mass to become widely used in devices. And if anyone wondered, using the TP2 CM4 adapter doesn’t magically allow booting a CM4 in a Jetson Nano carrier board. Yes, we checked.

So is the Turing Pi 2 for you? Maybe. If you don’t mind juggling multiple single-board computers and the mess of cabling required, then maybe not. But if the ability to slot four SBCs into a single mini-ITX case, with a BMC that makes life way easier, sounds like a breath of fresh air, then give it a look. The real test will be when the finished product ships, and what shape the support is in. I’m cautiously optimistic that it won’t be terribly late, and that it will have working open-source firmware. I’m looking forward to getting my hands on the final product. Now if you’ll excuse me, I think I need to go set up an automated system for building aarch64 Docker images.

Parallel Processing Was Never Quite Done Like This
https://hackaday.com/2019/06/29/parallel-processing-was-never-quite-done-like-this/ (Sat, 29 Jun 2019)

Parallel processing is an idea that will be familiar to most readers. Few of you will be reading this on a device with only one processor core, and quite a few of you will have experimented with clusters of Raspberry Pis or similar SBCs. Instead of one processor doing tasks sequentially, the idea goes, take a bunch of processors and hand out the tasks to be done simultaneously.

It’s a fair bet though that few of you will have designed and constructed your own parallel processing architecture. [BB] sends us a link which, though it’s an old one, is interesting enough to bring to you today: [Michael] created a massively parallel array of Parallax Propeller microcontrollers back in 2008, and he did so on a breadboard.

The Parallax Propeller is an 8-core RISC microcontroller from the company that had found success in the 1990s with the BASIC Stamp, the PIC-based board that was all the rage before the Arduino came into the world. In the last decade it was seen as an extremely exciting prospect, but a high price and arcane development tools compared to a new generation of low-cost and easy-to-code competitors meant that it never quite caught on, and it remains today something of an intriguing oddity. So the value in this project lies not in something that you should run out and do yourselves, but instead in what the work tells us about the nuts and bolts of parallel processing architecture. It involves more than simply hooking up a load of chips and hoping for the best, and we gain some insight into the different strategies involved.

The Propeller certainly wasn’t the first attempt at a massively parallel microcontroller, and we doubt it will be the last. We’re certainly seeing microcontrollers with more than one core becoming more mainstream even in our community, but even with those, how many of you have made use of the second core in your dual-core ESP32? Is a multicore microcontroller a solution searching for a problem, or will somebody one day crack it and the world will never be the same again? As always, the comments are below.

CUDA is Like Owning a Supercomputer
https://hackaday.com/2018/03/19/cuda-is-like-owning-a-supercomputer/ (Mon, 19 Mar 2018)

The word supercomputer gets thrown around quite a bit. The original Cray-1, for example, operated at about 150 MIPS and had about eight megabytes of memory. A modern Intel i7 CPU can hit almost 250,000 MIPS and is unlikely to have less than eight gigabytes of memory, and probably has quite a bit more. Sure, MIPS isn’t a great performance number, but clearly, a top-end PC is way more powerful than the old Cray. The problem is, it’s never enough.

Today’s computers have to process huge numbers of pixels, video data, audio data, neural networks, and long encryption keys. Because of this, video cards have become what in the old days would have been called vector processors. That is, they are optimized to do operations on multiple data items in parallel. There are a few standards for using the video card’s processing power for computation, and today I’m going to show you how simple it is to use CUDA, the NVIDIA proprietary library for this task. You can also use OpenCL, which works with many different kinds of hardware, but I’ll show you that it is a bit more verbose.

Dessert First

One of the things that’s great about being an adult is you are allowed to eat dessert first if you want to. In that spirit, I’m going to show you two bits of code that will demonstrate just how simple using CUDA can be. First, here’s a piece of code known as a “kernel” that will run on the GPU.

__global__
void scale(unsigned int n, float *x, float *y)
{
  int i = threadIdx.x;
  x[i]=x[i]*y[i];
}

There are a few things to note:

  • The __global__ tag indicates this function can run on the GPU
  • The setup of the variable "i" gives you the current vector element
  • This example assumes there is one thread block of the right size; if not, the setup for i would be slightly more complicated and you’d need to make sure i < n before doing the calculation (see the sketch just after this list)
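
Just for reference, here is a hedged sketch of that more general version. It isn't from the article; it's just the standard pattern of computing a global index from the block and thread IDs and then checking it against n:

__global__
void scale_general(unsigned int n, float *x, float *y)
{
  // which block we are in, times threads per block, plus our thread number
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)   // a block may contain more threads than there are data items
    x[i] = x[i] * y[i];
}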

So how do you call this kernel? Simple:

scale<<<1,1024>>>(1024,x,y);

Naturally, the devil is in the details, but it really is that simple. The kernel, in this case, multiplies each element in x by the corresponding element in y and leaves the result in x. The example will process 1024 data items using one block of threads, and the block contains 1024 threads.

You’ll also want to wait for the threads to finish at some point. One way to do that is to call cudaDeviceSynchronize().
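
Putting those pieces together, a minimal host program looks something like the sketch below. This is not the article's example program (that one is linked further down); it's just an illustration using unified memory via cudaMallocManaged, with error checking omitted for brevity.

#include <stdio.h>

__global__
void scale(unsigned int n, float *x, float *y)
{
  int i = threadIdx.x;
  x[i] = x[i] * y[i];
}

int main(void)
{
  unsigned int n = 1024;
  float *x, *y;
  cudaMallocManaged(&x, n * sizeof(float));  // unified memory, visible to CPU and GPU
  cudaMallocManaged(&y, n * sizeof(float));
  for (unsigned int i = 0; i < n; i++) {
    x[i] = (float)i;
    y[i] = 2.0f;
  }
  scale<<<1, 1024>>>(n, x, y);  // one block of 1024 threads, one element per thread
  cudaDeviceSynchronize();      // wait for the GPU to finish before reading results
  printf("x[3] = %f\n", x[3]);  // expect 6.0
  cudaFree(x);
  cudaFree(y);
  return 0;
}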

By the way, I’m using C because I like it, but you can use other languages too. For example, the video from NVidia, below, shows how they do the same thing with Python.

Grids, Blocks, and More

The details are a bit uglier, of course, especially if you want to maximize performance. CUDA abstracts the video hardware from you. That’s a good thing because you don’t have to adapt your problem to specific video adapters. If you really want to know the details of the GPU you are using, you can query it via the API or use the deviceQuery example that comes with the developer’s kit (more on that shortly).

For example, here’s a portion of the output of deviceQuery for my setup:

CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1060 3GB"
CUDA Driver Version / Runtime Version 9.1 / 9.1
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 3013 MBytes (3158900736 bytes)
( 9) Multiprocessors, (128) CUDA Cores/MP: 1152 CUDA Cores
GPU Max Clock rate: 1772 MHz (1.77 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 1572864 bytes
. . .
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes

Some of this is hard to figure out until you learn more, but the key items are that there are nine multiprocessors, each with 128 cores. The clock is about 1.8 GHz and there’s a lot of memory. The other important parameter is that a block can have up to 1024 threads.

So what’s a thread? And a block? Simply put, a thread runs a kernel. Threads form blocks that can be one, two, or three dimensional. All the threads in one block run on one multiprocessor, although not necessarily simultaneously. Blocks are put together into grids, which can also have one, two, or three dimensions.

So remember the line above that said scale<<<1,1024>>>? That runs the scale kernel with a grid containing one block, and the block has 1024 threads in it. Confused? It will get clearer as you try using it, but the idea is to group threads that can share resources and run them in parallel for better performance. CUDA makes what you ask for work on the hardware you have, up to some limits (like the 1024 threads per block, in this case).
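
To give a feel for multi-dimensional launches, here's a hedged sketch of a two-dimensional grid working on an image-like buffer. The kernel name and arguments are invented for illustration, and the launch assumes width, height, and pixels have already been set up on the host:

__global__
void brighten(unsigned int width, unsigned int height, float *pixels)
{
  // each thread works out its own (x, y) coordinate from its block and thread IDs
  unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < width && y < height)   // skip threads that fall outside the image
    pixels[y * width + x] *= 1.1f;
}

// on the host: 16x16 threads per block, enough blocks to cover the whole image
dim3 block(16, 16);
dim3 grid((width + 15) / 16, (height + 15) / 16);
brighten<<<grid, block>>>(width, height, pixels);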

Grid Stride Loop

One of the things we can do, then, is make our kernels smarter. The simple example kernel I showed you earlier processed exactly one data item per thread. If you have enough threads to handle your data set, then that’s fine. Usually, that’s not the case, though. You probably have a very large dataset and you need to do the processing in chunks.

Let’s look at a dumb but illustrative example. Suppose I have ten data items to process. This is dumb because using the GPU for ten items is probably not effective due to the overhead of setting things up. But bear with me.

Since I have a lot of multiprocessors, it is no problem to ask CUDA for one block that contains ten threads. However, you could also ask for two blocks of five. In fact, you could ask for one block of 100 and it will dutifully create 100 threads. Your kernel would need to ignore any of the threads that would cause you to access data out of bounds. CUDA is smart, but it isn’t that smart.

The real power, however, is when you specify fewer threads than you have items. This will require a grid with more than one block and a properly written kernel can compute multiple values.

Consider this kernel, which uses what is known as a grid stride loop:

__global__
void scale(unsigned int n, float *x, float *y)
{
 unsigned int i, base=blockIdx.x*blockDim.x+threadIdx.x, incr=blockDim.x*gridDim.x;
 for (i=base;i<n;i+=incr) // note that i>=n is discarded
   x[i]=x[i]*y[i];
}

This does the same calculations but in a loop. The base variable is the index of the first data item to process. The incr variable holds how far away the next item is. If your grid only has one block with at least n threads, this degenerates to a single pass through the loop. For example, if n is 10 and we have one block of ten threads, then each thread will get a unique base (from 0 to 9) and an increment of ten. Since adding ten to any of the base numbers will exceed n, the loop will only execute once in each thread.

However, suppose we ask for one block of five threads. Then thread 0 will get a base of zero and an increment of five. That means it will compute items 0 and 5. Thread 1 will get a base of one with the same increment so it will compute 1 and 6.

Of course, you could also ask for a block size of one and ten blocks which would have each thread in its own block. Depending on what you are doing, all of these cases have different performance ramifications. To better understand that, I’ve written a simple example program you can experiment with.
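
As a hedged illustration (not taken from the example program), here's how a few of those configurations look as launches of the grid-stride kernel above. All of them produce the same result; they just distribute the work differently:

scale<<<1, 10>>>(10, x, y);  // one block of ten threads: each thread handles one item
scale<<<1, 5>>>(10, x, y);   // one block of five threads: each thread handles two items
scale<<<10, 1>>>(10, x, y);  // ten blocks of one thread each
// a common general-purpose pattern: pick a block size, then compute how many blocks cover n
unsigned int n = 10, threads = 256;
unsigned int blocks = (n + threads - 1) / threads;
scale<<<blocks, threads>>>(n, x, y);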

Software and Setup

Assuming you have an NVidia graphics card, the first thing you have to do is install the CUDA libraries. You might have a version in your Linux repository but skip that. It is probably as old as dirt. You can also install for Windows (see video, below) or Mac. Once you have that set up, you might want to build the examples, especially the deviceQuery one to make sure everything works and examine your particular hardware.

You have to run the CUDA source files, which by convention have a .cu extension, through nvcc instead of your system C compiler. This lets CUDA interpret the special things, like the angle brackets around a kernel invocation.
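
For example, assuming the kernel and host code live in a file called scale.cu (the file name here is just for illustration), the build is a one-liner:

nvcc scale.cu -o scale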

An Example

I’ve posted a very simple example on GitHub. You can use it to do some tests on both CPU and GPU processing. The code creates some memory regions and initializes them. It also optionally does the calculation using conventional CPU code. Then it also uses one of two kernels to do the same math on the GPU. One kernel is what you would use for benchmarking or normal use. The other one has some debugging output that will help you see what’s happening but will not be good for execution timing.

Normally, you will pick CPU or GPU, but if you do both, the program will compare the results to see if there are any errors. It can optionally also dump a few words out of the arrays so you can see that something happened. I didn’t do a lot of error checking, so that’s handy for debugging because you’ll see the results aren’t what you expect if an error occurred.

The program has built-in help text describing the available options.

So to do the tests to show how blocks and grids work with ten items, for example, try these commands:

./gocuda g p d bs=10 nb=1 10
./gocuda g p d bs=5 nb=1 10

To generate large datasets, you can make n negative and it will take it as a power of two. For example, -4 will create 16 samples.

Is it Faster?

Although it isn’t super scientific, you can use any method (like time on Linux) to time the execution of the program when using GPU or CPU. You might be surprised that the GPU code doesn’t execute much faster than the CPU and, in fact, it is often slower. That’s because our kernel is pretty simple and modern CPUs have their own tricks for doing processing on arrays. You’ll have to venture into more complex kernels to see much benefit. Keep in mind there is some overhead to set up all the memory transfers, depending on your hardware.

You can also use nvprof — included with the CUDA software — to get a lot of detailed information about things running on the GPU. Try putting nvprof in front of the two example gocuda lines above. You’ll see a report that shows how much time was spent copying memory, calling APIs, and executing your kernel. You’ll probably get better results if you leave off the “p” and “d” options, too.
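
For instance, dropping those options, profiling the ten-thread configuration would look like this (the exact report will vary with your hardware and CUDA version):

nvprof ./gocuda g bs=10 nb=1 10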

For example, on my machine, using one block with ten threads took 176.11 microseconds. By using one block with five threads, that time went down to 160 microseconds. Not much, but it shows how doing more work in one thread cuts the thread setup overhead, which can add up when you are doing a lot more data processing.

OpenCL

OpenCL has a lot of the same objectives as CUDA, but it works differently. Some of this is necessary since it handles many more devices (including non-NVidia hardware). I won’t comment much on the complexity, but I will note that you can find a simple example on GitHub, and I think you’ll agree that if you don’t know either system, the CUDA example is a lot easier to understand.

Next Steps

There’s lots more to learn, but that’s enough for one sitting. You might skim the documentation to get some ideas. You can compile just-in-time if your code is more dynamic, and there are plenty of ways to organize memory and threads. The real challenge is getting the best performance by sharing memory and optimizing thread usage. It is somewhat like chess: you can learn the moves, but becoming a good player takes more than that.

Don’t have NVidia hardware? You can even do CUDA in the cloud now. You can check out the video for NVidia’s setup instructions.

Just remember, when you create a program that processes a few megabytes of image or sound data, that you are controlling a supercomputer that would have made [Seymour Cray’s] mouth water back in 1976.

Neural Nets in the Browser: Why Not?
https://hackaday.com/2017/08/04/neural-nets-in-the-browser-why-not/ (Fri, 04 Aug 2017)

We keep seeing more and more TensorFlow neural network projects. We also keep seeing more and more things running in the browser. You don’t have to be Mr. Spock to see this one coming. TensorFire runs neural networks in the browser and claims that WebGL allows it to run as quickly as it would on the user’s desktop computer. The main page is a demo that stylizes images, but if you want more detail you’ll probably want to visit the project page instead. You might also enjoy the video from one of the creators, [Kevin Kwok], below.

TensorFire has two parts: a low-level language for writing massively parallel WebGL shaders that operate on 4D tensors, and a high-level library for importing models from Keras or TensorFlow. The authors claim it will work on any GPU and, in some cases, will actually be faster than running native TensorFlow.

This is a logical progression of using WebGL to do browser-based parallel processing, which we’ve covered before. The work has been done by a group of recent MIT graduates who applied for (and received) an AI Grant for their work. We wonder if some enterprising Hackaday readers might not get some similar financing (be aware, you have to apply by the end of August).

If you have been itching to learn more about TensorFlow, we’ve covered it in depth. If you want the bare-bones example, we’ve looked at that, too.

Thanks [Patrick] for the tip.

 

1000 CPUs on a Chip
https://hackaday.com/2016/06/20/1000-cpus-on-a-chip/ (Mon, 20 Jun 2016)

Often, CPUs that work together operate on SIMD (Single Instruction, Multiple Data) or MIMD (Multiple Instruction, Multiple Data) principles, two categories from Flynn’s taxonomy. For example, your video card probably has the ability to apply a single operation (an instruction) to lots of pixels simultaneously (multiple data). Researchers at the University of California, Davis recently constructed a single chip with 1,000 independently programmable processors onboard. The device is energy efficient and can compute up to 1.78 trillion instructions per second.

The KiloCore chip (not to be confused with the 2006 Rapport chip of the same name) has 621 million transistors and uses special techniques to be energy efficient, an important design feature when dealing with so many CPUs. Each processor operates at up to 1.78 GHz and can shut itself down when not needed; 1,000 cores each issuing one instruction per cycle at that rate is where the 1.78 trillion instructions per second figure comes from. The team reports that even when computing 115 billion instructions per second, the device only consumes about 700 milliwatts.

Unlike some multicore designs that use a shared memory area to communicate between processors, the KiloCore allows processors to communicate directly. If you are just a diehard Arduino user, maybe you could scale up this design. Or, if you want to make use of the unused power in your video card under Linux, you can always try to bring KGPU up to date.

Tote Boards: the Impressive Engineering of Horse Gambling
https://hackaday.com/2015/11/04/tote-boards-the-impressive-engineering-of-horse-gambling/ (Wed, 04 Nov 2015)

Horse racing has been around since the time of the ancient Greeks. Often called the sport of kings, it was an early platform for making friendly wagers. Over time, private bets among friends gave way to bookmaking, and the odds of winning skewed in favor of a new concept called the “house”.

During the late 1860s, an entrepreneur in Paris named Joseph Oller invented a new form of betting he called pari-mutuel. In this method, bettors wager among themselves instead of against the house. Bets are pooled together and the winnings divided among the bettors. Pari-mutuel betting creates more organic odds than ones given by a profit-driven bookmaker.

Oller’s method caught on quite well. It brought fairness and transparency to betting, which made it even more attractive. It takes a lot of quick calculations to show real-time bet totals and changing odds, and human adding machines presented a bottleneck. In the early 1900s, a man named George Julius would change pari-mutuel technology forever by making an automatic vote-counting machine in his garage.

"Sunday Cockfight at Madrid" by (artist not specified) - wood engraving published in Harper's Weekly, September 1873.. Licensed under Public Domain via Commons.
Sunday Cockfight at Madrid” by (artist not specified) – wood engraving published in Harper’s Weekly, September 1873.

Gambler vs. Gambler

Horse racing was an extremely popular source of entertainment in nineteenth century Europe, due in part to the economic upswing of the Industrial Revolution. Racing’s popularity was boosted further by pari-mutuel betting. Joseph Oller came up with the method in Spain while watching arguments break out over cock fighting bets. He created the pari-mutuel system to benefit the learned bettor. Essentially, he sought to cut out the bookmaker and his ability to fix the odds. Instead of each gambler betting against the house, Oller’s method pits gambler against gambler. The odds of winning are in flux until the betting period ends.

In the pari-mutuel betting system, all the bets for a given horse are pooled together. After the winner is determined, a commission percentage is taken from the grand total of all bets placed; this goes to whoever owns the means to run the betting. The remaining amount is divided by the total amount wagered on the winner, giving x profit per dollar wagered. If this works out to, say, $10 per $1 wager, then the odds of winning were 10-1. The various systems used to tally the bets came to be called totalisators.
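
To make the arithmetic concrete, here is a tiny sketch in C. Every figure in it is invented purely for illustration:

#include <stdio.h>

int main(void)
{
    double pool_total  = 100000.0;  // grand total wagered on the race
    double commission  = 0.15;      // cut taken by whoever runs the betting
    double winner_pool = 8500.0;    // total wagered on the winning horse

    double payout_pool = pool_total * (1.0 - commission);
    double per_dollar  = payout_pool / winner_pool;

    // 100000 * 0.85 / 8500 = 10, the "$10 per $1 wager" (10-1) case described above
    printf("Payout per dollar wagered: %.2f\n", per_dollar);
    return 0;
}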

 

Early totalisator used at Auckland. Image source: The Rutherford Journal

Totali-what?

In essence, a totalisator or tote board is made up of a number of counters that are used to display running totals. The term quickly became synonymous with pari-mutuel betting. Tote boards are largely associated with sports betting, but they are also used to keep track of and display pledge amounts during telethons.

In early pari-mutuel history, bet tallies at the racetrack were kept manually on chalkboards. As pari-mutuel betting increased in popularity, the totalisator concept was adapted to keep up with real-time demand. A number of different small machines were built to do the counting and brought to the track as a betting alternative. Some tote board owners went out and met the crowds in wagons.

Bettors placed more trust in the machines than they did the guys with chalkboards, but their confidence was a bit misplaced. The machines were ultimately operated by humans, some of whom were not above entering phony bets. Even so, horse racing continued to grow in popularity. Several tote boards sitting side by side were necessary to keep up with the demand. At larger racetracks, the small and portable tote boards began to move into dedicated buildings so they could handle more bets.

George Julius’ automatic tote prototype. Image source: The Rutherford Journal

Electing a Winner

George Julius was a lifelong engineer. He took an early interest in mechanical operations, particularly those of clockwork. Julius was born in England, but moved to Australia and later New Zealand as his father was promoted within the Anglican Church. Julius studied mechanical engineering and worked in both railway and timber engineering in Australia.

In his spare time, Julius built a machine to automatically count election votes. He presented it to both the Western Australian and Federal governments, but neither one accepted his design. A friend took Julius to a nearby racetrack to show him another possibility for his machine. Because of his religious upbringing, he had never been exposed to horse racing or gambling. Julius was intrigued by the logistical problems inherent in pari-mutuel betting, and sought to create a device that could handle all the parallel arithmetic. He spent the next four years building a small automatic totalisator in his garage.

 

The first automatic totalisator installation. Ellerslie racetrack, Auckland, New Zealand. Image source: The Rutherford Journal
Inside the tote at Ellerslie. Image source: The Rutherford Journal

Multi-Story Computing

Julius’ first commercial automatic tote was installed in 1913 at Ellerslie racetrack in Auckland, New Zealand. The machine was so large that it required its own multi-story building called a tote house. The Ellerslie machine could perform simultaneous bet summation for up to thirty horses. It displayed in real time the approximate odds for each horse to win, the total running amount wagered on each horse, and a grand total of wagers made in the event. The first floor of the tote house had thirty ticketing windows where bets were placed. The rest of the building was devoted to totalisator machinery. The tallied bets and approximate odds were displayed in the second floor windows of the tote house. These numerical displays were actually a part of the machine—huge, readable numbers on counter wheels.

This first automatic totalisator was completely mechanical and operated similarly to clockwork. Power came from large iron weights attached to bicycle chains draped over drive sprockets. The Ellerslie machine was only used for five years before it was replaced by an electromechanical tote. This marked the beginning of Automatic Totalisators Limited (ATL), which went on to dominate the international market for the next 50 years.

Levers used to place bets at Ellerslie race track. Image source: The Rutherford Journal

The betting process begins when a ticketing agent pulls a lever corresponding to the horse chosen by the bettor. This lever tugs at one of the 900 steel wires running overhead: one wire for each of the thirty horses at each of the thirty ticket windows. You can just make them out in the upper right corner of this picture. Bets were taken in the smallest monetary units, and each pull of the lever incremented the bet.

In order to convert the parallel input from all the ticket windows to serial tallies for each horse, Julius invented a mechanism he called a shaft adder. The totalisator at Ellerslie racetrack had one of these mechanical differential adders for each horse. A shaft adder consisted of several sets of epicyclic gears situated along a common shaft. An escapement wheel attached to each gear set prevents it from rotating freely. The shaft adders are summed together to form the running grand total, which is displayed at the top of the tote board. A separate mechanism gave approximate odds using the horse’s current bet total, the current grand total, and some trigonometry.

George Julius’ shaft adder diagram. Image source: The Rutherford Journal

Tote boards quickly became electromechanical. Instead of large weights, the counters were driven by motors. Multiplexed rotary switches allowed the ticketing machines to share an escapement wheel on the shaft adder.

ATL installed totalisators all over the world, completely dominating the market until digital computers made them obsolete. Tote boards were among the first real-time multi-user systems, and helped pave the way for parallel processing.

Main image credit: Brian Conlon
