**Diffusion or no diffusion**

Particle advection is like dumping tea crumps in hot water and whirling them around. If you stir the tea, the little particles get moved around by the flow, but they technically don't mix. Diffusion, an important physical mechanism, is missing. It would be absolutely possible to reverse the tea particle "mixing" and put all the little crumps back to where they were before you started stirring.

However, if you try the same with milk then the milk gets mixed with the tea and you won't be able to separate tea from milk again. So diffusion is not reversible, and actually if you've ever seen a movie of diffusion played in reverse your brain tells you immediately that something is wrong. That's because our brain has implicitly learned the 2nd law thermodynamics, although you may not know what it means. For our tea it means that you could (theoretically) stirr the tea such that the particles move back to where they came from. One probably would need to be quite good in stirring though, but nevertheless. In a sense, for particle advection you can revert time without violating other laws of physics.

What happens when we add diffusion? Well a particle has a certain position, there's (unless you stirr too hard) no dilution of the particles. So it's either here or there but not half a particle here and another quarter there. And that's exactly the underlying problem of why diffusion is not reversible. Imagine you add milk, stirr and then measure that somewhere in the tea there's a droplet with 50% milk and 50% tea, let's call this a concentration of 0.5. You don't know whether it was created from mixing two droplets of concentrations 0 and 1 (as (0.5+0.5)/2 = 0.5) or whether one was 100% milk the other one 0% tea (i.e. (1+0)/2=0.5), or, or, or. There's many ways you could end up with such a milk concentration. So this information is lost due to diffusion.

A 7-day weather forecasts is now as accurate as a 5-day forecast 20 years ago, thanks to faster supercomputers and decades of scientific weather forecast model development. However, there is still a widespread public misbelief that forecasts are not reliable. Traditional meteograms, summarizing the weather for a certain location throughout the next week, rarely contain information about the reliability of a weather forecast. A paradigm exists where a layperson is not expected to understand the forecast uncertainty. Especially smartphone weather apps tend to present forecasts in an overly simplistic way that may support the aforementioned little public confidence in weather forecasting.

The proposed solution is an intuitive way of visualizing the probabilistic information of an ensemble forecast in the form of a meteogram, as the public is mostly interested in the weather at a given location. The meteogram is designed to be easily understandable for everyone without a statistics background. The temperature is plotted as a colour-shaded time series, where width and transparency denote the uncertainty and colour is used to provide an intuitive feeling for the temperature (blue=cold, red=warm). Strong winds can be seen on the bottom of that panel, where a windsock indicates the expected strength. Transparency denotes the uncertainty of the wind forecast. The upper panels provide information about cloudiness and precipitation. Rain and snow are split into three categories: light, medium and heavy intensity that are displayed in three rows, each with transparency to denote the uncertainty of a rainfall event at a given intensity. In case of snow, the rain droplets are changed into snow flakes. The precipitation panel therefore can display events like “very likely weak rain, but unlikely stronger” or “equally likely medium or heavy rain”. Cloudiness is presented in the top panel, indicating low, medium and high altitude clouds in different shades of grey. The thickness represents the cloud cover of each cloud type. Additionally, a lightning bolt may indicate once the chance of thunder is beyond a given threshold. Sunrise and sunset times are provided in the temperature panel.

The weather forecast for Alderaan City shows warm temperatures during the day around 20°C for the next days, although nights are cold. Weak rain and thunder has to be expected on Sunday. Monday will be the most beautiful day of the week with low cloud cover and temperatures around 14°C. During the week, temperatures are expected to drop and uncertainty increases towards next weekend. Strong winds have to be expected for Monday, Tuesday and Wednesday, that will peak on Wednesday noon. Probably strong rainfall from Tuesday onwards, increasing in likelihood and strength towards Thursdat and Friday. Full cloud cover throughout the week, but thinner on the next weekend. We believe that the additional information about the uncertainty of a forecast will increase public confidence in weather forecasting, important for a wider outreach of publicly funded data.

What can be so hard about summing a few numbers?

Recently, Nick Higham posted a blog article where he examined what happens if we compute the harmonic sum, which analytically diverges, with various binary number formats on a computer. Yes, indeed, they all converge because at some point the increment is smaller than half the distance to the next representable number in your given number format. The result is therefore round back, and the next result in your sum is as big as the previous one - that means the sum converged.

An interesting question is when that happens, and highly depends on your number format. Most people use, without thinking about it, 64bit double precision floating-point numbers (also known as Float64, fp64, double, etc.) as it's the standard. However, the future is 16bit. Maybe also 8bit, but for many applications certainly not 64bit, because it's an overkill and a unnecessary high precision that we rather would exchange for speed. That's why Google has 16bit support on their TPUs, and more and more of this kind of hardware is yet to come.

From the floating-point of view there are two strong candidates: Float16 (or fp16, half precision floats) and Bfloat16. The former has a higher precision of almost 4 decimal places but can only represent numbers between 6*10^-8 and 65504. The latter exchanges precision for a wider range that reaches from 10^-45 to 10^38. But there's also another horse in the race: Posits. The posit number format puts a lot of precision for numbers around 1, and has still a wide dynamic range of representable numbers. The current draft standard candidate is Posit(16,1), which represents numbers between 3*10^-9 and 3*10^8 yet has a precision of almost 5 decimal places for numbers around one.

Coming back to the harmonic sum. In order to compare Float16 and Bfloat16, Nick Higham computed the value the harmonic sum converges to using both number formats, and concludes that Float16 actually does a much better job, as the value of 7.0859 is reached after 513 terms. Bfloat16, instead, can only sum up 65 terms to yield 5.0625 before it converges. How would posits perform in comparison?

To test that, I wrote the following Julia function and import the posit arithmetic emulator SoftPosit.jl

Now executing harmsum with Float16 yields what Nick Higham calculated too. However, doing exactly the same with Posit16 (That's the Posit(16,1) format proposed as the new posit standard for 16bit numbers) shows that posits do much better than floats: The sum only converges after 1024 terms and yields 7.78, a value that neither Float16 nor Bfloat16 can produce.

Also the posit number format allows to have a wider dynamic range in exchange for precision. The Posit(16,2) format sacrifices a little bit of the precision for numbers around one but can represent numbers in the range between 10^-17 up to 7*10^16. In contrast to the Bfloat16 format, which is drastically faster converging compared to Float16, the Posit(16,2) format is doing even better on the harmonic series (here called Posit16_2):

The question obviously is, who cares about an idealised mathematical series that is somwhat irrelevant for real world applications that run on big supercomputers? I would claim there is at least a little bit to learn from this: In fact, regarding the harmonic sum as "adding a lot of small numbers to a big number", this is indeed something that happens often on many supercomputers. Think about a weather forecast model for example, where the state of the atmosphere is represented with a whole bunch of numbers. These models use time steps on the order of minutes. As you know from your own experience, the weather, in most cases, does not change that much from one minute to the other. This translates into computing as a rather large number (say temperature at a given location) gets updated with a lot of tiny increments throughout a day that eventually add up to the change from a freezing morning to a beautifully warm and sunny afternoon. Quite similar to the harmonic sum therefore.

Posits are not supported by high performance computing architectures, but this should rapidly change in the future.

To lazy to read? Then directly to github.com/milankl/swmone.

I always wanted to learn some parallel computing. What I mean is not just taking someone's parallelized code and then hurray the cluster is completely blocked and your colleagues start to write angry emails. No, I really wanted to get my hands dirty, writing the parallelization myself. But, you know, there is always more important things to do or you just have a lazy weekend while the cluster is working for you. Then there was Julia, and as weather and climate models heavily rely on MPI, I thought why not using MPI for Julia (instead of Julia's im-built parallelisation approaches). I came across an amazing blog post by Claudio Bellei, who parallelised a diffusion equation solver. However, and please don't judge me on that, I find diffusion problems quite boring. I want to have (at least a tiny bit of) action - so I decided to parallelise the non-linear shallow water equations. These equations also have advection, allow for waves to propagate and can be combined with diffusion too. To keep it simple I dropped 2D and considered 1D instead.

Some parallel code is fairly easy to understand: Say you want to analyse dataset X, which contains a large number of chunks and whatever you analyse in one chunk is completely independent from another. Some people call this embarassingly parallel problems. That's not what I was aiming for. Many people know the butterfly effect of chaotic systems such as weather: A little change on one side of the problem and eventually you get a huge change even on the other side of the globe. This means that weather forecasting is not an embarassingly parallel problem, as every computation will eventually influence every other computation. This influence takes places in terms of communication: information needs to exchanged between the chunks executed in parallel. Not just once, but again and again and again.

Okay, let's go. *First* a short idea what the shallow water model (and how simple it actually is in 1D); *second* the idea of domain decomposition; and *third *how to use MPI for the communication between the sub-domains.

We will use finite-differences on a staggered grid, where the velocity points sit in between the surface elevation points - to interpolate from one to the other we use a simple 2-point average.

These are the two functions that act as the only two operators needed for the 1D shallow water model: A gradient in the only spatial dimension x and an interpolation from one grid to the other.

Don't worry about the @boundscheck and the @inbounds, these are Julia-details to make the code run faster. With these functions, the function rhs then computes the right-hand side of the partial differential equations. Again don't worry about the @views - this is another Julia macro that you can ignore. First we compute from the surface elevation η the layer thickness h, we square the velocity u for the Bernoulli potential and perform some interpolations to get the variables on the correct grids. We include the viscous term as a 1D "stress tensor" in the Bernoulli potential such that in the momentum equations only experience the tendency of this new potential plus the wind forcing

For the continuity equation we evaluate it's tendency straight forward including one interpolation and one gradient. So far so good, these three functions work for the whole domain or even sub-domains, as long as the boundary conditions are contained within additional ghost-points to the left and right - but we will come back to that in a second.

The time integration is done using Runge-Kutta 4th order in a low-memory version where tendencies get summed-up on the go. Once you've written your own RK4 time integrator, that should be pretty easy to understand. This version requires a few copying back and forth between 3 different version of the prognostic variables u,u0 and u1. Using the ghost-point approach where the boundary conditions are contained within additional grid-points just outside the domain, we only have to make sure that before every evaluation of the right-hand side this ghost-point copy got executed, such that the gradient & interpolation functions do the right thing at the boundary, we will come to the details of that ghost_point! function later.

The idea of domain decomposition (and there is many other approaches to parallelisation - often highly dependent on the discretisation method) is to split the whole domain into a few smaller ones, such that one processor doesn't have to calculate the whole thing but only a part of it. In 1D, we simply divide the number of grid cells by the number of processes.

We number the sub-domains by 0,1,2,3... and let every process (numbered by its rank) know what it's left and right neighbour is. This already includes the boundary conditions, which we pick to be periodic. So process with rank 0 has neighbour n-1 to the left and 1 to the right, etc.

Once the domain is split up onto different processes, we simply use the ghost-point copy on every time step to copy values not within one array, but across different array on different processes, let's unwrap this idea:

Basically every MPI code starts with MPI.Init() and a communicator object MPI.COMM_WORLD. In our case, this just contains the information of many processes there are and gives every processes a rank, i.e. 0,1,2,3 etc. As with MPI the same code is executed by every processor, every processor will define the rank and size constant in line 185,186. "size" will be the same on each, but rank will obviously differ.

Before the time integration starts, we take the prognostic variables and pad them with zeros (simply as the ghost-point copy will be executed right thereafter). The ghost_points! function varies depending on whether the code is executed with a single process or in a parallel fashion:

(sequential) Due to the periodic boundary conditions, we take the 2nd entry of the array and copy it into the last and similarly the 2nd last replaces the first entry. This way, computing the gradient (or interpolation) on the boundary includes the correct information of the boundary condition. This is done for both prognostic variables.

(parallel) This is exactly the same except that the 2nd entry of the array is copied into the last of the left neighbour (and not into the very same array). So line 90 and 91, each process sends the 2nd / the 2nd last value (concatentated into an array with length two for both variables) to the respective left / right neighbour, via MPI.Isend. The third argument of that function is an identifier that tags the message you send (the array in the first argument) so that the receiving process (2nd argument) unambiguously can identify the arriving message. So it makes sense to tag this message somehow with the rank number, but as each process is sending and receiving two messages, to avoid overlap we simply offset the tags by 10 for all messages that go to the left and by 20 for all messages for messages that go to the right. But to be fair, that's a pretty random choice.

Once every process has send off these two messages to its neighbours, we will also receive the messages from the same neighbours, which happens in lines 94 and 95. To be precise, the function MPI.Irecv! only requests a message and already points its content to the variables in the first argument (hence the !-notation of Julia). The message will be received subsequently, but as many messages might be en route at the same time this does not happen immediately. Hence, the MPI.Irecv! function gives an object back that can be used with MPI.Waitall! to check that really all messages have been received and only then continue.

Then we simply copy the content of the received messages back into the arrays - et voila our ghost-point copy across processes succeeded!

Great, just one last thing: How do we receive all the data from all processes and gather them on one, for output/plotting purposes? MPI.Gather requests the variable in the first argument from all processes. The second argument specifies which process is actually requesting. Here, we identify rank 0 with the main process and the communicator comm is passed on as always. The result of the MPI.Gather function will be stored on the receiving processes, but is essentially a long array that simply concatenates all arrays from all processes. Hence, process 0 will rashape this array in the desired shape (that we will use later for plotting).

Finally, MPI.Barrier and MPI.Finalize() ends the communication between the processes and we continue with the plotting of the data to make sure that actually did the right thing. And as you can see in the little gif above, yeha it worked! If I were a proper computer scientist I would probably need to finish here with a little speed comparison, some weak or strong scaling, but I really wanted to keep it simple. I learned something from this little project, and that was definitely the main goal.

Climate scientist

@ University of Oxford

Climate computing, information theory, predictability, climate change, aviation.

#JuliaLang open-source developer, vegan, low CO₂, bicycles and co-op living, no borders, free education.

*354ppm. he|him

@ University of Oxford

Climate computing, information theory, predictability, climate change, aviation.

#JuliaLang open-source developer, vegan, low CO₂, bicycles and co-op living, no borders, free education.

*354ppm. he|him