Welcome to GCMCWorkflow’s documentation!¶
Welcome to the documentation for GCMCWorkflow, a package for automating Grand Canonical Monte Carlo sampling of porous materials.
Installing GCMCWorkflow¶
Installing the GCMCWorkflow package¶
The easiest way to install the GCMCWorkflow Python package is using Anaconda from the conda-forge channel:
conda install -c conda-forge gcmcworkflow
or pip:
pip install gcmcworkflow
Setting up a Launchpad¶
GCMCWorkflow is powered by Fireworks and requires a functioning LaunchPad (database of jobs). For instructions on how to do this, please consult the official instructions Ultimately this should result in a my_launchpad.yaml file, which works with the various Fireworks commands such as lpad get_fws.
GCMCWorkflow tutorial¶
In this tutorial we will run three isotherms of Argon in IRMOF-1 using
GCMCWorkflow.
To follow this tutorial, you will need to have installed this package
succesfully.
To check this, try running gcmcworkflow -h
from the terminal,
you should see the help message for the command line tool we will
be using.
Preparing input files¶
In order to run our GCMC simulations we must provide a single working input for Raspa. This file will be used as a template for all further simulations, with the temperature, pressure and simulation duration settings being customised as necessary. The forcefield setup and MC move descriptions will not be modified however, so these should be checked!
Preparing the spec file¶
With the Raspa input ready, we now create the input file for GCMCWorkflow, know as the ‘spec’ file. This file will have all the information on sampling we want to perform, using the Raspa input we just prepared.
A template for this spec file can be generated using the
gcmcworkflow genspec
command.
Now edit the spec file to read as follows,
these settings will be explained below.
name: IRMOF1_Ar_tutorial
template: ./irmof/
workdir: ./
pressures: [10k, 20k, 40k]
temperatures: [78.0, 98.0, 118.0]
- Explaining the spec file lines:
- name – this is a unique identifier for this Workflow. It must be unique on the LaunchPad and will be used to refer to this Workflow in the future
- template – path to the simulation input we want to use
- workdir – path to where we want to run the simulations
Submitting work to LaunchPad¶
Now we’ve defined our input and desired sampling we need to submit this information to our LaunchPad.
- check launchpad connectivity
Running our Workflow¶
- check Raspa installation
Checking our results¶
Notes on using in practice¶
- Using GCMCWorkflow on a cluster
- packing jobs together in scripts
- checking progress of jobs
- accessing results once finished
- debugging failures
Method descriptions¶
Equilibrium finding¶
Each individual sampling point that is simulated will initially start with an empty system. As the simulation progresses the system will gradually be filled with gas molecules until an equilibrium amount is reached. Data can only be gathered from the system once equilibrium has been reached.
An example timeseries from a simulation is given below, with the right hand panel showing a close up of the data.

This page details the algorithm which detects if and when a simulation has reached equilibrium. This method attempts to find the earliest point in the timeseries where equilibrium has been reached. It will also be responsible for detecting when equilibrium has not been reached.
Monotonic fit¶
From inspecting the timeseries, it is immediately obvious there is some “noise” in the signal. This isn’t actually noise, but instead the natural oscillations around the mean value in the system. To help smooth the data, a monotonic regression is fitted to the entire timeseries. A monotonic regression can be thought of as a rolling line of best fit which can only increase or remain constant, but never decrease. As previously mentioned, the system initially starts at 0 and will rise to an equilibrium value, therefore the monotonic fit provides a good model of the smoothed behaviour of the timeseries.

Estimating the mean and standard deviation¶
Once the signal has reached equilibrium, it will oscillate around a mean value. We therefore want to identify a value which represents the lower bound of what is possible. To do this we assume that the final half of the signal can be used to represent the mean and standard deviation of the timeseries. We then choose a value of 2 standard deviations below the mean to represent our lower bound. We will of course later check the assumptions made here. It is important to note that these values will not be correct estimates of the data, but they are useful for identifying the equilibration point.

Finding the equilibration point¶
We can then combine these two pieces of information together to find where our model of the smoothed timeseries first crosses our estimate of the lower bound of the data. This interception is then proposed as the equilibration point, with all values in the timeseries after this point to be used as data.

Checking our assumptions¶

At this point we must check our hypothesis that this is the equilibration point. To do this, an augmented Dickey-Fuller test is used to check if the timeseries after the proposed equilibration point is stationary (ie flat). If the timeseries is found to have statistically significant drift, then the hypothesis is rejected and further simulation is required to gather data.
Automatic Runlength¶
How automatic runlength works
- once eq is found, we can measure how much data we have produced
- use heuristic of g ~~ n_eq [from Mol. Sim paper]
- sampling is defined in terms of number of g to collect
- if enough g hasn’t been collected, a restart is issued

Crash Recovery¶
When running simulations on a HPC cluster, there is often a maximum wall time for your job, typically 1 or 2 days. However for some GCMC simulations, especially at higher pressures, this is not long enough to collect enough data. GCMCWorkflow is able to work around this problem by allowing the restart of terminated simulations.
After a RunSimulation task has been performed, the PostProcess task checks if the simulation exited correctly. If it finds that the simulation didn’t exit correctly, then it ascertains how long it should have ran for, and how long the simulation actually ran for. It then parses what data was generated (so this is not wasted), then creates a new RunSimulation stage in the Workflow which will complete the remaining number of steps.
If a job was killed due to walltime constraints, the job will naturally not have been able to communicate that it was killed. These jobs will therefore appear as if they are still running, even after the job on the HPC cluster has terminated. Manual intervention is required to find such jobs and mark them as finished; the following Fireworks command can be used:
``lpad detect_lostruns --fizzle``
This finds jobs which are “lost”, ie haven’t reported their status in a long time, and marks them as “FIZZLED”, the Fireworks term for failure.
Description of how all the magic works