Module reference¶

Each module encapsulates a unit of functionality in the parameter estimation project. The following is a brief description of each module and its most important classes, functions, and variables.

Note

Some downloaded data and results of long-running functions are saved on disk in the cache directory, which will be created as needed. However, there is no logic for updating the cache if the code is modified, which means cache files may need to be occassionally manually deleted.

Package init file¶

Source code: src/__init__.py

Project initialization and common objects.

src.systems = ['PbPb5020']¶: Sets the collision systems for the entire project, where each system is a string of the form '<projectile 1><projectile 2><beam energy in GeV>', such as 'PbPb2760', 'AuAu200', 'pPb5020'. Even if the project uses only a single system, this should still be a list of one system string.

src.keys = ['lambda_jet', 'alpha_s']¶: Design attribute. This is a list of strings describing the inputs. The default is for the example data.

src.labels = ['\\Lambda_{jet}', '\\alpha_s}']¶: Design attribute. This is a list of input labels in LaTeX for plotting. The default is for the example data.

src.ranges = [(0.01, 0.3), (0.05, 0.35)]¶: Design attribute. This is list of tuples of (min,max) for each design input. The default is for the example data.

src.design_array = None¶: Design array to use - should be a numpy array. Keep at None generate a Latin Hypercube with above (specified) range. Design array for example is commented under default.

src.data_list = None¶

Dictionary of the model output. Form MUST be data_list[system][observable][subobservable][{‘Y’: ,’x’: }].

‘Y’ is an (n x p) numpy array of the output.

‘x’ is a (1 x p) numpy array of numeric index of columns of Y (if exists). In the example data, x is p_T.

This MUST be changed from None - no built-in default exists. Uncomment the line below default for example.

src.exp_data_list = None¶

Dictionary of the experimental data. Form MUST be exp_data_list[system][observable][subobservable][{‘y’:,’x’:,’yerr’:{‘stat’:,’sys’}}].

‘y’ is a (1 x p) numpy array of experimental data.

‘x’ is a (1 x p) numpy array of numeric index of columns of Y (if exists). In the example data, x is p_T.

‘yerr’ is a dictionary with keys ‘stat’ and ‘sys’.

‘stat’ is a (1 x p) array of statistical errors.

‘sys’ is a (1 x p) array of systematic errors.

This MUST be changed from None - no built-in default exists. Uncomment the line below default for example.

src.exp_cov = None¶: Experimental covariance matrix. Set exp_cov = None to have the script estimate the covariance matrix. Example commented below default.

src.observables = [('R_AA', [None])]¶: Observables to emulate as a list of 2-tuples (obs, [list of subobs]).

src.parse_system(system)[source]¶: Parse a system string into a pair of projectiles and a beam energy.

Design¶

Source code: src/design.py

Generates Latin-hypercube parameter designs.

When run as a script, writes input files for use with my heavy-ion collision event generator. Run python -m src.design --help for usage information.

Warning

This module uses the R lhs package to generate maximin Latin-hypercube samples. As far as I know, there is no equivalent library for Python (I am aware of pyDOE, but that uses a much more rudimentary algorithm for maximin sampling).

This means that R must be installed with the lhs package (run install.packages('lhs') in an R session).

src.design.generate_lhs(npoints, ndim, seed)[source]¶: Generate a maximin Latin-hypercube sample (LHS) with the given number of points, dimensions, and random seed.

class src.design.Design(system, keys=['lambda_jet', 'alpha_s'], ranges=[(0.01, 0.3), (0.05, 0.35)], labels=['\Lambda_{jet}', '\alpha_s}'], array=None, npoints=500, validation=False, seed=None)[source]¶

Latin-hypercube model design.

Creates a design for the given system with the given number of points. Creates the main (training) design if validation is false (default); creates the validation design if validation is true. If seed is not given, a default random seed is used (different defaults for the main and validation designs).

Public attributes:

system: the system string
projectiles, beam_energy: system projectile pair and beam energy
type: ‘main’ or ‘validation’
keys: list of parameter keys
labels: list of parameter display labels (for TeX / matplotlib)
range: list of parameter (min, max) tuples
min, max: numpy arrays of parameter min and max
ndim: number of parameters (i.e. dimensions)
points: list of design point names (formatted numbers)
array: the actual design array

The class also implicitly converts to a numpy array.

This is probably the worst class in this project, and certainly the least generic. It will probably need to be heavily edited for use in any other project, if not completely rewritten.

write_files(basedir)[source]¶: Write input files for each design point to basedir.

Emulator¶

Source code: src/emulator.py

Trains Gaussian process emulators.

When run as a script, allows retraining emulators, specifying the number of principal components, and other options (however it is not necessary to do this explicitly — the emulators will be trained automatically when needed). Run python -m src.emulator --help for usage information.

Uses the scikit-learn implementations of principal component analysis (PCA) and Gaussian process regression.

class src.emulator.Emulator(system, npc=10, nrestarts=0)[source]¶

Multidimensional Gaussian process emulator using principal component analysis.

The model training data are standardized (subtract mean and scale to unit variance), then transformed through PCA. The first npc principal components (PCs) are emulated by independent Gaussian processes (GPs). The remaining components are neglected, which is equivalent to assuming they are standard zero-mean unit-variance GPs.

This class has become a bit messy but it still does the job. It would probably be better to refactor some of the data transformations / preprocessing into modular classes, to be used with an sklearn pipeline. The classes would also need to handle transforming uncertainties, which could be tricky.

classmethod from_cache(system, retrain=False, **kwargs)[source]¶: Load the emulator for system from the cache if available, otherwise train and cache a new instance.

predict(X, return_cov=False, extra_std=0)[source]¶

Predict model output at X.

X must be a 2D array-like with shape (nsamples, ndim). It is passed directly to sklearn GaussianProcessRegressor.predict().

If return_cov is true, return a tuple (mean, cov), otherwise only return the mean.

The mean is returned as a nested dict of observable arrays, each with shape (nsamples, n_cent_bins).

The covariance is returned as a proxy object which extracts observable sub-blocks using a dict-like interface:

>>> mean, cov = emulator.predict(X, return_cov=True)

>>> mean['dN_dy']['pion']
<mean prediction of pion dN/dy>

>>> cov[('dN_dy', 'pion'), ('dN_dy', 'pion')]
<covariance matrix of pion dN/dy>

>>> cov[('dN_dy', 'pion'), ('mean_pT', 'kaon')]
<covariance matrix between pion dN/dy and kaon mean pT>

The shape of the extracted covariance blocks are (nsamples, n_cent_bins_1, n_cent_bins_2).

NB: the covariance is only computed between observables and centrality bins, not between sample points.

extra_std is additional uncertainty which is added to each GP’s predictive uncertainty, e.g. to account for model systematic error. It may either be a scalar or an array-like of length nsamples.

sample_y(X, n_samples=1, random_state=None)[source]¶

Sample model output at X.

Returns a nested dict of observable arrays, each with shape (n_samples_X, n_samples, n_cent_bins).

MCMC¶

Source code: src/mcmc.py

Markov chain Monte Carlo model calibration using the affine-invariant ensemble sampler (emcee).

This module must be run explicitly to create the posterior distribution. Run python -m src.mcmc --help for complete usage information.

On first run, the number of walkers and burn-in steps must be specified, e.g.

python -m src.mcmc --nwalkers 500 --nburnsteps 100 200

would run 500 walkers for 100 burn-in steps followed by 200 production steps. This will create the HDF5 file mcmc/chain.hdf (default path).

On subsequent runs, the chain resumes from the last point and the number of walkers is inferred from the chain, so only the number of production steps is required, e.g.

python -m src.mcmc 300

would run an additional 300 production steps (total of 500).

To restart the chain, delete (or rename) the chain HDF5 file.

class src.mcmc.Chain(path=Path('mcmc/chain.hdf'))[source]¶

High-level interface for running MCMC calibration and accessing results.

Currently all design parameters except for the normalizations are required to be the same at all beam energies. It is assumed (NOT checked) that all system designs have the same parameters and ranges (except for the norms).

log_posterior(X, extra_std_prior_scale=0.05, model_sys_error=False)[source]¶

Evaluate the posterior at X.

extra_std_prior_scale is the scale parameter for the prior distribution on the model sys error parameter:

prior ~ sigma^2 * exp(-sigma/scale)

This model sys error parameter is not by default implemented.

random_pos(n=1)[source]¶: Generate n random positions in parameter space.

run_mcmc(nsteps, nburnsteps=None, nwalkers=None, status=None)[source]¶: Run MCMC model calibration. If the chain already exists, continue from the last point, otherwise burn-in and start the chain.

open(mode='r')[source]¶: Return a handle to the chain HDF5 file.

dataset(mode='r', name='chain')[source]¶

Context manager for quickly accessing a dataset in the chain HDF5 file.

>>> with Chain().dataset() as dset:
        # do something with dset object

load(*keys, thin=1)[source]¶: Read the chain from file. If keys are given, read only those parameters. Read only every thin’th sample from the chain.

samples(n=1)[source]¶: Predict model output at n parameter points randomly drawn from the chain.

src.mcmc.credible_interval(samples, ci=0.9)[source]¶: Compute the highest-posterior density (HPD) credible interval (default 90%) for an array of samples.

Plots and figures¶

Source code: src/plots.py

Generates plots / figures when run as a script. Plot files are placed in the plots directory.

By default, simply running python -m src.plots generates ALL plots, which may not be desired. Instead, one can pass a list of plots to generate: python -m src.plots plot1 plot2 .... The full list of plots is shown in the usage information python -m src.plots --help.

Typing can be reduced by using shell brace expansion, e.g. python -m src.plots observables_{design,posterior} for both observables_design and observables_posterior. In addition, plots may be given as paths to plot filenames, which enables shell globbing, e.g. python -m src.plots plots/observables_*.

In the code, each plot is generated by a function tagged with the @plot decorator.

src.plots.posterior()[source]¶: Pairplot of posteriors for all calibration inputs. Diagonal displays marginal density. Lower off-diagonal displays pairwise scatter plot.

src.plots.observables_design()[source]¶: Model observables at design points, with experimental data plotted as reference. For different observables than the example, change the dictionary in _observables_plot()

src.plots.observables_posterior()[source]¶: Model observables at 100 draws from the posterior, with experimental data plotted as reference. For different observables than the example, change the dictionary in _observables_plot()

src.plots.diag_emu(system='PbPb5020')[source]¶: Diagnostic: plots of each principal component vs each input parameter, overlaid by emulator predictions at several points in design space. See how well the emulators track the design points, if uncertainty and shape of predictions are reasonable.

src.plots.design()[source]¶: Projection of a LH design into two dimensions. Change keys within the function to the two inputs you want to protect into.

src.plots.gp()[source]¶: Conditioning a Gaussian process. Simple example plots with dummy data.