Introduction
Measurements in High Energy Physics (HEP) rely on determining the compatibility of observed collision events with theoretical predictions. The relationship between them is often formalised in a statistical model \(f(\bm{x}|\fullset)\) describing the probability of data \(\bm{x}\) given model parameters \(\fullset\). Given observed data, the likelihood \(\mathcal{L}(\fullset)\) then serves as the basis to test hypotheses on the parameters \(\fullset\). For measurements based on binned data (histograms), the \(\HiFa{}\) family of statistical models has been widely used in both Standard Model measurements [intro-4] and searches for new physics [intro-5]. In this package, a declarative, plain-text format for describing \(\HiFa{}\)-based likelihoods is presented that is aimed at reinterpretation and long-term preservation in analysis data repositories such as HEPData [intro-3].
HistFactory
Statistical models described using \(\HiFa{}\) [intro-2] center around the simultaneous measurement of disjoint binned distributions (channels) observed as event counts \(\channelcounts\). For each channel, the overall expected event rate [1] is the sum over a number of physics processes (samples). The sample rates may be subject to parametrised variations, both to express the effect of free parameters \(\freeset\) [2] and to account for systematic uncertainties as a function of constrained parameters \(\constrset\). The degree to which the latter can cause a deviation of the expected event rates from the nominal rates is limited by constraint terms. In a frequentist framework these constraint terms can be viewed as auxiliary measurements with additional global observable data \(\auxdata\), which paired with the channel data \(\channelcounts\) completes the observation \(\bm{x} = (\channelcounts,\auxdata)\). In addition to the partition of the full parameter set into free and constrained parameters \(\fullset = (\freeset,\constrset)\), a separate partition \(\fullset = (\poiset,\nuisset)\) will be useful in the context of hypothesis testing, where a subset of the parameters are declared parameters of interest \(\poiset\) and the remaining ones as nuisance parameters \(\nuisset\).
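Thus, the overall structure of a \(\HiFa{}\) probability model is a product of the analysis-specific model term describing the measurements of the channels and the analysis-independent set of constraint terms:
\[f(\channelcounts, \auxdata \,|\, \freeset, \constrset) = \prod_{c \in \mathrm{channels}} \prod_{b \in \mathrm{bins}_c} \mathrm{Pois}\left(n_{cb} \,\middle|\, \nu_{cb}(\freeset, \constrset)\right) \prod_{\singleconstr \in \constrset} c_\singleconstr(a_\singleconstr|\,\singleconstr)\]
where within a certain integrated luminosity we observe \(n_{cb}\) events given the expected rate of events \(\nu_{cb}(\freeset,\constrset)\) as a function of unconstrained parameters \(\freeset\) and constrained parameters \(\constrset\). The latter have corresponding one-dimensional constraint terms \(c_\singleconstr(a_\singleconstr|\,\singleconstr)\) with auxiliary data \(a_\singleconstr\) constraining the parameter \(\singleconstr\). The event rates \(\nu_{cb}\) are defined as
\[\nu_{cb}(\fullset) = \sum_{s \in \mathrm{samples}} \nu_{scb}(\freeset, \constrset) = \sum_{s \in \mathrm{samples}} \left(\prod_{\kappa \in \bm{\kappa}} \kappa_{scb}(\freeset, \constrset)\right) \left(\nu_{scb}^0 + \sum_{\Delta \in \bm{\Delta}} \Delta_{scb}(\freeset, \constrset)\right) \tag{3}\]
The total rates are the sum over sample rates \(\nu_{scb}\), each determined from a nominal rate \(\nu_{scb}^0\) and a set of multiplicative and additive rate modifiers, denoted \(\bm{\kappa}(\fullset)\) and \(\bm{\Delta}(\fullset)\) respectively. These modifiers are functions of (usually a single) model parameter. Starting from constant nominal rates, one can derive the per-bin event rate modification by iterating over all sample rate modifications as shown in (3).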
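As a rough illustration of how the per-bin rate is assembled, the sketch below sums over samples, adds the additive modifications to the nominal rate, and then scales by the product of the multiplicative ones. The function name `expected_rates` and the dictionary layout are assumptions made for this example and are not part of the specification; applying the additive shifts before the multiplicative factors follows the usual HistFactory convention and is assumed here.

```python
from math import prod


def expected_rates(samples):
    """Return per-bin expected rates for one channel.

    Each (hypothetical) sample dict holds:
      "nominal": list of nominal rates, one per bin
      "kappas":  list of per-bin multiplicative factor lists
      "deltas":  list of per-bin additive shift lists
    """
    n_bins = len(samples[0]["nominal"])
    total = [0.0] * n_bins
    for sample in samples:
        for b in range(n_bins):
            # nominal rate plus all additive shifts for this bin
            shifted = sample["nominal"][b] + sum(d[b] for d in sample["deltas"])
            # product of all multiplicative factors (1 if there are none)
            scale = prod(k[b] for k in sample["kappas"])
            total[b] += scale * shifted
    return total


# Illustrative numbers only: one scaled signal and one shifted background.
signal = {"nominal": [5.0, 10.0], "kappas": [[2.0, 2.0]], "deltas": []}
background = {"nominal": [50.0, 60.0], "kappas": [], "deltas": [[1.0, -2.0]]}
rates = expected_rates([signal, background])
print(rates)
```

In a real model the entries of `kappas` and `deltas` would themselves be evaluated from the current parameter values before this sum is taken.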
As summarised in Modifiers and Constraints, rate modifications are defined in \(\HiFa{}\) for bin \(b\), sample \(s\), channel \(c\). Each modifier is represented by a parameter \(\phi \in \{\gamma, \alpha, \lambda, \mu\}\). By convention bin-wise parameters are denoted with \(\gamma\) and interpolation parameters with \(\alpha\). The luminosity \(\lambda\) and scale factors \(\mu\) affect all bins equally. For constrained modifiers, the implied constraint term is given as well as the necessary input data required to construct it. \(\sigma_b\) corresponds to the relative uncertainty of the event rate, whereas \(\delta_b\) is the event rate uncertainty of the sample relative to the total event rate \(\nu_b = \sum_s \nu^0_{sb}\).
Modifiers implementing uncertainties are paired with a corresponding default constraint term on the parameter limiting the rate modification. The available modifiers may affect only the total number of expected events of a sample within a given channel, i.e. only change its normalisation, while holding the distribution of events across the bins of a channel, i.e. its “shape”, invariant. Alternatively, modifiers may change the sample shapes. Here \(\HiFa{}\) supports correlated and uncorrelated bin-by-bin shape modifications. In the former, a single nuisance parameter affects the expected sample rates within the bins of a given channel, while the latter introduces one nuisance parameter for each bin, each with its own constraint term. For the correlated shape and normalisation uncertainties, \(\HiFa{}\) makes use of interpolating functions, \(f_p\) and \(g_p\), constructed from a small number of evaluations of the expected rate at fixed values of the parameter \(\alpha\) [3]. For the remaining modifiers, the parameter directly affects the rate.
Description | Modification | Constraint Term \(c_\singleconstr\) | Input |
---|---|---|---|
Uncorrelated Shape | \(\kappa_{scb}(\gamma_b) = \gamma_b\) | \(\prod_b \mathrm{Pois}\left(r_b = \sigma_b^{-2}\middle|\,\rho_b = \sigma_b^{-2}\gamma_b\right)\) | \(\sigma_{b}\) |
Correlated Shape | \(\Delta_{scb}(\alpha) = f_p\left(\alpha\middle|\,\Delta_{scb,\alpha=-1},\Delta_{scb,\alpha = 1}\right)\) | \(\displaystyle\mathrm{Gaus}\left(a = 0\middle|\,\alpha,\sigma = 1\right)\) | \(\Delta_{scb,\alpha=\pm1}\) |
Normalisation Unc. | \(\kappa_{scb}(\alpha) = g_p\left(\alpha\middle|\,\kappa_{scb,\alpha=-1},\kappa_{scb,\alpha=1}\right)\) | \(\displaystyle\mathrm{Gaus}\left(a = 0\middle|\,\alpha,\sigma = 1\right)\) | \(\kappa_{scb,\alpha=\pm1}\) |
MC Stat. Uncertainty | \(\kappa_{scb}(\gamma_b) = \gamma_b\) | \(\prod_b \mathrm{Gaus}\left(a_{\gamma_b} = 1\middle|\,\gamma_b,\delta_b\right)\) | \(\delta_b^2 = \sum_s\delta^2_{sb}\) |
Luminosity | \(\kappa_{scb}(\lambda) = \lambda\) | \(\displaystyle\mathrm{Gaus}\left(l = \lambda_0\middle|\,\lambda,\sigma_\lambda\right)\) | \(\lambda_0,\sigma_\lambda\) |
Normalisation | \(\kappa_{scb}(\mu_b) = \mu_b\) | | |
Data-driven Shape | \(\kappa_{scb}(\gamma_b) = \gamma_b\) | | |
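The interpolating functions \(f_p\) and \(g_p\) that appear in the table are not spelled out in this section. As an illustration only, the sketch below uses two simple choices: piecewise-linear interpolation for the additive shape modification and piecewise-exponential interpolation for the multiplicative normalisation factor. Both choices, and the function names, are assumptions for this example; an implementation may use other interpolation schemes.

```python
def f_linear(alpha, delta_down, delta_up):
    """Additive shift: piecewise-linear in alpha, equal to 0 at alpha = 0.

    delta_down and delta_up are the shifts evaluated at alpha = -1 and +1.
    """
    if alpha >= 0:
        return alpha * delta_up
    return -alpha * delta_down


def g_exponential(alpha, kappa_down, kappa_up):
    """Multiplicative factor: piecewise-exponential in alpha, equal to 1 at alpha = 0.

    kappa_down and kappa_up are the factors evaluated at alpha = -1 and +1.
    """
    if alpha >= 0:
        return kappa_up ** alpha
    return kappa_down ** (-alpha)


# Halfway towards the alpha = +1 shift of an additive modification.
half_shift = f_linear(0.5, -2.0, 3.0)
# At alpha = -1 the factor reproduces the recorded "down" value.
down_factor = g_exponential(-1.0, 0.9, 1.1)
print(half_shift, down_factor)
```

Both functions pass through the recorded values at \(\alpha = \pm 1\) and reduce to the nominal setting at \(\alpha = 0\), which is the property the constraint term \(\mathrm{Gaus}(a = 0|\,\alpha, \sigma = 1)\) relies on.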
Given the likelihood \(\mathcal{L}(\fullset)\), constructed from observed data in all channels and the implied auxiliary data, measurements in the form of point and interval estimates can be defined. The majority of the parameters are nuisance parameters, i.e. parameters that are not the main target of the measurement but are necessary to correctly model the data. A small subset of the unconstrained parameters may be declared as parameters of interest, for which measurements and hypothesis tests are performed, e.g. using profile likelihood methods [intro-1]. The Symbol Notation table provides a summary of all the notation introduced in this documentation.
Symbol | Name |
---|---|
\(f(\bm{x} | \fullset)\) | model |
\(\mathcal{L}(\fullset)\) | likelihood |
\(\bm{x} = \{\channelcounts, \auxdata\}\) | full dataset (including auxiliary data) |
\(\channelcounts\) | channel data (or event counts) |
\(\auxdata\) | auxiliary data |
\(\nu(\fullset)\) | calculated event rates |
\(\fullset = \{\freeset, \constrset\} = \{\poiset, \nuisset\}\) | all parameters |
\(\freeset\) | free parameters |
\(\constrset\) | constrained parameters |
\(\poiset\) | parameters of interest |
\(\nuisset\) | nuisance parameters |
\(\bm{\kappa}(\fullset)\) | multiplicative rate modifier |
\(\bm{\Delta}(\fullset)\) | additive rate modifier |
\(c_\singleconstr(a_\singleconstr | \singleconstr)\) | constraint term for constrained parameter \(\singleconstr\) |
\(\sigma_\singleconstr\) | relative uncertainty in the constrained parameter |
Declarative Formats
While flexible enough to describe a wide range of LHC measurements, the design of the \(\HiFa{}\) specification is sufficiently simple to admit a declarative format that fully encodes the statistical model of the analysis. This format defines the channels, all associated samples, their parameterised rate modifiers and implied constraint terms as well as the measurements. Additionally, the format represents the mathematical model, leaving the implementation of the likelihood minimisation to be analysis-dependent and/or language-dependent. Originally XML was chosen as a specification language to define the structure of the model while introducing a dependence on \(\Root{}\) to encode the nominal rates and required input data of the constraint terms [intro-2]. Using this specification, a model can be constructed and evaluated within the \(\RooFit{}\) framework.
This package introduces an updated form of the specification based on the ubiquitous plain-text JSON format and its schema language, JSON Schema. Described in more detail in Likelihood Specification, this schema fully specifies both the structure and the necessary constraint data in a single document and is thus implementation independent.
Additional Material
Footnotes
Bibliography
[intro-1] Glen Cowan, Kyle Cranmer, Eilam Gross, and Ofer Vitells. Asymptotic formulae for likelihood-based tests of new physics. Eur. Phys. J. C, 71:1554, 2011. arXiv:1007.1727, doi:10.1140/epjc/s10052-011-1554-0.
[intro-2] Kyle Cranmer, George Lewis, Lorenzo Moneta, Akira Shibata, and Wouter Verkerke. HistFactory: A tool for creating statistical models for use with RooFit and RooStats. Technical Report CERN-OPEN-2012-016, New York U., New York, Jan 2012. URL: https://cds.cern.ch/record/1456844.
[intro-3] Eamonn Maguire, Lukas Heinrich, and Graeme Watt. HEPData: a repository for high energy physics data. J. Phys. Conf. Ser., 898(10):102006, 2017. arXiv:1704.05473, doi:10.1088/1742-6596/898/10/102006.
[intro-4] ATLAS Collaboration. Measurements of Higgs boson production and couplings in diboson final states with the ATLAS detector at the LHC. Phys. Lett. B, 726:88, 2013. arXiv:1307.1427, doi:10.1016/j.physletb.2014.05.011.
[intro-5] ATLAS Collaboration. Search for supersymmetry in final states with missing transverse momentum and multiple \(b\)-jets in proton–proton collisions at \(\sqrt s = 13\) \(\TeV \) with the ATLAS detector. ATLAS-CONF-2018-041, 2018. URL: https://cds.cern.ch/record/2632347.
Likelihood Specification
The structure of the JSON specification of models follows closely the original XML-based specification [likelihood-2].
Workspace
{
"$schema": "http://json-schema.org/draft-06/schema#",
"$id": "https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json",
"$ref": "defs.json#/definitions/workspace"
}
The overall document in the above code snippet describes a workspace, which includes:
channels: The channels in the model, which include a description of the samples within each channel and their possible parametrised modifiers.
measurements: A set of measurements, which define among others the parameters of interest for a given statistical analysis objective.
observations: The observed data, with which a likelihood can be constructed from the model.
A workspace consists of the channels and one set of observed data, but can include multiple measurements. Given a JSON file, one can quickly check that it conforms to the provided workspace specification as follows:
import json

import jsonschema
import requests

with open("/path/to/analysis_workspace.json", encoding="utf-8") as ws_file:
    workspace = json.load(ws_file)

# Download the workspace schema; if no exception is raised, it was found and parsed.
schema = requests.get("https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json").json()
# If no exception is raised by validate(), the instance is valid.
jsonschema.validate(instance=workspace, schema=schema)
Channel
A channel is defined by a channel name and a list of samples [likelihood-1].
{
"channel": {
"type": "object",
"properties": {
"name": { "type": "string" },
"samples": { "type": "array", "items": {"$ref": "#/definitions/sample"}, "minItems": 1 }
},
"required": ["name", "samples"],
"additionalProperties": false
}
}
The Channel specification consists of a list of channel descriptions. Each channel, an analysis region encompassing one or more measurement bins, consists of a name field and a samples field (see Channel), which holds a list of sample definitions (see Sample). Each sample definition in turn has a name field, a data field for the nominal event rates for all bins in the channel, and a modifiers field holding the list of modifiers for the sample.
Sample
A sample is defined by a sample name, the sample event rate, and a list of modifiers [likelihood-1].
{
"sample": {
"type": "object",
"properties": {
"name": { "type": "string" },
"data": { "type": "array", "items": {"type": "number"}, "minItems": 1 },
"modifiers": {
"type": "array",
"items": {
"anyOf": [
{ "$ref": "#/definitions/modifier/histosys" },
{ "$ref": "#/definitions/modifier/lumi" },
{ "$ref": "#/definitions/modifier/normfactor" },
{ "$ref": "#/definitions/modifier/normsys" },
{ "$ref": "#/definitions/modifier/shapefactor" },
{ "$ref": "#/definitions/modifier/shapesys" },
{ "$ref": "#/definitions/modifier/staterror" }
]
}
}
},
"required": ["name", "data", "modifiers"],
"additionalProperties": false
}
}
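For a quick check without a schema validator, the required fields of the channel and sample definitions can be tested directly. The sketch below mirrors the `required` lists and the modifier types named in the two schema fragments above; the function name `check_channel` is hypothetical, and the check is far less complete than validating against the schema itself.

```python
# Modifier types referenced by the sample definition.
MODIFIER_TYPES = {
    "histosys", "lumi", "normfactor", "normsys",
    "shapefactor", "shapesys", "staterror",
}


def check_channel(channel):
    """Return a list of problems found in a channel definition (empty if none)."""
    problems = []
    for key in ("name", "samples"):
        if key not in channel:
            problems.append(f"channel is missing '{key}'")
    for sample in channel.get("samples", []):
        for key in ("name", "data", "modifiers"):
            if key not in sample:
                problems.append(f"sample is missing '{key}'")
        for modifier in sample.get("modifiers", []):
            if modifier.get("type") not in MODIFIER_TYPES:
                problems.append(f"unknown modifier type {modifier.get('type')!r}")
    return problems


channel = {
    "name": "singlechannel",
    "samples": [
        {"name": "signal", "data": [5.0, 10.0],
         "modifiers": [{"name": "mu", "type": "normfactor", "data": None}]},
        # "modifiers" is omitted on purpose to trigger a report
        {"name": "background", "data": [50.0, 60.0]},
    ],
}
problems = check_channel(channel)
print(problems)
```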
Modifiers
The modifiers that are applicable for a given sample are encoded as a list of JSON objects with three fields: a name field, a type field denoting the class of the modifier, and a data field which provides the necessary input data as denoted in Modifiers and Constraints.
Based on the declared modifiers, the set of parameters and their constraint terms are derived implicitly as each type of modifier unambiguously defines the constraint terms it requires. Correlated shape modifiers and normalisation uncertainties have compatible constraint terms and thus modifiers can be declared that share parameters by re-using a name [1] for multiple modifiers. That is, a variation of a single parameter causes a shift within sample rates due to both shape and normalisation variations.
We review the structure of each modifier type below.
Normalisation Uncertainty (normsys)
The normalisation uncertainty modifies the sample rate by an overall factor \(\kappa(\alpha)\) constructed as the interpolation between the downward (“lo”) and upward (“hi”) variations as well as the nominal setting, i.e. \(\kappa(-1) = \kappa_{\alpha=-1}\), \(\kappa(0) = 1\) and \(\kappa(+1) = \kappa_{\alpha=+1}\). In the modifier definition we record \(\kappa_{\alpha=+1}\) and \(\kappa_{\alpha=-1}\) as floats. An example of a normalisation uncertainty modifier with scale factors recorded for the up/down variations of an \(n\)-bin channel is shown below:
{ "name": "mod_name", "type": "normsys", "data": {"hi": 1.1, "lo": 0.9} }
MC Statistical Uncertainty (staterror)
As the sample counts are often derived from Monte Carlo (MC) datasets, they necessarily carry an uncertainty due to the finite sample size of the datasets. As explained in detail in [likelihood-2], adding uncertainties for each sample would yield a very large number of nuisance parameters with limited utility. Therefore a set of bin-wise scale factors \(\gamma_{cb}\) is introduced to model the overall uncertainty in the bin due to MC statistics. The constraint term is constructed as a set of constraints with a central value equal to unity, e.g. \(\mathrm{Gaus} (\mu = 1, \sigma_{cb})\), for each bin in the channel. The scales \(\sigma_{cb}\) of the constraints are computed from the individual uncertainties of samples defined within the channel relative to the total event rate of all samples: \(\sigma_{cb} = \sqrt{\sum_s\delta_{csb}^2}/\sum_s \nu^0_{csb}\), where \(\delta_{csb}\) is the absolute yield uncertainty of sample \(s\) in each bin.
As not all samples within a channel are estimated from MC simulations, only the samples with a declared statistical uncertainty modifier enter the sum. An example of a statistical uncertainty modifier for a single-bin channel is shown below:
{ "name": "mod_name", "type": "staterror", "data": [0.1] }
Warning
For bins in the model where:
the sample's nominal expected rate is zero, or
the scale factor is zero,
nuisance parameters will be allocated, but will be fixed to 1 in the calculation (as staterror is a multiplicative modifier this results in multiplying by 1).
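To make the definition of \(\sigma_{cb}\) for the staterror modifier concrete, the sketch below combines the absolute per-sample uncertainties in quadrature, consistent with \(\delta_b^2 = \sum_s\delta^2_{sb}\) in Modifiers and Constraints, and divides by the total nominal rate. The function name and the input layout are assumptions for this example, as is the choice to restrict both the numerator and the denominator to samples that declare a staterror modifier.

```python
from math import sqrt


def staterror_sigmas(samples):
    """Relative MC statistical uncertainty per bin for one channel.

    Each (hypothetical) sample dict holds "nominal" rates and, if the sample
    declares a staterror modifier, absolute per-bin uncertainties under
    "staterror". Samples without that key are skipped (assumed here for both
    the numerator and the denominator).
    """
    with_stat = [s for s in samples if "staterror" in s]
    n_bins = len(with_stat[0]["nominal"])
    sigmas = []
    for b in range(n_bins):
        quad_sum = sqrt(sum(s["staterror"][b] ** 2 for s in with_stat))
        total = sum(s["nominal"][b] for s in with_stat)
        sigmas.append(quad_sum / total)
    return sigmas


# Illustrative single-bin channel with two MC samples and one other sample.
samples = [
    {"nominal": [30.0], "staterror": [3.0]},
    {"nominal": [20.0], "staterror": [4.0]},
    {"nominal": [10.0]},  # e.g. a data-driven sample without staterror
]
sigmas = staterror_sigmas(samples)
print(sigmas)
```

Bins whose total nominal rate is zero are not handled here; they would need the special treatment described in the warning above.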
Luminosity (lumi)
Sample rates derived from theory calculations, as opposed to data-driven estimates, are scaled to the integrated luminosity corresponding to the observed data. As the luminosity measurement is itself subject to an uncertainty, it must be reflected in the rate estimates of such samples. As this modifier is of global nature, no additional per-sample information is required and thus the data field is set to null. This uncertainty is relevant, in particular, when the parameter of interest is a signal cross-section. The luminosity uncertainty \(\sigma_\lambda\) is provided as part of the parameter configuration included in the measurement specification discussed in Measurements. An example of a luminosity modifier is shown below:
{ "name": "mod_name", "type": "lumi", "data": null }
Unconstrained Normalisation (normfactor)
The unconstrained normalisation modifier scales the event rates of a sample by a free parameter \(\mu\). Common use cases are the signal rate of a possible BSM signal or simultaneous in-situ measurements of background samples. Such parameters are frequently the parameters of interest of a given measurement. No additional per-sample data is required. An example of a normalisation modifier is shown below:
{ "name": "mod_name", "type": "normfactor", "data": null }
Data-driven Shape (shapefactor)
In order to support data-driven estimation of sample rates (e.g. for multijet backgrounds), the data-driven shape modifier adds free, bin-wise multiplicative parameters. Similarly to the normalisation factors, no additional data is required as no constraint is defined. An example of a data-driven shape modifier is shown below:
{ "name": "mod_name", "type": "shapefactor", "data": null }
Data
The data provided by the analysis are the observed data for each channel (or region). This data is provided as a mapping from channel name to an array of floats, which provide the observed rates in each bin of the channel. The auxiliary data is not included as it is an input to the likelihood that does not need to be archived and can be determined automatically from the specification. An example of channel data is shown below:
{ "chan_name_one": [10, 20], "chan_name_two": [4, 0]}
Measurements
Given the data and the model definitions, a measurement can be defined. In the current schema, a measurement defines the name of the parameter of interest as well as parameter set configurations. [2] Here, the remaining information not covered through the channel definition is provided, e.g. for the luminosity parameter. For all modifiers, the default settings can be overridden where possible:
inits: Initial value of the parameter.
bounds: Interval bounds of the parameter.
auxdata: Auxiliary data for the associated constraint term.
sigmas: Associated uncertainty of the parameter.
fixed: Whether the parameter is held constant.
An example of a measurement is shown below:
{
"name": "MyMeasurement",
"config": {
"poi": "SignalCrossSection", "parameters": [
{ "name":"lumi", "auxdata":[1.0],"sigmas":[0.017], "bounds":[[0.915,1.085]],"inits":[1.0] },
{ "name":"mu_ttbar", "bounds":[[0, 5]] },
{ "name":"rw_1CR", "fixed":true }
]
}
}
This measurement, which scans over the parameter of interest SignalCrossSection, is setting configurations for the luminosity modifier, changing the default bounds for the normfactor modifier named mu_ttbar, and specifying that the modifier rw_1CR is held constant (fixed).
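Reading such a measurement back is straightforward. The sketch below extracts the parameter of interest and collects the per-parameter overrides into a dictionary keyed by parameter name. The helper name `parameter_overrides` is hypothetical, and how unspecified settings are defaulted is left to the implementation.

```python
import json

measurement = json.loads("""
{
  "name": "MyMeasurement",
  "config": {
    "poi": "SignalCrossSection",
    "parameters": [
      {"name": "lumi", "auxdata": [1.0], "sigmas": [0.017],
       "bounds": [[0.915, 1.085]], "inits": [1.0]},
      {"name": "mu_ttbar", "bounds": [[0, 5]]},
      {"name": "rw_1CR", "fixed": true}
    ]
  }
}
""")


def parameter_overrides(measurement):
    """Map each configured parameter name to the settings it overrides."""
    overrides = {}
    for par in measurement["config"]["parameters"]:
        # everything except the name is treated as an override
        settings = {key: value for key, value in par.items() if key != "name"}
        overrides[par["name"]] = settings
    return overrides


poi = measurement["config"]["poi"]
overrides = parameter_overrides(measurement)
print(poi, sorted(overrides))
```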
Observations
The observations are the data against which hypothesis tests are evaluated, for example to determine the compatibility of the signal+background hypothesis with the background-only hypothesis. They are specified as a list of objects, with each object structured as
name: the channel for which the observations are recorded
data: the bin-by-bin observations for the named channel
An example of an observation for a 2-bin channel channel1, with values 110.0 and 120.0, is shown below:
{
"name": "channel1", "data": [110.0, 120.0]
}
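Since each observation names its channel, the list form can be turned into the channel-name-to-counts mapping shown in the Data section with a few lines. This is only a sketch: the function name is hypothetical, the second channel is invented for illustration, and rejecting duplicate channel names is an assumption consistent with a workspace holding one set of observed data.

```python
def observations_to_mapping(observations):
    """Convert a list of {"name", "data"} objects into {channel: counts}."""
    mapping = {}
    for obs in observations:
        if obs["name"] in mapping:
            raise ValueError(f"duplicate observation for channel {obs['name']!r}")
        mapping[obs["name"]] = list(obs["data"])
    return mapping


observations = [
    {"name": "channel1", "data": [110.0, 120.0]},
    {"name": "channel2", "data": [4.0, 0.0]},  # hypothetical second channel
]
data_by_channel = observations_to_mapping(observations)
print(data_by_channel)
```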
Toy Example
{
"channels": [
{ "name": "singlechannel",
"samples": [
{ "name": "signal",
"data": [5.0, 10.0],
"modifiers": [ { "name": "mu", "type": "normfactor", "data": null} ]
},
{ "name": "background",
"data": [50.0, 60.0],
"modifiers": [ {"name": "uncorr_bkguncrt", "type": "shapesys", "data": [5.0, 12.0]} ]
}
]
}
],
"observations": [
{ "name": "singlechannel", "data": [50.0, 60.0] }
],
"measurements": [
{ "name": "Measurement", "config": {"poi": "mu", "parameters": []} }
],
"version": "1.0.0"
}
In the above example, we demonstrate a simple measurement of a single two-bin channel with two samples: a signal sample and a background sample. The signal sample has an unconstrained normalisation factor \(\mu\), while the background sample carries an uncorrelated shape systematic controlled by parameters \(\gamma_1\) and \(\gamma_2\). The background uncertainty for the bins is 10% and 20% respectively.
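To connect the toy workspace to the model structure, the sketch below evaluates the log-likelihood by hand for given values of \(\mu\), \(\gamma_1\) and \(\gamma_2\). It assumes the Poisson constraint listed for the uncorrelated shape modifier in Modifiers and Constraints, with \(\sigma_b\) taken as the shapesys data divided by the nominal background rate. The function names are hypothetical, and continuing the Poisson probability to real-valued counts via `lgamma` is a choice made here so that non-integer auxiliary data can be handled.

```python
from math import lgamma, log

signal = [5.0, 10.0]
background = [50.0, 60.0]
bkg_uncertainty = [5.0, 12.0]  # absolute, from the shapesys "data" field
observed = [50.0, 60.0]


def log_poisson(n, rate):
    """Log of the Poisson probability, continued to real-valued n."""
    return n * log(rate) - rate - lgamma(n + 1.0)


def toy_log_likelihood(mu, gammas):
    """Main Poisson terms plus the Poisson constraint terms for each bin."""
    total = 0.0
    for b in range(len(observed)):
        expected = mu * signal[b] + gammas[b] * background[b]
        total += log_poisson(observed[b], expected)
        # Auxiliary measurement: r_b = sigma_b^-2 with rate sigma_b^-2 * gamma_b
        tau = (background[b] / bkg_uncertainty[b]) ** 2
        total += log_poisson(tau, tau * gammas[b])
    return total


background_only = toy_log_likelihood(0.0, [1.0, 1.0])
with_signal = toy_log_likelihood(1.0, [1.0, 1.0])
print(background_only, with_signal)
```

Because the observed counts in the toy workspace equal the nominal background rates, the background-only point has the larger log-likelihood of the two.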
Additional Material
Footnotes
[1] The name of a modifier specifies the parameter set it is controlled by. Modifiers with the same name share parameter sets.
[2] In this context a parameter set corresponds to a named lower-dimensional subspace of the full parameters \(\fullset\). In many cases these are one-dimensional subspaces, e.g. a specific interpolation parameter \(\alpha\) or the luminosity parameter \(\lambda\). For multi-bin channels, however, all bin-wise nuisance parameters of, e.g., the uncorrelated shape modifiers are grouped under a single name. Therefore in general a parameter set definition provides arrays of initial values, bounds, etc.
Bibliography
[likelihood-1] Histfactory definitions schema. Accessed: 2019-06-20. URL: https://scikit-hep.org/pyhf/schemas/1.0.0/defs.json.
[likelihood-2] Kyle Cranmer, George Lewis, Lorenzo Moneta, Akira Shibata, and Wouter Verkerke. HistFactory: A tool for creating statistical models for use with RooFit and RooStats. Technical Report CERN-OPEN-2012-016, New York U., New York, Jan 2012. URL: https://cds.cern.ch/record/1456844.