The LHC Olympics Black Box Page ----- Winter 2006 Version
Postscript: the Winter Olympics were by
all accounts a rousing success:
- Great pedagogical lectures by Tao Han and Michelangelo
Mangano
- Two solutions to the Washington black box
- A solution to the Michigan black box (see also
last summer's solution.)
- Two partial solutions to the Harvard black box
- Opening of the three black boxes
- Additional presentations on model independent analysis,
on involving students in the Olympics, and on the detector simulation
program PGS
- Discussion of plans to continue the Olympics workshops,
with the next workshop involving additional pedagogical lectures and
black boxes that will most likely include
- an easy signal
suitable for beginners
- a more challenging
signal for those who feel ready
- an easy signal inside
of a moderately realistic standard model background --- the next level
of difficulty
Stay Tuned: The Next
Olympics will
be August 24th-25th at the KITP in Santa Barbara.
This page is now retired; all
additional updates can be found here.
WARNING
Much of the information on this page about the black boxes is now
obsolete.
All of the information that is not obsolete is now located on the new
Wiki,
which you can access from here.
Welcome!! At
this site you will find the new black boxes and
calibration samples
for the LHC Olympics!
The goal of the LHC Olympics data challenge is for participants to
try to figure
out what is in each black box --- what underlying model has generated
this data?
Why participate? There
is no prize for winning --- this is not a competitive exercise! [well,
not particularly competitive anyway.] Instead,
this is supposed to
be fun and instructive at the same time! The black box analyses are only a means
to an end.... The real goals here
include individual and community preparation for the LHC era, and the
invention of new tools and techniques which will be valuable for
experimentalists and theorists when the real data starts to
arrive. Indeed,
we hope
there will be many "winners", with different approaches to the data.
For the February 2006 round of the LHC Olympics (to be held at
CERN February 9th and 10th, 2006),
each black box
contains simulated data that would be generated at LHC by a new
physics model, processed through a simulation of
an LHC-like detector; note the black
boxes contain no standard
model
backgrounds. For each black box, we have provided
- plots and
tables that are drawn from
the data, such as might be presented at LHC conferences which theorists
would attend, and
- the data files
listing the objects in all the events, such as the experimentalists
will often use when doing a rough data analysis
Please Note: Suggested Guidelines for Participants in the LHC OLYMPICS
So
that the LHC Olympics can be a useful and enjoyable exercise for people
with a wide variety of backgrounds, we suggest that no public
announcements of "solutions" to the data challenge be made available in
advance of our February conference --- it would spoil the fun!
However, other forms of communication, such as
- private consultation between participant groups, involving the
sharing of software, discussion of strategies for approaching the data,
etc.,
- unpublished presentations
after
the February conference, such as websites, or publicly available
powerpoint files or written reports,
- publishable papers on new tools and new approaches that have been
motivated by a black box analysis,
are fine and indeed encouraged, as they will contribute to the success
of the LHC Olympics effort.
Thanks!!! [from the organizers]
Here
they are:
NOTE
THE FORMAT FOR THESE OLD FILES DIFFERS FROM THOSE OF THE THIRD OLYMPICS
AND BEYOND!!!
The black boxes and calibration
samples are in the form of large data files. Each has its own
website where you can learn more about it, and where you can look at
plots of some distributions and tables of some inclusive
signatures. You can also download the data and do some analysis
yourself.
Below you can find general
information that applies to all of the black boxes --- or more
specific information on how to use the
plots and tables that the
black box creators
have provided,
how to interpret and use the data
files, more details
on
how the data files were generated, and features/issues
with the detector
simulation.
I. Black box
"classic", which contains 20
[Not! see below] inverse
femtobarns of data generated with the
same [Not! see below] physics
model as was used in the summer 2005 data challenge, which
had only
2 inverse femtobarns of data. (See the important WARNING
below.)
[See the new warning, as of 2/04/06, on the following link.] The black box raw data
files, plots and tables extracted from the data,
and creators' comments can be found HERE.
II. Black box "uw1",
which contains 25
inverse femtobarns of data of a new
signal. The black box raw data files, plots and tables extracted
from the data, and creators' comments can be found HERE.
III. Black box
"harvardbb", which contains two
sets of files, one with 5
inverse femtobarns of data of a new
signal, and one with 40 inverse femtobarns of the same signal.
The black box raw data files, plots and tables extracted from the data,
and creators' comments can be accessed
HERE.
The calibration samples are
A. a pure ttbar sample.
B. a diboson sample (WW, WZ, and ZZ
production.)
Comments on how to use calibration samples are HERE.
The Plots and Tables
---
The creators of the black boxes have provided some information about
"inclusive signatures", including plots of "kinematic
distributions" and tables that show the numbers of events with certain
characteristics.
What is an inclusive signature?
Inclusive signatures are basically anything that can be
measured from the full aggregate data (including all types of events)
generated through a new physics model. This is to be understood
as opposed to what one might
like to measure (masses of
individual new particles, branching fractions of individual new
particles) but which may not be possible to extract experimentally,
particularly at a hadron collider. For example, it is relatively
easy to measure the mass of a resonance, such as a Z' that decays
to a pair of jets or a pair of Z bosons. But in models, such as
supersymmetry, where the decay products of most new particles include
an invisible particle that escapes the
detector, such mass measurements are not in general possible. In
such cases, for example, events in which gluinos are produced cannot
cleanly be separated from events where squarks are produced; it is only
possible to add all these events together and inclusively consider the
resulting signals.
A kinematic distribution, in
its simplest form, is simply a plot of the number of objects, or
object-pairs, or events, as a function of a simple kinematic
variable.
For instance, one might want to know the number of events containing an
electron with transverse momentum equal to p_T. (Why?
Because it gives a measure of how much energy is being dumped, by a new
physics process, into its decay products, among which are the electrons
which the detector is seeing.) So do exactly that --- plot the
number
of events as a function of p_T. Of course, to make the plot, one
has to make some decisions (and to interpret the plot, you have to know
what its creator decided.) When
you plot the number of events versus the p_T of all electrons, do you
include events with one and
only one electron? or events with two electrons? do you include
events with both an electron and a muon? What angular range can
the detector measure electrons in? Should you trim
the edges of this angular range to avoid areas where the detector might
make mistakes in its momentum measurements? etc.
These subtleties can be very
important for your interpretation of the plots, and you should read the
captions closely.
A classic example of a simple kinematic distribution is a plot of the
number of events with a muon and an antimuon versus the invariant mass
of the muon-antimuon pair. A plot of this distribution in the
standard model will show a huge peak near the mass of the Z boson, as
well as a peak at the J/Psi charmonium resonance. A key question
to ask of a signal is whether there are Z bosons being generated either
in association with or in the decays of whatever new particles are
being produced --- and a simple way to answer it is to make this
plot. (However, you should keep in mind that in the real LHC the
standard model will infect all signals, so you'll always see some Zs;
the question of determining whether the Zs are all from standard model
processes or whether some come from new physics processes will be a
very challenging one.) Other important questions can be answered
using this plot; see for example Hinchliffe and Paige's or Baer and
Tata's approach to
supersymmetric models (and note that their methods don't require
supersymmetry and are much more widely applicable.)
Other classic distributions include: the number of events versus the
missing transverse energy in the events, or versus the sum of the
magnitudes of the transverse momenta of all high-momentum objects in the
event, or versus the number of b-tagged jets in the events.
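As a rough illustration of how such distributions are built, the Python sketch below bins a list of values into a histogram and computes the scalar sum of transverse momenta for one event. The function names, the 20 GeV threshold standing in for "high-momentum", and the binning are all illustrative assumptions, not choices made by the black box creators.

```python
def histogram(values, low, high, nbins):
    """Count how many values fall into each of nbins equal-width bins on [low, high)."""
    counts = [0] * nbins
    width = (high - low) / nbins
    for value in values:
        if low <= value < high:
            # min() guards against rounding pushing a value past the last bin.
            counts[min(int((value - low) / width), nbins - 1)] += 1
    return counts


def scalar_sum_pt(pts, min_pt=20.0):
    """Sum the transverse momenta of objects at or above min_pt.

    The 20 GeV default is an assumed definition of "high-momentum".
    """
    return sum(pt for pt in pts if pt >= min_pt)


# Transverse momenta (GeV) of the visible objects in one illustrative event.
event_pts = [24.94, 130.99, 82.75, 78.72, 13.85]
print(scalar_sum_pt(event_pts))

# Ten bins of 10 GeV each; values outside [0, 100) are ignored.
print(histogram([5.0, 15.0, 15.0, 99.9, 100.0, -1.0], 0.0, 100.0, 10))
```

Whatever cuts and bin edges you choose here are exactly the kind of decisions that a plot caption needs to state.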
Tables that show numbers of events of
a certain class are even
simpler. For example,
how many events in the sample contain two positively charged muons and
no other leptons? How many contain two positively charged muons
and at least three jets? How many contain two positively charged
muons each of which has at least 100 GeV of transverse momentum?
etc. Tremendous amounts of information lurk in these basic
numbers... if you can find it.
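The counting questions above can be sketched in a few lines of Python. This is only an illustration: it assumes each event has already been reduced to a list of (type code, transverse momentum, mass-or-charge) tuples, uses the object type codes described in the data file section of this page, and treats hadronically-decaying taus as "other leptons", which is a choice you may not want to make.

```python
ELECTRON, MUON, TAU, JET = 1, 2, 3, 4  # type codes used in the data files


def has_two_positive_muons(event, min_pt=0.0, min_jets=0):
    """event: list of (type_code, pt, mass_or_charge) tuples for one event.

    True if the event has exactly two leptons, both positively charged muons
    with pt >= min_pt, and at least min_jets jets.
    """
    leptons = [obj for obj in event if obj[0] in (ELECTRON, MUON, TAU)]
    good_muons = [obj for obj in leptons
                  if obj[0] == MUON and obj[2] > 0 and obj[1] >= min_pt]
    n_jets = sum(1 for obj in event if obj[0] == JET)
    return len(leptons) == 2 and len(good_muons) == 2 and n_jets >= min_jets


def count_events(events, **cuts):
    """Number of events passing has_two_positive_muons with the given cuts."""
    return sum(1 for event in events if has_two_positive_muons(event, **cuts))


# Hypothetical events, for illustration only.
events = [
    [(MUON, 120.0, 1.0), (MUON, 150.0, 1.0), (JET, 80.0, 12.0)],
    [(MUON, 120.0, 1.0), (MUON, 50.0, -1.0)],
    [(MUON, 120.0, 1.0), (MUON, 50.0, 1.0), (ELECTRON, 30.0, -1.0)],
]
print(count_events(events))                # two positive muons, no other leptons
print(count_events(events, min_pt=100.0))  # each muon with at least 100 GeV
print(count_events(events, min_jets=3))    # at least three jets
```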
However, one must be very careful! The detector does not detect
electrons, muons and taus with the same "efficiency" (because of
detector response to these particles and because of the isolation cuts
imposed, which will differ.) Also, taus that decay leptonically
count as electrons and muons, though the electrons and muons in the
decay have somewhat lower energy than the parent tau (which affects the
question we asked above: How many events contain two positively
charged muons each of which has at least 100 GeV of transverse
momentum?) The number of jets depends on how you define a "jet". The number of b-tagged jets will depend
on how b-tagging works. And for any of
these tables, the number of anything you'd like to count depends on how
the detector selects events to store on file --- "triggering". This sounds very
complicated; what's a naive theorist to do?
The key, of course, is to ask questions where these complications
cancel. Certain ratios are less sensitive than others to these
details, for instance. Correlations between the answers to
certain questions may not be sensitive to these details. Learning
how to ask the right questions, questions whose answers have content
and small systematic uncertainties, is part of learning how to interpret
the data from a hadronic collider such as the LHC.
By looking at the plots and tables, you should be able to extract a lot
of information about the physics generating the signal. You may
find, however, that you want more information than we've
provided. At this point you may just want to stop and wait for
the next LHC Olympics conference, or you may want to find a friend who is
willing to "play experimentalist", study the data files, and provide
you with the additional information that you'd like to
have. Collaborations of formal theorists and either
phenomenologists or experimentalists are very effective in this
regard! Or you can try to play with the data
files yourself. You may wish to write your own software to do
this, or you may wish to learn to use ROOT,
or you might want to try a user-friendly
software package, especially designed by the Harvard group for
black boxes (though please note
it has not yet been fully vetted by the LHC Olympics committee and
should be used with appropriate caution.)
The Data Files ---
How to read them:
WARNING: THIS APPLIES ONLY TO THE DATA FILES FOR THE FIRST AND SECOND
OLYMPICS!!! SOME IMPORTANT DETAILS HAVE CHANGED. ALL
CORRECT INFORMATION HAS BEEN REPRODUCED IN THE NEW WIKI, WHICH YOU CAN
ACCESS FROM HERE.
The data files
in the black box and calibration samples are ordinary text files with
rows of numbers; you can just read them by eye, without any conversion
software. Interpreting them by eye is also
straightforward. Here's how the files work.
The files consist of a long list of "events", individual proton-proton
collisions that have generated a spray of particles inside the
detector. Each "event" represents
the detector's output (very crudely, and considerably processed) in a
particular proton-proton collision that happened to produce something
sufficiently interesting that it merited storing permanently. [How does the
detector decide what is "sufficiently interesting"? This is the
crucial issue of triggering!]
Each event consists of a set of rows in the data file. Each row
corresponds to an "object" [a lepton, photon, jet, or missing
transverse
momentum.]
- The first column of each row is just a counter that labels the
object. When
the label reverts to "1", this indicates the previous event is complete
and a new event is being listed.
- The second column of each row gives the type of object being
listed [0,1,2,3,4,6 = photon, electron, muon, hadronically-decaying
tau, jet, missing transverse
energy].
- The next three columns give the pseudorapidity,
the azimuthal
angle, and the transverse momentum of the object.
- The next column gives the invariant mass
of the object, if it is
a jet, or its charge, if the object is not a jet.
- The next column gives some additional information about the
object listed HERE.
- The final column is 0 unless the object is a jet that has
been "tagged" as probably containing a heavy quark, in which case it is
1. More information about heavy-flavor tagging is HERE.
An example of a top-antitop pair production event, with the top decaying
semileptonically and the antitop decaying hadronically:
1  2  -1.419  2.873   24.94   1.00   0.0  0.0
an isolated muon, positively charged, with 25 GeV of transverse momentum
2  4  -0.804  2.307  130.99  16.14  10.0  1.0
a heavy-flavor jet (presumably a b quark jet) with 131 GeV of transverse
momentum, an invariant mass of 16 GeV, and 10 charged tracks
3  4   1.046  4.245   82.75  14.11   2.0  0.0
an ordinary jet with 83 GeV of transverse momentum,
an invariant mass of 14 GeV, and 2 charged tracks
4  4   1.247  5.996   78.72  13.75  14.0  1.0
a heavy-flavor jet (presumably a b quark jet) with 79 GeV of transverse
momentum, an invariant mass of 14 GeV, and 14 charged tracks
5  4  -2.154  3.884   13.85   5.83   3.0  0.0
an ordinary jet with 14 GeV of transverse momentum, an invariant mass
of 6 GeV, and 3 charged tracks, at a very small angle to the beampipe
6  6   0.000  6.245   92.14   0.00   0.0  0.0
the "missing transverse energy" in the event, 92 GeV, from a combination
of the muon neutrino in the event and possible mismeasurements
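For those who want to read the files with their own software, here is a minimal Python sketch of a reader for the format described above. It assumes one object per line with eight whitespace-separated columns, and it silently skips any line that does not parse that way; the class and function names are made up for illustration, and the real files may contain details not covered here.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Object type codes from the column description above.
OBJECT_TYPES = {0: "photon", 1: "electron", 2: "muon", 3: "tau", 4: "jet", 6: "met"}


@dataclass
class DetectorObject:
    index: int             # column 1: counter within the event
    kind: str              # column 2: object type, decoded
    eta: float             # column 3: pseudorapidity
    phi: float             # column 4: azimuthal angle
    pt: float              # column 5: transverse momentum
    mass_or_charge: float  # column 6: invariant mass for jets, charge otherwise
    extra: float           # column 7: additional information (track counts)
    tag: float             # column 8: 1 if heavy-flavor tagged, else 0


def parse_events(lines: Iterable[str]) -> List[List[DetectorObject]]:
    """Group rows into events; a counter of 1 marks the start of a new event."""
    events: List[List[DetectorObject]] = []
    current: List[DetectorObject] = []
    for line in lines:
        fields = line.split()
        if len(fields) < 8:
            continue  # blank or unexpected line (assumption: safe to skip)
        try:
            index, type_code = int(fields[0]), int(fields[1])
            values = [float(x) for x in fields[2:8]]
        except ValueError:
            continue
        if index == 1 and current:
            events.append(current)
            current = []
        kind = OBJECT_TYPES.get(type_code, "unknown")
        current.append(DetectorObject(index, kind, *values))
    if current:
        events.append(current)
    return events


# The example event shown above, one object per line.
SAMPLE = """
1 2 -1.419 2.873 24.94 1.00 0.0 0.0
2 4 -0.804 2.307 130.99 16.14 10.0 1.0
3 4 1.046 4.245 82.75 14.11 2.0 0.0
4 4 1.247 5.996 78.72 13.75 14.0 1.0
5 4 -2.154 3.884 13.85 5.83 3.0 0.0
6 6 0.000 6.245 92.14 0.00 0.0 0.0
"""

for obj in parse_events(SAMPLE.splitlines())[0]:
    print(obj.kind, obj.pt)
```

For a real file, pass an open file handle instead of `SAMPLE.splitlines()`.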
Photons
A photon is detected as energy in the electromagnetic calorimeter, with
no high-transverse-momentum track, and little energy in the hadronic
calorimeter. Isolation cuts are used to reduce
backgrounds, such as a
pi-zero decaying to photon pairs, or an electron if its charged track
is
missed.
Electrons
An electron is detected as energy in the electromagnetic calorimeter,
with a high-transverse-momentum track pointing toward it, and little
energy in the hadronic
calorimeter. Isolated charged pions
can give electron-like signals, as can photons or neutral pions if a
charged pion happens to point in the same direction and happens not to
leave much energy in the hadronic calorimeter. Electron isolation cuts are used to reduce
backgrounds and remove electrons from heavy-flavor
decays.
Muons
A muon leaves little energy in the calorimeters, has a track, and
travels all the way to the muon-detection system outside the
calorimeters. Muons are rarely faked, though "punchthrough",
where a hadron fails to leave all its energy in the hadronic
calorimeter and punches through to the muon system, giving a fake muon,
can be a problem in some circumstances. Muon isolation
cuts are used to reduce backgrounds and remove muons from heavy-flavor
decays.
Hadronically-Decaying
Taus
Tau leptons decay about 1/3 of the time to either an electron or a muon
plus neutrinos. In this case, they cannot be distinguished from
electrons or muons and appear in the detector as objects of electron
and muon type.
The rest of the time taus decay to quarks plus a neutrino. The
quarks immediately turn into hadrons. Because of the small mass
of the tau, almost all of the tau's decays are to a pair of light
quarks. More precisely, the most common decays of the tau are to
a neutrino plus
- a charged pion
- one charged pion and one or two pi-zeros (each of which decays
to two photons)
- two pions of one charge and a third of the opposite charge
In the first two cases a single charged track, but one that leaves
energy in the hadronic calorimeter and is clearly not an electron, is
the result --- a "1-prong" tau. Any hadronic or
electromagnetic energy is clustered
in a very narrow cone surrounding the charged track (more precise
statements accounting for the curvature of the track are unimportant
here.) In the third case, three tracks result --- a "3-prong"
tau. Thus, what
appears in the detector is a very narrow jet, with invariant mass no
greater than 2 GeV, and with 1 or 3 tracks. Such an object is
unlikely to be an ordinary QCD jet (fake rates are not small however)
and a reasonably large number of taus will look like this (so
efficiencies are not bad.) No one really knows what the fake
rates and efficiencies will actually be at the LHC, though tau
detection is expected to be quite good because of the excellent spatial
resolution of the calorimeters. Detailed detector simulations are
currently underway by the CMS and ATLAS collaborations and should help
clarify this issue.
Jets
Jets are the most common and most problematic objects in hadronic
collider physics. We cannot do proper justice to this extremely
complex problem here. See for examples [we will add
references.] Crudely, jets are defined to be, well, jets of
particles (as measured through tracks and through energy in both
calorimeters) that fit inside a cone (in azimuth and pseudorapidity) of
R=0.7, where R is the sum in quadrature of the differences in
azimuthal
angle and pseudorapidity from the centroid of the
jet. However, this definition is ambiguous as it stands; a precise
algorithm for defining jets, and resolving ambiguities, is needed.
NOTE THE FOLLOWING
INFORMATION APPLIES ONLY TO THE FIRST AND SECOND OLYMPICS
For the PGS detector simulations used here, we have chosen
the
following algorithm (this is neither ideal nor a long-term solution,
but it will have to do for now...)
Jets are defined using a cone
algorithm centered on the highest Et tower (cell in eta-phi space) or
"seed", i.e. cells within R=Conesize of this "seed" are included in the
jet. Once such a jet is defined, the center of the jet may no longer be
the "seed" tower. Treating each cell within the jet as a massless
particle, a jet 4-vector is defined. If two jet 4-vectors are
separated by less than R=Conesize, then these jets are merged into one
jet. An artificial shoulder may appear in delta-R distributions as a
result. Currently "Conesize" = 0.7, which is a common choice of jet
theorists.
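To make the description above concrete, here is a simplified Python sketch of a seeded cone algorithm with a merging step. It is not the PGS code. In particular, it assumes that each tower is assigned to at most one jet, that no seed threshold is applied, and that towers are given as (eta, phi, Et) triples; none of these details is specified above.

```python
import math

CONE_SIZE = 0.7


def delta_r(eta1, phi1, eta2, phi2):
    """Distance in (eta, phi) space, with phi wrapped around 2*pi."""
    dphi = abs(phi1 - phi2) % (2.0 * math.pi)
    if dphi > math.pi:
        dphi = 2.0 * math.pi - dphi
    return math.hypot(eta1 - eta2, dphi)


def four_vector(eta, phi, et):
    """Four-momentum (E, px, py, pz) of a tower treated as a massless particle."""
    return (et * math.cosh(eta), et * math.cos(phi),
            et * math.sin(phi), et * math.sinh(eta))


def add(p, q):
    return tuple(a + b for a, b in zip(p, q))


def direction(p):
    """(eta, phi) of a four-momentum."""
    _, px, py, pz = p
    return math.asinh(pz / math.hypot(px, py)), math.atan2(py, px) % (2.0 * math.pi)


def cluster(towers, cone=CONE_SIZE):
    """towers: list of (eta, phi, et) with et > 0. Returns jet four-momenta."""
    remaining = sorted((t for t in towers if t[2] > 0.0),
                       key=lambda t: t[2], reverse=True)
    jets = []
    while remaining:
        seed = remaining[0]  # highest-Et tower not yet used
        inside, outside = [], []
        for tower in remaining:
            near = delta_r(seed[0], seed[1], tower[0], tower[1]) < cone
            (inside if near else outside).append(tower)
        jet = (0.0, 0.0, 0.0, 0.0)
        for tower in inside:
            jet = add(jet, four_vector(*tower))
        jets.append(jet)
        remaining = outside
    # Merge any two jets whose four-vectors are closer than the cone size.
    merged = True
    while merged:
        merged = False
        for i in range(len(jets)):
            for j in range(i + 1, len(jets)):
                if delta_r(*direction(jets[i]), *direction(jets[j])) < cone:
                    jets[i] = add(jets[i], jets[j])
                    del jets[j]
                    merged = True
                    break
            if merged:
                break
    return jets


# Hypothetical towers: two close together, one far away in phi.
for jet in cluster([(0.0, 1.0, 50.0), (0.1, 1.1, 20.0), (0.0, 4.0, 30.0)]):
    print(direction(jet), math.hypot(jet[1], jet[2]))
```

The merging step is what can produce the artificial shoulder in delta-R distributions mentioned above: in this sketch no two final jets can be closer than the cone size.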
Missing
Transverse Energy
What is meant by this term? And how precisely
is it defined in the context of the PGS detector?
"Missing transverse energy" is not missing energy; indeed, what is
"transverse energy"?! Precisely stated, it is the magnitude of the
missing transverse momentum in an event.
Energy conservation cannot be used in a hadronic collider, because so
much energy is carried off in unmeasurable particles --- remnants of
the shattered initial protons --- inside or very near the
beampipe. For the same reason, momentum conservation along the
beampipe cannot be used. However, momentum conservation
transverse to the beampipe should work. A failure of momentum
conservation in the transverse plane suggests
- the presence of a neutrino, or neutrinos, or new undetectable
objects that have carried off momentum invisibly
- a mismeasurement of the energy of a jet, which is a common
occurrence
- a particle sneaking through a crack in the detector structure, or
otherwise evading detection for a technical reason
However, experimentally there is more than one way to define the
missing momentum, because there are multiple measurements of momentum
and they don't always agree. Here is what we do, using our
version of PGS:
Missing-Et is defined by summing
(as a vector) the directed transverse energy deposited in all of the
calorimeter cells (treating each cell as a massless particle) --- this
combines, ideally, the momenta of all photons, electrons,
hadronically-decaying taus, and jets --- and adding to this the
transverse momenta of any muons, whose energy is measured using the
muon detection system. The magnitude of the resultant
vector is the "missing transverse energy".
A caution: muon detection works only
out to |pseudorapidity|=2, whereas the calorimeter extends to |pseudorapidity|=4, so
muons at large |pseudorapidity| (very near the beampipe) can cause
additional missed transverse momentum!
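The vector sum just described can be sketched in Python as follows. The input here is a list of (phi, Et) pairs standing in for calorimeter cells and muons; that input format and the function name are assumptions made for illustration. If you feed it the reconstructed objects from a data file rather than the cells, expect only rough agreement with the stored missing-Et, since energy outside reconstructed objects is then left out.

```python
import math


def missing_et(deposits):
    """deposits: iterable of (phi, et) for calorimeter cells and muons.

    Returns (magnitude, phi) of the missing transverse momentum, which points
    opposite to the vector sum of everything that was measured.
    """
    sum_px = 0.0
    sum_py = 0.0
    for phi, et in deposits:
        sum_px += et * math.cos(phi)
        sum_py += et * math.sin(phi)
    magnitude = math.hypot(sum_px, sum_py)
    direction = math.atan2(-sum_py, -sum_px) % (2.0 * math.pi)
    return magnitude, direction


# A single 50 GeV deposit at phi = 0 must be balanced by 50 GeV at phi = pi.
print(missing_et([(0.0, 50.0)]))
```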
Kinematics
Here we define some of the key kinematic variables used above:
- pseudorapidity --- the
quantity "eta" is not quite the same as the rapidity; it is a stand-in
for "theta", the polar angle in spherical coordinates measured from the beam
axis. In particular, eta is defined to be
-ln[tan(theta/2)]. For massless particles, eta is the same as the
rapidity y, defined in terms of the energy E and z-component of
momentum pz as y=1/2[ln([E+pz]/[E-pz])].
- azimuthal angle --- the
angle "phi" is the angle around the z axis in cylindrical coordinates,
or the longitude in spherical coordinates, with the beam axis oriented
along the z axis.
- R --- a distance measure
in (eta,phi) space --- if two particles have momenta pointing in the
directions eta1,phi1 and eta2,phi2, respectively, then their distance
in R is Sqrt[(eta2-eta1)^2+(phi2-phi1)^2]
- transverse momentum ---
the px and py components of the momentum of an object, with the beam
axis being the z direction.
- invariant mass --- the
square root of the Lorentz-invariant square of the sum of the
four-momenta of two (or more) objects, i.e. M^2 = (sum of E)^2 - |sum of p|^2.
If a particle of mass M decays to n objects, the invariant mass of the
n objects will be M. In the case of a jet, the cells of the
calorimeter that lie within the jet-defining cone are viewed as having
detected the energy and direction (and therefore the momentum) of
little massless mini-objects. The invariant mass of the jet is
computed in the same way from the sum of all the mini-object four-momenta. Note
this is generally much larger than the invariant mass of the charged
particles (detected as tracks) in the jet, and much larger than the
mass of the quark which generated it.
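The definitions above translate directly into a few Python functions. This is a sketch with made-up function names; the one departure from the formula for R given above is that the difference in phi is wrapped into the range (-pi, pi], which is an assumption on our part but is needed when two objects sit on either side of phi = 0.

```python
import math


def pseudorapidity(theta):
    """eta = -ln[tan(theta/2)], with theta measured from the beam axis."""
    return -math.log(math.tan(theta / 2.0))


def rapidity(energy, pz):
    """y = 1/2 ln[(E + pz) / (E - pz)]."""
    return 0.5 * math.log((energy + pz) / (energy - pz))


def delta_phi(phi1, phi2):
    """phi1 - phi2, wrapped into (-pi, pi]."""
    d = (phi1 - phi2) % (2.0 * math.pi)
    return d - 2.0 * math.pi if d > math.pi else d


def delta_r(eta1, phi1, eta2, phi2):
    """Distance in (eta, phi) space."""
    return math.hypot(eta2 - eta1, delta_phi(phi2, phi1))


def four_momentum(eta, phi, pt, mass=0.0):
    """(E, px, py, pz) from the quantities stored in the data files."""
    px, py, pz = pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return (energy, px, py, pz)


def invariant_mass(momenta):
    """Square root of (sum E)^2 - |sum p|^2 for a list of four-momenta."""
    e, px, py, pz = (sum(p[i] for p in momenta) for i in range(4))
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


# Two back-to-back massless objects with 45.6 GeV of transverse momentum each.
pair = [four_momentum(0.0, 0.0, 45.6), four_momentum(0.0, math.pi, 45.6)]
print(invariant_mass(pair))
```

Applied to every muon-antimuon pair in a sample, `invariant_mass` gives the dimuon mass distribution discussed earlier.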
Column
Seven
NOTE THE FOLLOWING
INFORMATION APPLIES ONLY TO THE FIRST AND SECOND OLYMPICS
The information stored in this column is mostly for
advanced
users. "R" is defined here.
- if the object is a jet, it gives the number of charged tracks
associated with the jet (the tracks within a cone of radius R=0.7)
- if the object is an electron or a photon, it gives the number of
tracks within a cone of radius R=0.15 centered on the electron or
photon.
- if the object is a muon, it gives the number of tracks within a
cone of radius R=0.4 centered on the muon
- if the object is a hadronically-decaying tau, it again indicates
the number of charged tracks associated with the tau, within a cone of
radius R=0.175
- otherwise it is zero
All of the above tracks must be above a threshold of 1 GeV. Look at the
PGS code for more details.
Heavy
Flavor Tagging
Sometimes this is called "b-tagging", since the main goal of tagging is
usually to detect bottom quarks, but in fact significant numbers of
charm quarks get detected this way also. The key feature of
bottom and charm quarks is that they both live just long enough to
usually decay at a measurable distance from the initial collision
point. When a hadron containing a bottom or charm quark decays
after travelling a few millimeters from the collision point, the
charged particles created in the decay can form a "displaced vertex",
or at the very least, they do not point back to the collision point ---
they have a nonzero "impact parameter". The decays also can
produce muons (which are harder to fake than electrons, so they are
preferentially used) which are close to the jet. The observation
within a jet of a displaced vertex, tracks with nonzero impact
parameter, and/or a single muon all give evidence that a heavy
quark was somewhere in the jet.
It is expected that about 50 (15) percent of jets containing
bottom (charm) quarks will be "tagged" at LHC, while about 1 percent of
other jets are tagged by accident --- "mistags".
However, one cannot take these numbers at face value. First,
adjustments in the tagging algorithm can increase or decrease all three
tagging rates; certain analyses may need very pure samples,
demanding very "tight" tagging requirements, whereas others may need
high statistics, in which case "loose" requirements would be
used. Second, a single number is not a proper estimate of a
tagging rate; the tagging
probabilities for bottom, charm, and non-heavy-flavor jets
depend on the jet's transverse
momentum and pseudorapidity (among other things, such as the
luminosity.)
NOTE THE FOLLOWING
INFORMATION APPLIES ONLY TO THE FIRST AND SECOND OLYMPICS
The PGS detector used in the
black box samples was set to have tagging probabilities of the
following type:
- for a jet containing a bottom hadron, the probability of tagging
= 0.57 tanh(Et / 36.05) * 1.1
- for a jet containing a charm hadron, the probability of tagging =
0.173 tanh(Et / 42.08) * 1.1
- for a jet containing neither, the probability of tagging =
Min[Et, 100] /10000
where "Et" is the magnitude of the jet's "transverse energy"
(constructed from the transverse momentum and the invariant mass of the
jet.) The additional factor of 1.1 accounts for using detected
soft leptons from the heavy-flavor decays to boost the tagging
efficiency. (There is a subtlety concerning the
somewhat crude
way that PGS implements b-tagging which lowers this rate slightly; see
our detailed discussion of PGS
itself.) Tagging efficiency falls off rapidly for jets near the
beampipe; in our current implementation of PGS, no jet at
|pseudorapidity| > 2 can be tagged.
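For reference, the tagging probabilities quoted above can be written as a short Python function. This sketch reads each formula as a straightforward product (for example 0.57 x tanh(Et/36.05) x 1.1) and takes Et in GeV, both of which are our reading rather than something stated explicitly; it takes Et as an input instead of constructing it, and it ignores the PGS subtlety mentioned above that lowers the rate slightly. The function name and flavor labels are made up for illustration.

```python
import math


def tag_probability(et, flavor, eta=0.0):
    """Approximate probability that a jet is heavy-flavor tagged.

    et: transverse energy of the jet (assumed to be in GeV)
    flavor: "bottom", "charm", or "light" (illustrative labels)
    eta: pseudorapidity; no jet with |eta| > 2 can be tagged
    """
    if abs(eta) > 2.0:
        return 0.0
    if flavor == "bottom":
        return 0.57 * math.tanh(et / 36.05) * 1.1
    if flavor == "charm":
        return 0.173 * math.tanh(et / 42.08) * 1.1
    if flavor == "light":
        return min(et, 100.0) / 10000.0
    raise ValueError("unknown flavor: " + repr(flavor))


for flavor in ("bottom", "charm", "light"):
    print(flavor, tag_probability(100.0, flavor))
```

Note that the mistag probability for light jets rises linearly with Et and saturates at 1 percent for Et of 100 and above.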
Note also that using "1 percent" to describe the probability
of mistags is inherently ambiguous. High-energy gluons can
quite often produce charm or bottom quark pairs as they form a
jet. When this happens, the term "mistag" is unclear --- when the
experimentalists say they have a 1 percent mistag rate, does it
include this effect? The answer is no. Experimentalists mean that
the probability of mistagging the gluon jet when it does not split into charm or bottom
is 1 percent. The overall tagging probability for a gluon is
probably closer to 3 percent, 1 percent each for a mistag, a split to
charm followed by a tag, or a split to bottom followed by a tag.
While this naively seems like a small effect, it is sometimes very
important.
Lepton
Isolation:
Leptons are a very important sign of potential new physics, since,
naively, QCD processes don't generate leptons. But this isn't
really true. Jets generate leptons, or apparent leptons, in
several ways. First, a charged pion overlaid on a pi-zero, which
decays to two photons, can look just like an electron: a track with
electromagnetic energy. Most of the time there's hadronic energy
too, which disfavors identifying this as an electron, but fluctuations
happen and sometimes the hadronic energy isn't registered. So we
get a "fake" electron. It's harder to fake muons, but not
impossible. Of course, a fake electron will generally be inside,
or near, a jet, since other hadrons typically will accompany the
pion-fake-electron. So if we demand the electron be isolated ---
that there be
no nearby tracks or energy in the calorimeter --- we are probably
looking at a real electron. Probably.
Another way to get an electron or muon is from the production of a
bottom or charm quark, which has a certain probability of decaying to a
lepton. Such a lepton typically is also inside a jet formed from
the rest of the shower of particles that are created as the bottom
quark discovers it is confined. But the kick from the bottom
quark decay tends to knock the leptons out of the jets a little bit,
and occasionally they will be isolated enough to be indistinguishable
from "prompt" leptons from W bosons, or Z bosons, or other new
sources. Again, an isolation requirement reduces, though it
does not eliminate, the chance of mistaking a lepton that comes
from a nearby heavy quark jet for a prompt lepton.
Lepton isolation requirements are generally different for electrons and
muons, and in any case the efficiency with which a detector detects
muons and electrons will be different. Do not expect the numbers
of muons and electrons to be equal, even within a standard model
calibration sample! Instead, you need to learn something about
how the lepton isolation efficiency affects signals in order to draw
correct conclusions about the underlying physics.
NOTE THE FOLLOWING
INFORMATION APPLIES ONLY TO THE FIRST AND SECOND OLYMPICS
The current version of PGS used in these blackboxes has a new lepton
isolation criterion compared to the PGS used for summer 2005's black
box, and EXPECT THIS TO CHANGE AGAIN WHEN THE NEW
VERSION OF PGS BECOMES AVAILABLE.
Black
Box General Information
These types of blackboxes and the calibration samples are generated in
three steps (through a single cross-linked computer program, which will soon be
available to participants --- see below.)
1) Feynman diagrams (matrix elements) are calculated to obtain the rate
for a particular short-distance physics production process, such
as quark-antiquark annihilation into two photons [this can be done with
CompHEP
or Pythia
or Herwig
or MadGraph/MadEvent or ALPGEN or other matrix
element programs.] A caution about such data generation can be
found HERE.
2) the short distance physics is "evolved" to long-distance physics,
accounting for the conversion of quarks and gluons into jets of
hadrons, decays of tau leptons, and other processes of importance [this
is done, for these blackboxes, with Pythia
6.324, though other programs
including Herwig
are available for this purpose] Problems with
this stage are commented on HERE.
3) the resulting hadrons, leptons and photons are run through a
program called "PGS"
(Pretty Good Simulation), written by John Conway
(UC Davis) which serves as a
simulated detector. Jet reconstruction and lepton identification are
done at this stage. The output of PGS is the blackbox data, or
calibration sample data, that you are downloading. See the
warning BELOW. The
data files can be
read by eye and are easily
interpreted.
Some Very Important Supplemental
Information
- Information about lepton
isolation requirements is HERE.
- Information about "tagging"
of jets including heavy flavor quarks is HERE.
- Information about the "detector"
which PGS simulates is HERE.
- Information about triggering
--- what it is, why it matters, why you care, and why
you cannot do a full analysis if you don't understand how it was done
for these samples --- is HERE.
Some Aspects of Event Generators
Most event generators are wonderful for some things but have
significant limitations for others. Some are very easy to use and
convenient to run, but will only do 2 -> 2 processes in the main
scattering event (which leaves out many 2 -> 3 and 2 ->
4 processes that can be important standard model backgrounds to new
physics signals; for example g g -> t tbar Z is a source of large
missing energy and leptons.) Many cannot handle the cascade
decays of new particles correctly; they may fail completely (because
the phase
space integrals required simply take too long) or they may simply
discard some
important information (such as the correlations between the spins of
the new particles and how those correlations propagate into the decay
products.) Some generators that can handle these issues pretty
well are
unfortunately harder to modify to accommodate new-physics
processes. There is no simple solution here --- it is necessary
to understand both the generator you
are using and the physical processes (signal and background) that you
are simulating, in order to avoid very significant errors.
Moreover, even if your event generator correctly computes tree level
amplitudes, this doesn't mean it does the physics right. Loop
corrections are huge in QCD (more precisely, without a loop correction,
tree amplitudes suffer from large ambiguities, since they are
proportional to a power of a running coupling, whose value is not
determined at tree level!) This can be very roughly dealt with,
process by process, by normalizing the rate for each process using a
next-to-leading order computation of that rate and hoping the
tree-level result is still giving the correct kinematic
distributions. But this is not practical for simulating many
processes at once, since typically event generators are not written in
such a way that you can easily adjust the normalization of each process
by hand. One should also remember that parton distribution
functions are needed for predicting the rate of any given process, but
these functions are neither perfectly determined from experiments
(especially gluon and heavy quark distributions, which are important at
LHC) nor free from effects of loop corrections. So don't take any
one of our black box data files too seriously --- our simulation of the
signal from a new physics model is not, for these and other important reasons, what would actually be seen at LHC if this
model were a correct description of the real world. The errors
are very hard to quantify without a detailed study of both the signal
and standard model backgrounds.
Another modern effort in event generation involves the MC@NLO
project (Monte Carlo at Next-to-Leading Order).
Some Aspects of Showering and Hadronization
[to be added]
Some Aspects of the PGS Detector
[to be added]
Triggering
Changed 12/07/05: Thank you to Patrick Meade, Csaba Csaki, Christian Spethmann
and others at Cornell for their questions, studies and comments.
NOTE THE FOLLOWING
INFORMATION APPLIES ONLY TO THE SECOND OLYMPICS
What is triggering? Why is
it necessary? How is it implemented [very, very crudely!] here?
The rate of collisions at the Tevatron or the LHC is many orders of
magnitude too large for a record of every collision to be stored.
The detectors are so enormous, with so many data channels, that to
store the record of a single collision requires a stunning amount of
memory. Moreover, recording the events takes time. Roughly
100,000,000 events per second occur, but only about 100 of these can be
recorded.
So how do experimentalists decide how to select 10^2 out of 10^8 events
each second? The detector must contain an elaborate "trigger" as
part of its hardware and software which does a partial analysis of each
collision to decide whether it is sufficiently interesting to be worth
recording.
For instance, if an event has a muon in it, it has a good chance of
being interesting. If there is a large amount of missing
transverse momentum, it has a good chance of being interesting.
If there are several jets with a TeV of transverse momentum, it's
interesting. Etc. So the trigger consists of a set of
conditions: if an event appears to satisfy one or more of these
conditions, the detector software will trigger a full readout of the
detector data. Otherwise, the event is dumped into the void, and
lost forever.
Triggering is all about compromise. We can't record all events with candidate electrons
in them without having to throw away some
events that have large missing transverse momentum --- there are just
too many. So we require any interesting electrons to have some
minimum amount of transverse momentum... unless, say, there are two
leptons in the event, which is more rare, in which case we can lower
that minimum, or, say, the event has both an electron and a substantial
amount of missing transverse momentum, which is also rare, in which case
again the minimum could be lowered. QCD produces huge numbers of
events with high-p_T jets, so we can't record all events with high-p_T
jets without having no storage space for events with muons. So we
might demand that an event, in order to be stored, have at least three
jets that satisfy a condition: one jet must have at least 650 GeV of
p_T, the next must have at least 300 GeV, the next at least 150
GeV. Etc.
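To make the jet condition above concrete, here is a minimal Python sketch of such a three-jet requirement. The function name and the data layout (a plain list of jet p_T values in GeV) are our own illustrative assumptions, and the thresholds are just the example numbers quoted above, not the settings of any real trigger.

```python
def passes_three_jet_trigger(jet_pts, thresholds=(650.0, 300.0, 150.0)):
    """Return True if the hardest jets satisfy the staggered p_T thresholds.

    jet_pts    -- jet transverse momenta in GeV, in any order (assumed layout)
    thresholds -- minimum p_T for the leading, second and third jet (example values)
    """
    # Order the jets from hardest to softest before comparing with the cuts.
    ordered = sorted(jet_pts, reverse=True)
    if len(ordered) < len(thresholds):
        return False  # not enough jets in the event
    # The i-th hardest jet must reach the i-th threshold.
    return all(pt >= cut for pt, cut in zip(ordered, thresholds))


if __name__ == "__main__":
    print(passes_three_jet_trigger([700.0, 320.0, 160.0, 40.0]))  # all three cuts met
    print(passes_three_jet_trigger([700.0, 320.0, 100.0]))        # third jet too soft
```

Note that an event with two extremely hard jets and nothing else fails this particular condition; a real trigger would catch it with a different condition from its list.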
Now here's the problem: this means that the detector records events
that pass a rather complicated set of conditions. Even ignoring
the fact that the detector trigger decisions are imperfect, this makes
for a very complex situation. For instance, we cannot easily ask
how many events in a new signal have a muon compared to how many events
have no leptons. Or rather, we can ask it, but it doesn't tell us
anything, because the trigger conditions for events with muons and for
events with no leptons are completely different, and the effect of this
difference is very hard to estimate unless, in addition to
understanding your detector very well, you have a precise and detailed
model of the new physics.... which was precisely what we were trying to
construct in the first place! So the problem is circular, and
very difficult to solve.
Triggering is so complicated that we have decided not to address it
properly yet in our LHC Olympics workshops. On the other
hand,
not to do any triggering at all is to be misleading to the point of
silly... we'd end up keeping all sorts of events which would not even
be written down for storage by the LHC!
NEW 12/07/05: The original description of our triggering
prescription was incorrect, due to a misunderstanding of the
agreed-upon procedure between the author of the website and the
executor of the code. Apologies! Our original intention was
to do something simple and very crude, but at least not totally
unrealistic. Instead, what we have actually done is more
realistic than intended (still crude) and much less simple. (At
least it isn't less realistic and less
simple.) We are not currently providing participants with
sufficient information to understand the effect of the trigger, and we
are working to improve the situation and are discussing workarounds
with experts at the present time. Actually the whole issue is
quite interesting and instructive, and we may open up a page on the
wiki for further discussions amongst both organizers and participants.
Triggering involves a decision that must be made very rapidly.
Real detectors therefore have to make these decisions based on partial,
incomplete and often erroneous information. This means that
interesting events are sometimes missed by the trigger, and conversely,
events which seem interesting at first glance may turn out to be less so
after being more carefully analyzed. For instance, an initial
look at the calorimeter may reveal a narrow isolated cluster of energy
in the hadronic calorimeter that hints at being a tau. To check
if it is likely to be a tau, the triggering system looks to see if
there are a small number of tracks in the vicinity (one or three would
be expected.) But because of the limited time available, the detector
will reconstruct tracks quickly, using only two projected dimensions
(radius and azimuthal angle phi) of the three-dimensional tracking
information. This allows two types of errors, with either sign:
(a) the detector may fail to reconstruct a track which is actually
present, for example because of tracks which are superposed and
crossing when projected onto radius and phi, or (b) the detector will
see a track that points at the tau-candidate cluster, but later, with
more time for track reconstruction, this turns out to be a coincidence:
although the track has the same phi as the cluster, its eta
(pseudorapidity) is completely different. Because of these
errors, or more precisely, inefficiencies and fakes, an event with a
tau may be thrown away, or an event without a tau may be kept, on the
basis of the triggering process.
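The fake-track error (b) can be illustrated with a small Python sketch. The matching windows of 0.1 in phi and eta, and the representation of tracks and clusters as (eta, phi) pairs, are assumptions made purely for illustration; they are not taken from PGS or from any real trigger.

```python
import math


def delta_phi(phi1, phi2):
    """Signed difference between two azimuthal angles, wrapped into (-pi, pi]."""
    dphi = (phi1 - phi2) % (2.0 * math.pi)
    if dphi > math.pi:
        dphi -= 2.0 * math.pi
    return dphi


def matches_fast(track, cluster, dphi_max=0.1):
    """Trigger-style match: compare phi only (assumed window dphi_max)."""
    return abs(delta_phi(track[1], cluster[1])) < dphi_max


def matches_full(track, cluster, dphi_max=0.1, deta_max=0.1):
    """Offline-style match: require agreement in both phi and eta."""
    return (matches_fast(track, cluster, dphi_max)
            and abs(track[0] - cluster[0]) < deta_max)


if __name__ == "__main__":
    cluster = (1.0, 0.52)       # (eta, phi) of a hypothetical tau-candidate cluster
    real_track = (1.02, 0.55)   # genuinely points at the cluster
    fake_track = (-1.5, 0.50)   # same phi, completely different eta
    for name, track in (("real", real_track), ("fake", fake_track)):
        print(name, matches_fast(track, cluster), matches_full(track, cluster))
```

Both tracks pass the fast phi-only match, but only the first survives once eta is taken into account, which is how an event without a tau can still be kept by the trigger.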
Thus it is essential to distinguish between trigger objects (the imperfectly
reconstructed electrons, muons, jets, etc. on the basis of which the
trigger decision is made) and reconstructed
objects (which are the objects that, having been carefully
reconstructed, appear in the data files.) The original intention,
to keep things simple, was to base a crude pseudo-trigger on the
reconstructed objects, which would make it easy for participants to
understand why one event or another passes the trigger. However,
in fact, the PGS used in the current Olympics data set bases its crude
pseudo-trigger on the trigger objects. Consequently, one cannot, by looking at the
reconstructed objects in the data file, understand precisely why a
given event passed the trigger. Precisely so as to avoid
serious confusions about triggering, the experiments keep track of the
trigger objects in any given event, as well as the reconstructed
objects, and they keep track of precisely how a triggering decision was
made for each event. Unfortunately, we do not provide this
information with our data sets, which is a genuine structural problem
that makes life more difficult than it should be. In particular,
it makes analysis and use of the calibration sets much more complicated
[as demonstrated clearly to us by the Cornell group -- thanks!]
We will attempt to address and improve these issues in later rounds of
the Olympics, but have not yet decided on the best fix for the Winter
06 Olympics.
For nonspecialists, this should not affect a great deal of what you
want to do. Kinematic reconstruction of mass peaks and endpoints
should not be much affected, and looking at ratios where trigger
decisions and other issues largely cancel (which the experimentalists
do all the time) should help a great deal with the analysis.
Simulating a model will be more subtle, however, and you will need to
make sure that you use the right trigger, which is not the default
trigger for PGS. We will work toward helping participants with
this in the immediate future... stay tuned. The unofficial
HarvardVersion of PGS on the LHC Olympics wiki does have the correct
trigger.
For specialists, or anyone wanting to know how the current trigger
really works, here's what we currently are aware of. Caution:
we may not have everything quite straight yet; we are waiting for
confirmation from the experts that there are no mistakes in what
follows. At the trigger level, the trigger uses
3-dimensional clusters of energy in the calorimeter, hits in the muon
chamber, and rudimentary tracking information to make lists of all the
following candidate objects:
a) photon candidates
b) electron candidates
c) muon candidates
d) hadronic tau candidates
e) jet candidates
HOWEVER, any cluster of energy that makes it into (a) also makes it
into (b), and vice versa, so lists (a) and (b) are the same; also, all
objects in (a), (b) and (d) also appear in (e). Thus: there is no
exclusive-or that says that a jet candidate is not a tau candidate, or
that an electron candidate is not a photon candidate. On the
contrary: the trigger is as inclusive as possible: all photon
candidates are also electron candidates and jet candidates. But
not all jet candidates are electron candidates, because not all jets
have a lot of electromagnetic energy compared to their hadronic energy;
so the logic is not reversible.
To say it differently: all large clusters (or clusters of clusters) of
energy are jet candidates; all narrow clusters with few tracks are tau
candidates; all clusters with an abundance of energy in the
electromagnetic calorimeter are both photon and electron
candidates. Muon candidates all come from hits in the muon
chamber, without an isolation requirement.
Also, the trigger system adds up all the energy in the calorimeters as
two-vectors in the plane transverse to the beam, and asks how much
transverse momentum is missing; this is the trigger-level missing
"energy". Currently, it does not test for the presence of muons
(which don't leave much of their energy in the calorimeters) and so
missing energy at trigger level may simply be due to muons in the
event. (The reconstruction-level
missing energy objects that appear in the data files do not suffer from this
issue.)
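The following Python sketch shows how a muon can masquerade as missing energy at trigger level. The event, the (p_T, phi) representation, and the 3 GeV calorimeter deposit assigned to the muon are all hypothetical choices made for illustration; this is not the PGS algorithm itself.

```python
import math


def missing_pt(objects):
    """Return (magnitude, phi) of minus the vector sum of transverse two-vectors.

    objects -- iterable of (pt, phi) pairs, pt in GeV (assumed layout)
    """
    px = -sum(pt * math.cos(phi) for pt, phi in objects)
    py = -sum(pt * math.sin(phi) for pt, phi in objects)
    return math.hypot(px, py), math.atan2(py, px)


# Hypothetical balanced event: a 100 GeV jet recoiling against a 100 GeV muon.
jet = (100.0, 0.0)
muon_true = (100.0, math.pi)   # the muon's actual transverse momentum
muon_calo = (3.0, math.pi)     # assumed small energy it leaves in the calorimeter

# Trigger level: only calorimeter energy is summed, so the muon is mostly invisible.
trigger_met, _ = missing_pt([jet, muon_calo])
# Reconstruction level: the muon momentum is included, and the event balances.
corrected_met, _ = missing_pt([jet, muon_true])

if __name__ == "__main__":
    print(f"trigger-level missing pT:        {trigger_met:.1f} GeV")
    print(f"reconstruction-level missing pT: {corrected_met:.1f} GeV")
```

In this toy event nothing is actually missing, yet the calorimeter-only sum reports nearly 100 GeV of missing transverse momentum.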
Once the lists of candidates are made, and the trigger-level missing
energy is computed, the trigger makes a decision. Most detectors
have a long list of possible decisions. We will include this
actuality in later versions of the LHC Olympics. Right now, we
have a single floating trigger decision process, based on a single
criterion (which,
in retrospect, given what we actually did, is not intrinsically a great
idea, but was a reasonable idea at the time given what was intended by
the author... ah well)
For any simulated event, we consider all the trigger-level electrons
and muons in the
event with p_T > 10 GeV, all trigger-level taus and jets with
p_T > 100
GeV, and the trigger-level missing transverse energy if it is greater
than 50
GeV. (We demand that at least one electron, muon, tau or jet pass
these cuts; we do not trigger on events with only missing transverse
energy and
soft jets.) From the set of trigger objects which pass these
cuts, we construct
a quantity from the absolute values of the transverse momenta of all
objects in all the trigger lists, weighted as follows: [our earlier
formula had a typo that weighted the leptons with the number 4.0
instead of 5.0]
ht_sum = sum_{(b),(c)} 5.0 |pt| + sum_{(d)} 0.2 |pt| + sum_{(e)} 0.2 |pt| + |pt_missing|
where the sums run over the objects in the trigger lists (b), (c),
(d) and (e) described above. Notice that list (a) does not appear; it is
redundant, since lists (a) and (b) are identical. Since we are
summing over objects in all the trigger object lists, an electron
candidate, which appears in list (a), list (b) and list (e), can
contribute through both (b) and (e); taus appear in
lists (d) and (e); and muons in list (c) can, as mentioned above,
contribute to the missing pt. Then we demand
that
ht_sum > 150 GeV
If this is true, we keep the event; the objects in the event are fully
reconstructed and written to the data files. If not, we discard
the event.
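The decision just described can be sketched in a few lines of Python. This is only our reading of the prescription above, not the PGS code: the function names and the choice to pass each trigger list as a plain list of p_T values in GeV are assumptions, and the caller is responsible for the overlaps between lists (for instance, an electron candidate must be entered in both the electron list and the jet list).

```python
LEPTON_PT_MIN = 10.0    # GeV, cut on trigger-level electrons and muons
HADRON_PT_MIN = 100.0   # GeV, cut on trigger-level taus and jets
MET_MIN = 50.0          # GeV, cut on trigger-level missing transverse energy
HT_SUM_MIN = 150.0      # GeV, threshold on the weighted sum


def ht_sum(electrons, muons, taus, jets, met):
    """Weighted sum over trigger objects passing their cuts.

    Returns None when no electron, muon, tau or jet passes, since events with
    only missing transverse energy and soft jets are not triggered on.
    """
    leptons = [abs(pt) for pt in list(electrons) + list(muons)
               if abs(pt) > LEPTON_PT_MIN]
    hadrons = [abs(pt) for pt in list(taus) + list(jets)
               if abs(pt) > HADRON_PT_MIN]
    if not leptons and not hadrons:
        return None
    met_term = abs(met) if abs(met) > MET_MIN else 0.0
    # Lists (b),(c) carry weight 5.0; lists (d),(e) carry weight 0.2.
    return 5.0 * sum(leptons) + 0.2 * sum(hadrons) + met_term


def passes_trigger(electrons, muons, taus, jets, met):
    """Keep the event only if ht_sum exists and exceeds 150 GeV."""
    total = ht_sum(electrons, muons, taus, jets, met)
    return total is not None and total > HT_SUM_MIN


if __name__ == "__main__":
    print(passes_trigger([31.0], [], [], [], 0.0))    # one 31 GeV electron
    print(passes_trigger([], [], [], [700.0], 0.0))   # one 700 GeV jet
```

With these weights a single lepton just above 30 GeV is enough to fire the trigger, while a lone jet needs more than 750 GeV, which makes the concern about the low jet weighting easy to see.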
The condition chosen above is an extremely crude summary of a number of
the
essential
trigger conditions at the CMS and ATLAS detectors, though it still
leaves out many details of the real triggering process, and uses taus
and b-tags less effectively than is expected to be possible. (However, concerns have been raised that
the weighting of jets is dangerously low; the trigger rejects too many
fully-hadronic top pair production events, for example... a point well
taken.) If it had been applied to the reconstructed
objects, then it would have been very easy for
participants to understand its effects. But since it was applied
to trigger objects, which are not provided to participants, it is
essentially impossible --- currently --- to do so.
Comments on this section are welcome!
No Background?!!
A signal without standard model background
is not just highly
unrealistic, it is potentially deeply misleading. These backgrounds are
huge. Many of the features of a new signal can be swamped beyond
repair by standard model processes. It is easy to invent
techniques which will work on a pure signal but will fail when the
signal is contaminated by some standard model background.
Etc.
It is important for
participants in the LHC Olympics to think about this issue
carefully. However,
with this apology, we proceed in this fashion for the moment because
(1) the
problem of extracting information from pure signals is already
nontrivial, and
can serve as a useful learning tool despite its drawbacks, and (2) this
is
the best that we can do at the moment, for important technical reasons;
more on this HERE. In the
future, signals which include backgrounds will be a
part of the LHC Olympics data challenges (though signals without
backgrounds will still be provided for beginners.) For the
moment, if you want to think about the backgrounds to the current black
boxes, some suggestions on how to do so are HERE.
How to use Calibration Samples
[suggestions to come soon]
Crude implementation of Standard Model Backgrounds -- possible methods:
How should I think about standard model backgrounds to these signals?
[answers to come soon]
Why it is so hard to make black boxes with realistic Standard Model Backgrounds --- a challenge for the reader:
It is not at all straightforward to
make a suitable standard model background sample for a given black
box!!! Here are just a few of the issues.
All of the largest backgrounds at LHC involve QCD physics, either in
full or in part. For instance, for a signal
in which events containing a single lepton plus jets play a part, the
dominant background is often from a W boson produced in
conjunction with jets. Fine --- this is a standard model process
--- why not just simulate the production of W bosons plus various
numbers of quarks and gluons, and be done with it?
The problem is that it isn't possible. Simulating this process,
or rather, set of processes, to a satisfactory degree is beyond the
state of the art.
One problem is that we simply don't know how to generate these events
with good accuracy. Consider the W plus 4 gluons process... we
can calculate the Feynman diagrams using
various event generators. It has a rate of order
(alpha_s)^4. But which alpha_s? It's a running coupling,
and at tree level there's simply no control over the scale \mu at which
it should be evaluated. We need the next-to-leading order process
to be calculated also, in order to reduce the dependence on the choice
of the scale \mu. This hasn't been done; it involves a very
non-trivial set of loop graphs, which have not yet been
calculated. (W plus three jets is at the cutting edge.)
Consequently, we can only guess at the best
choice of \mu, and thus can only guess at the appropriate
alpha_s. Again, alpha_s appears to the fourth power in the
rate. So the rate for this one process is only known to a factor of 2
or 3 or so.
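To get a feeling for the size of this ambiguity, one can vary the scale by a factor of two in each direction using the standard one-loop running of alpha_s. The Python sketch below assumes alpha_s(M_Z) = 0.118, five active quark flavors at all scales, and a central scale of 100 GeV; all of these are illustrative choices of ours, not values used for the black boxes.

```python
import math

M_Z = 91.1876        # GeV, Z boson mass used as the reference scale
ALPHA_S_MZ = 0.118   # assumed value of alpha_s at M_Z
N_FLAVORS = 5        # assumed fixed number of active flavors (no thresholds)


def alpha_s(mu):
    """One-loop running coupling at scale mu (GeV), evolved from M_Z."""
    b0 = (33.0 - 2.0 * N_FLAVORS) / (12.0 * math.pi)
    return ALPHA_S_MZ / (1.0 + b0 * ALPHA_S_MZ * math.log(mu**2 / M_Z**2))


def rate_ratio(mu_low, mu_high, power=4):
    """Ratio of tree-level rates, proportional to alpha_s**power, at two scales."""
    return (alpha_s(mu_low) / alpha_s(mu_high)) ** power


if __name__ == "__main__":
    central = 100.0  # GeV, hypothetical central scale
    low, high = central / 2.0, central * 2.0
    print(f"alpha_s({low:.0f} GeV)  = {alpha_s(low):.4f}")
    print(f"alpha_s({high:.0f} GeV) = {alpha_s(high):.4f}")
    print(f"rate(mu={low:.0f}) / rate(mu={high:.0f}) = {rate_ratio(low, high):.2f}")
```

With these inputs the two equally defensible scale choices give rates that differ by a factor of roughly 2, and the spread grows with each additional power of alpha_s.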
This is only the beginning. To produce the background properly, we
would need to combine the many W plus one jet, W plus two jet, W plus
three jet, W plus four
jet, W plus five jet processes.
Loop corrections have only been performed completely for W plus two
jets; beyond this point, each of the individual W plus n-jet processes
has its own uncertainties, of order a factor of 2 or 3, so the sum of
the rates is
very uncertain. More jets means more powers of alpha_s, which
reduces the absolute rate but increases the relative
uncertainty. And combining these processes in a consistent
way, without double-counting events or incorrectly mixing orders in
perturbation theory, is not trivial.
Another problem is purely technical and would be present even if our
perturbative knowledge of W plus jets were perfect. There are so
many W's produced
at the LHC --- hundreds each second
--- that most of them have to be
thrown away using the triggering
system. By contrast, a typical new signal might have a few
thousands or tens
of thousands of events per year.
If an important part of the
signal involves a lepton and many jets, we will probably have to
impose hard cuts --- i.e.,
impose strict kinematic conditions on the events --- that
preserve much of the signal while discarding almost all of the standard
model
background. What standard model background remains (which may
still be much larger than the signal) will be a tiny
fraction of the W plus jets events that LHC actually produces, and
will lie far out on the tail of any standard model
kinematic distribution.
In this context, how should we provide LHC Olympics participants with
this particular background? Practically, we couldn't possibly
provide the full W-plus-jets background, since we
are talking about data sets which are 1000 to 100,000 times larger than
the signal data sets. But suppose, knowing each signal and
its characteristic features, we imposed some cuts in
advance, in order to reduce the W-plus-jets data sets down to the small
fraction of the events that are the
most important backgrounds to a particular signal. This would
mean simulating the tails of the W-plus-jets distributions. We would still have to simulate huge data
sets,
of order 1000 times larger than the signal, in order to obtain these
tails. This would take weeks. Also, the result for
each separate process contributing to that background would be
uncertain by a factor of at least 3, for the reasons mentioned
above, and so the number and type of events remaining, after the
stringent cuts that we would need to impose, could be wrong by as much
as an order of magnitude or more.
Meanwhile, although this is the worst of the backgrounds of importance,
it is hardly the only one. There are also important backgrounds
from Z plus jets, top quark pairs plus jets, diboson (such as WW plus
jets), and pure QCD
(jets-only) backgrounds, among others.
Incidentally, a naive theorist might think one need not care about pure
QCD
light-quark and gluon backgrounds in a sample with leptons. But
this isn't true. Leptons
can be faked, especially hadronic taus.
Even fake electrons and
muons, which occur rarely, are important; the number of QCD events is
so
extraordinarily large that their presence can often be a serious
issue. Also, real
leptons that are sometimes isolated
are generated in decays of bottom and charm
quarks, which are produced in abundance in QCD events.
Finally, even if we could calculate perfectly, in perturbation theory,
the W-plus-n-quarks/gluons backgrounds, we always have to account for
the fact that
n quarks and gluons in a Feynman diagram does not in general equal n
jets in a detector. Making sure we can model the differences
successfully is highly nontrivial, involving resummation of showering
effects, simulation packages and
matching of those packages to data. This has to be done consistently at
the one-loop level, if we want to make use of recent loop calculations
of Feynman graphs; implementing this in the most important processes at
LHC is still at the cutting edge, as in the ongoing MC@NLO
project (Monte Carlo at Next-to-Leading Order). Then there are
uncertainties
that are smaller, but not unimportant, from the parton distribution
functions (pdfs). For certain questions, the lack of precise
knowledge about the gluon pdf and those of the charm and bottom quark
can contribute to important uncertainties about backgrounds. For
instance, the b-quark and c-quark content of the W-plus-jets background
is not
well-known, so we can't at present know with precision the background
to new signals that produce leptons in association with bottom or top
quarks.
[More to be added later]
Fortunately, the experimentalists will be able to combine data and
theory with a lot of cleverness to remove many of these backgrounds
with some degree of accuracy. The crucial question of whether
this can be done reliably, and under what circumstances, is hotly
debated. Eventually we hope that the LHC Olympics will include
black boxes with reasonably simulated standard model backgrounds, and
that these issues will come to the fore in the experts' portion of our
workshops.
If these
problems, which will afflict the entire LHC enterprise, both worry and
interest you, feel free to contact the organizers, especially Matt
Strassler or Steve Mrenna. Many theorists are needed to help with
state-of-the-art loop calculations and to help with modern
event-generation related projects!
AN IMPORTANT NOTICE FOR PARTICIPANTS CONCERNING PGS:
NOTE THE FOLLOWING INFORMATION APPLIES
ONLY TO THE FIRST AND SECOND OLYMPICS
The PGS program, written by John Conway, and
modified slightly for LHC Olympics purposes by Steve Mrenna and
friends, serves as the
"detector". But PGS, along with our adjusted algorithms for jet
and muon reconstruction,
has changed since the
summer, so analysis done on the old blackbox data must be redone, and
recalibrated, essentially from scratch. [This actually happens in
real life; for instance, ask your favorite experimentalist on the DZero
detector what
they've been doing during summer 05!]
Moreover, PGS WILL CHANGE AGAIN
before the Winter 2006 LHC Olympics, and WE WILL CONSEQUENTLY HAVE TO RERUN THE
DATA SAMPLES (BLACK BOXES AND CALIBRATION SAMPLES) BEFORE THE WINTER
06 LHC OLYMPICS. You should periodically check this
website for
updates.
However, you should be able to do a reasonably effective analysis no
matter how the detector performs, as long as you use
the relevant
calibration samples, which will help you determine how the detector
is
behaving. [This is what's done in real life, after all.]
Before the Winter 2006 meeting, we
expect to have a public
downloadable version of PGS, allowing you to run your favorite
models
through your own version of the detector simulation.
PGS 4 IS NOW AVAILABLE AND IN THE TESTING PHASE (AS OF JUNE 2006)