Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion Project.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
name = "PointProcesses"
uuid = "af0b7596-9bb0-472a-a012-63904f2b4c55"
version = "0.6.0"
version = "0.7.0"
authors = ["Guillaume Dalle", "José Kling", "Julien Chevallier", "Ryan Senne"]

[deps]
Expand Down
12 changes: 9 additions & 3 deletions docs/examples/Hawkes.jl
Original file line number Diff line number Diff line change
Expand Up @@ -94,10 +94,16 @@ plot(p1, p2, p3; layout=(3, 1), size=(800, 600))
# in PointProcesses.jl to fit a piecewise constant intensity function to the data. This will give us an idea of the average litter box usage over the course of a day.
# While this will not give us any information about self-exciting behavior, it will help us understand the daily patterns of litter box usage.

tod = minute.(data.TimestampDT) .+ 60 .* hour.(data.TimestampDT) # time-of-day in minutes since midnight
tod_raw = Float64.(minute.(data.TimestampDT) .+ 60 .* hour.(data.TimestampDT)) # time-of-day in minutes since midnight
day = Date.(data.TimestampDT)
n_days = length(unique(day))
h_day = History(sort(Float64.(tod)), 0.0, 1440.0) # build a "history" on [0, 1440] minutes
# Collapsing many days into a single [0, 1440) window produces events at the
# same minute-of-day across different days. Add a tiny deterministic jitter
# (≪ the 15-minute bin width below) so the events are strictly ordered for
# `History`. The piecewise-constant fit aggregates into bins, so the jitter is
# invisible to the resulting intensity.
tod = sort(tod_raw .+ (1:length(tod_raw)) .* 1e-6)
h_day = History(tod, 0.0, 1440.0) # build a "history" on [0, 1440] minutes
nbins = 96 # 96 bins = 15-minute bins
pp_day = fit(
InhomogeneousPoissonProcess{PiecewiseConstantIntensity{Float64},Dirac{Nothing}},
Expand All @@ -121,7 +127,7 @@ plot(
# Now, let's fit a Hawkes process to the data. We will for now, ignore the cat identities and fit a single Hawkes process to all the data.
# The goal of this analysis is to understand the self-exciting nature of the litter box entries. I.e., if one cat uses the litter box,
# does that increase the likelihood of another cat using it soon after? To do this, we can use the implementation in PointProcesses.jl
full_history = History(data.t, 0.0, maximum(data.t) + 1.0)
full_history = History(sort(data.t), 0.0, maximum(data.t) + 1.0)
hawkes_model = fit(HawkesProcess, full_history)

println("Fitted Hawkes Process Parameters:") # hide
Expand Down
78 changes: 56 additions & 22 deletions docs/src/examples/Hawkes.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,9 +15,9 @@ where μ(t) is the baseline intensity, which accounts for spontaneous events, ``
which controls how past events influence future events.

One of the most common choices for this kernel is the exponential function: ``\phi(s) = \alpha\exp{-\beta s}``. In this model, α is the strength of self-exciting, and β controls the rate of excitation decay.
Given these, parameters, the intensity function is: ``λ(t|\mathcal H_t) = μ + \sum_{t_i < t} \alpha \exp{-\beta (t - t_i)}``. We assume that μ is constant for simplicity. But, we could generalzie this to be inhomogeneous.
Given these parameters, the intensity function is: ``λ(t|\mathcal H_t) = μ + \sum_{t_i < t} \alpha \exp{-\beta (t - t_i)}``. We assume that μ is constant for simplicity. But, we could generalize this to be inhomogeneous.

An important statistic of the Hawkes proces, is the branching ratio which is a measure of the expected number of "daughter" events, a "parent" event will create. For the exponential kernel, this is given by the following equation:
An important statistic of the Hawkes process is the branching ratio which is a measure of the expected number of "daughter" events, a "parent" event will create. For the exponential kernel, this is given by the following equation:
``n = \frac{\alpha}{\beta}``. In the event that n < 1, the process is subcritical, meaning that events will eventually die out. If n = 1, the process is critical, and if n > 1, the process is supercritical, meaning that events can lead to an infinite cascade of events.

Let's now apply this theory, and develop more through the process, by fitting a Hawkes process to some data.
Expand All @@ -36,7 +36,8 @@ First let's open our data. This data records litter box entries taken from three

````@example Hawkes
root = pkgdir(PointProcesses)
data = CSV.read(joinpath(root, "docs", "examples", "data", "cats.csv"), DataFrame)
data = CSV.read(joinpath(root, "docs", "examples", "data", "cats.csv"), DataFrame);
nothing #hide
````

This dataset has three cats in it, so let's first separate out each cat, using their weight to infer their identity
Expand All @@ -46,7 +47,8 @@ cat_weights = [parse(Float64, split(i)[1]) for i in data.Value]
clusters = kmeans(reshape(cat_weights, 1, :), 3; maxiter=100, display=:none)

data.CatWeight = cat_weights
data.CatID = clusters.assignments
data.CatID = clusters.assignments;
nothing #hide
````

Now that we have set this up we need to get the timestamps in a useable order for plotting. We can use the Dates package for this.
Expand All @@ -70,7 +72,9 @@ Since this dataset only has resolution up to the nearest minute, we need to chec

````@example Hawkes
if any(diff(sort(data.t)) .== 0)
println("Data contains tied event times. Please add small jitter to event times to make them unique.")
println(
"Data contains tied event times. Please add small jitter to event times to make them unique.",
)
else
println("Data contains no tied event times. Proceed with fitting. Yay!")
end
Expand All @@ -79,8 +83,23 @@ end
Great! We can now move forward. First, let's visualize the data using a simple event plot.

````@example Hawkes
function eventplot(event_times::Vector{Float64}; title="Event Plot", xlabel="Time (minutes)", ylabel="Events")
scatter(event_times, ones(length(event_times)); markershape=:vline, markersize=10, label="", title=title, xlabel=xlabel, ylabel=ylabel, yticks=false)
function eventplot(
event_times::Vector{Float64};
title="Event Plot",
xlabel="Time (minutes)",
ylabel="Events",
)
scatter(
event_times,
ones(length(event_times));
markershape=:vline,
markersize=10,
label="",
title=title,
xlabel=xlabel,
ylabel=ylabel,
yticks=false,
)
end

cat1_times = data.t[data.CatID .== 1]
Expand All @@ -100,10 +119,20 @@ in PointProcesses.jl to fit a piecewise constant intensity function to the data.
While this will not give us any information about self-exciting behavior, it will help us understand the daily patterns of litter box usage.

````@example Hawkes
tod = minute.(data.TimestampDT) .+ 60 .* hour.(data.TimestampDT) # time-of-day in minutes since midnight
tod_raw = Float64.(minute.(data.TimestampDT) .+ 60 .* hour.(data.TimestampDT)) # time-of-day in minutes since midnight
day = Date.(data.TimestampDT)
n_days = length(unique(day))
h_day = History(sort(Float64.(tod)), 0.0, 1440.0) # build a "history" on [0, 1440] minutes
````

Collapsing many days into a single [0, 1440) window produces events at the
same minute-of-day across different days. Add a tiny deterministic jitter
(≪ the 15-minute bin width below) so the events are strictly ordered for
`History`. The piecewise-constant fit aggregates into bins, so the jitter is
invisible to the resulting intensity.

````@example Hawkes
tod = sort(tod_raw .+ (1:length(tod_raw)) .* 1e-6)
h_day = History(tod, 0.0, 1440.0) # build a "history" on [0, 1440] minutes
nbins = 96 # 96 bins = 15-minute bins
pp_day = fit(
InhomogeneousPoissonProcess{PiecewiseConstantIntensity{Float64},Dirac{Nothing}},
Expand All @@ -130,11 +159,8 @@ The goal of this analysis is to understand the self-exciting nature of the litte
does that increase the likelihood of another cat using it soon after? To do this, we can use the implementation in PointProcesses.jl

````@example Hawkes
full_history = History(data.t, 0.0, maximum(data.t) + 1.0)
hawkes_model = fit(
HawkesProcess,
full_history,
)
full_history = History(sort(data.t), 0.0, maximum(data.t) + 1.0)
hawkes_model = fit(HawkesProcess, full_history)

println("Fitted Hawkes Process Parameters:") # hide
println("Base intensity (μ): ", hawkes_model.μ) # hide
Expand All @@ -153,17 +179,21 @@ Finally, we can visualize the fitted intensity function over time.
````@example Hawkes
ts = sort(data.t)

λ_hawkes(t::Real) =
function λ_hawkes(t::Real)
hawkes_model.μ +
sum((hawkes_model.α * exp(-hawkes_model.ω * (t - ti)) for ti in ts if ti < t); init=0.0)
end

u = range(0.0, maximum(ts) + 1.0; length=2000)

plot(u, λ_hawkes.(u);
xlabel="Time (minutes)",
ylabel="Fitted Hawkes intensity (events/min)",
title="Fitted Hawkes Process Intensity Function",
legend=false)
plot(
u,
λ_hawkes.(u);
xlabel="Time (minutes)",
ylabel="Fitted Hawkes intensity (events/min)",
title="Fitted Hawkes Process Intensity Function",
legend=false,
)
````

From the plot, we can observe how the intensity function varies over time, capturing the self-exciting nature of the litter box entries.
Expand All @@ -176,11 +206,15 @@ test = MonteCarloTest(KSDistance{Exponential}, hawkes_model, full_history; n_sim

p = pvalue(test) # hide
println("Monte Carlo Test p-value for Hawkes Process fit: ", p) # hide
println("Assuming a significance level of 0.05, we " * (p < 0.05 ? "reject" : "fail to reject") * "",) # hide
println(
"Assuming a significance level of 0.05, we " *
(p < 0.05 ? "reject" : "fail to reject") *
"",
) # hide
println("the null hypothesis that the Hawkes process is a good fit to the data.") # hide
````

From this analysis, it seems that we can can say that litter-box usage is indeed a self-exciting process, as the fitted Hawkes process provides a good fit to the data.
From this analysis, it seems that we can say that litter-box usage is indeed a self-exciting process, as the fitted Hawkes process provides a good fit to the data.
It's worth considering that we have ignored the identities of the cats in this analysis. A more thorough analysis could involve fitting a multivariate Hawkes process,
where each cat is represented as a separate dimension. This would allow us to capture the interactions between the cats more accurately. This will be the subject of a
future tutorial.
Expand Down
23 changes: 23 additions & 0 deletions docs/src/examples/Inhomogeneous.md
Original file line number Diff line number Diff line change
Expand Up @@ -266,6 +266,29 @@ scatter!(
)
````

## Goodness-of-Fit via Time Rescaling
By the time-rescaling theorem, if a fitted intensity ``\hat{λ}(t)`` is correct, then the
events transformed by the compensator ``Λ(t) = ∫_{t_{\min}}^{t} \hat{λ}(s) ds`` form a
unit-rate Poisson process. We can use `MonteCarloTest` with `KSDistance{Exponential}` to
check whether the transformed inter-event times look Exp(1):

````@example Inhomogeneous
pp_poly_full = InhomogeneousPoissonProcess(pp_poly)
pp_gauss_full = InhomogeneousPoissonProcess(pp_gauss)

mc_piecewise = MonteCarloTest(KSDistance{Exponential}, pp_piecewise, h; n_sims=500, rng=rng)
mc_poly = MonteCarloTest(KSDistance{Exponential}, pp_poly_full, h; n_sims=500, rng=rng)
mc_gauss = MonteCarloTest(KSDistance{Exponential}, pp_gauss_full, h; n_sims=500, rng=rng)

println("\nGoodness-of-fit p-values (Monte Carlo KS test):") # hide
println(" Piecewise Constant: ", round(pvalue(mc_piecewise); digits=3)) # hide
println(" Polynomial: ", round(pvalue(mc_poly); digits=3)) # hide
println(" Gaussian (Custom): ", round(pvalue(mc_gauss); digits=3)) # hide
````

A small p-value provides evidence against the null hypothesis that the model captures the
observed firing pattern, while a large p-value indicates the model is consistent with the data.

## Model Comparison
Let's compare all models by computing their negative log-likelihoods:

Expand Down
79 changes: 64 additions & 15 deletions src/history.jl
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,23 @@ Linear event histories with temporal locations of type `T` and marks of type `M`
- `marks::Vector{M}`: associated vector of event marks
- `dims::Vector{D}`: associated vector of event dimensions
- `N::Int`: number of dimensions

# Construction

The constructor validates its inputs and throws on any violation rather than
silently coercing them. With `check_args=true` (the default), it requires:

- `tmin < tmax`
- `length(times) == length(marks) == length(dims)`
- `issorted(times)`
- every event time lies in the half-open interval `[tmin, tmax)`
- no two events share a time *within the same dimension* (equal times across
different dimensions are allowed, representing simultaneous events on
different components of a multivariate process)
- if `dims` is provided, every value lies in `1:N`

Pass `check_args=false` to skip validation when the caller can guarantee the
invariants hold (e.g. for performance in inner simulation loops).
"""
struct History{T<:Real,M,D}
times::Vector{T}
Expand Down Expand Up @@ -46,17 +63,39 @@ struct History{T<:Real,M,D}
)
end
if !isempty(times)
perm = sortperm(times)
times .= times[perm]
marks .= marks[perm]
dims .= dims[perm]
if !issorted(times)
throw(
DomainError(
times, "Event times must be sorted in non-decreasing order."
),
)
end
if times[1] < tmin || times[end] >= tmax
@warn "Events outside of provided interval were discarded."
il = searchsortedfirst(times, tmin)
ir = searchsortedfirst(times, tmax) - 1
times = times[il:ir]
marks = marks[il:ir]
dims = dims[il:ir]
throw(
DomainError(
(times[1], times[end], tmin, tmax),
"All event times must lie in the half-open interval [tmin, tmax).",
),
)
end
# Repeated event times are allowed only across different
# dimensions (a simultaneous event in two components of a
# multivariate process). Within a single dimension, two events
# at the same time are pathological for a simple point process.
for i in 2:length(times)
times[i] == times[i - 1] || continue
j = i - 1
while j >= 1 && times[j] == times[i]
if dims[j] == dims[i]
throw(
DomainError(
(times[i], dims[i]),
"Repeated event time within a single dimension.",
),
)
end
j -= 1
end
end
end
if N == 1
Expand Down Expand Up @@ -314,6 +353,12 @@ duration(h::History) = max_time(h) - min_time(h)
push!(h, t, m)

Add event `(t, m)` inside the interval `[h.tmin, h.tmax)` at the end of history `h`.

With `check_args=true` (the default), the event must satisfy
`h.tmin <= t < h.tmax`, occur at or after the last existing event time, and
(if `h` is multivariate) lie in dimension `d ∈ 1:h.N`. Violations throw an
`AssertionError`. Pass `check_args=false` to skip these checks in trusted
inner loops.
"""
function Base.push!(h::History, t::Real, m=nothing, d=nothing; check_args=true)
if check_args
Expand All @@ -331,6 +376,13 @@ end
append!(h, ts, ms)

Append events `(ts, ms)` inside the interval `[h.tmin, h.tmax)` at the end of history `h`.

With `check_args=true` (the default), `ts` must be sorted, lie in
`[h.tmin, h.tmax)`, start at or after the last existing event in `h`, and
have matching lengths with `ms` and `ds`. If `h` is multivariate, every
dimension in `ds` must be in `1:h.N`. Violations throw an `AssertionError`.
The caller's `ts`/`ms`/`ds` vectors are *not* mutated. Pass `check_args=false`
to skip these checks in trusted inner loops.
"""
function Base.append!(
h::History,
Expand All @@ -343,13 +395,10 @@ function Base.append!(
return nothing
end
if check_args
perm = sortperm(ts)
ts .= ts[perm]
ms .= ms[perm]
ds .= ds[perm]
@assert length(ts) == length(ms) == length(ds)
@assert issorted(ts)
@assert h.tmin <= ts[1] && ts[end] < h.tmax
@assert isempty(h) || (h.times[end] <= ts[1])
@assert length(ts) == length(ms) == length(ds)
@assert (eltype(ds) == Nothing && h.N == 1) || all(1 .<= ds .<= h.N)
end
append!(h.times, ts)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -122,18 +122,31 @@ The transformed event times form a unit-rate (homogeneous) Poisson process on
function time_change(h::History, pp::InhomogeneousPoissonProcess)
f = pp.intensity_function
config = pp.integration_config
new_tmax = integrated_intensity(f, h.tmin, h.tmax, config)
T = typeof(new_tmax)
new_times = T[integrated_intensity(f, h.tmin, t, config) for t in h.times]
# `new_tmax` and each `new_times[i]` come from independent quadrature calls,
# so solver / floating-point error can both swap adjacent transformed
# events and let one of them slightly exceed `new_tmax`. Widen `new_tmax`
# to the observed maximum (with an `eps` cushion) so no event is dropped
# at the boundary; small order drift is then handled by the History
# constructor's automatic `sortperm` pass under `check_args=true`.
if !isempty(new_times)
new_tmax = max(new_tmax, maximum(new_times) * (one(T) + eps(T)))
edges = vcat(h.tmin, h.times, h.tmax)
deltas = [
integrated_intensity(f, edges[i], edges[i + 1], config) for
i in 1:(length(edges) - 1)
]
T = eltype(deltas)
i_neg = findfirst(<(zero(T)), deltas)
if i_neg !== nothing
throw(
DomainError(
deltas[i_neg],
"time_change: integrated intensity over " *
"[$(edges[i_neg]), $(edges[i_neg + 1])] is negative. " *
"This usually reflects floating-point jitter from adaptive " *
"quadrature on an interval where the true integral is near zero, " *
"or an `intensity_function` that returns negative values. " *
"Tightening the tolerance in `integration_config` may help if " *
"the true integral is non-negligible; otherwise verify that " *
"`intensity_function` is non-negative on the support of `h`.",
),
)
end
cs = cumsum(deltas)
new_times = cs[1:(end - 1)]
new_tmax = cs[end]
return History(; times=new_times, tmin=zero(T), tmax=new_tmax, marks=h.marks)
end

Expand Down
Loading
Loading