
Conversation

Member

@f-f f-f commented Jan 11, 2026

This PR integrates the three "Registry scripts" that we run daily to keep data tidy - the package transferrer, the package set updater, and the legacy importer - into the Job Queue framework, as detailed here

The patch is simpler than I thought it would be, and I think that's mainly because we only deal with packages that are already in the registry, rather than packages that could possibly be missing. As a result, the package transfer script and the package set updater are deprecated, while the legacy importer stays in place, since it's the only way we have to bootstrap the registry from scratch at the moment.

@f-f f-f requested a review from thomashoneyman January 11, 2026 23:08
Comment on lines 25 to 37
-- Initialize registry repo before launching parallel processes, to avoid
-- race condition where both Scheduler and Job Executor try to clone the
-- Registry at the same time
void $ runEffects env do
Log.info "Initializing registry repo..."
Registry.readAllMetadata
liftEffect do
case env.vars.resourceEnv.healthchecksUrl of
Nothing -> Console.log "HEALTHCHECKS_URL not set, healthcheck pinging disabled"
Just healthchecksUrl -> Aff.launchAff_ $ healthcheck healthchecksUrl
Aff.launchAff_ $ withRetryLoop "Scheduler" $ Scheduler.runScheduler env
Aff.launchAff_ $ withRetryLoop "Job executor" $ JobExecutor.runJobExecutor env
Router.runRouter env
Member Author

Right, this was a bit annoying - I added another fiber to schedule jobs, and that resulted in two fibers pulling the registry at the same time, with a race condition around the creation of the folder. I am not sure what the best way to approach it is - I thought about merging the two loops (i.e. pull jobs unless 12h have passed since the last time we ran the crons, in which case loop through the registry/github and enqueue stuff, then go back to running jobs), but the package importer might just take a long time since it's hitting github for each package, and we can't afford that kind of slowdown. So for now I am synchronously pulling the registry first thing, but I am not a big fan of this since it doesn't prevent more race conditions in the future.

Member

@thomashoneyman thomashoneyman Jan 13, 2026

Good catch on this race condition. In short, both the Scheduler and JobExecutor fibers call into the REGISTRY effect, which triggers git clones or pulls, and in this specific case, when both fibers start together, they race to create the same directory, causing failures.

We could merge the two loops. The daily update job could end up taking a long time though, as you said, because it fetches tags from all repos and may get rate-limited. That said, if you just want to get this PR over the line perhaps it's acceptable for now. I wrote out a more full solution below but it's a bit riskier.

Member

I think the proper solution here is per-repo locks: we add a mutex per git repository inside the REGISTRY handler, and all git operations for a given repo go through that lock. It has some advantages, but it also means we start running up against the limitations of run as an effects system.

The classic solution would be to use AVar as a way to produce locks in Aff, and use Aff.bracket to acquire the lock, run the action, and release the lock on completion (this is exception-safe, so on exception the lock would also be released).

However, we can't use Aff.bracket because it would require the actions to be run in Aff and our actions intersperse other effects like LOG. You can't enter Aff and then "re-enter" Run like that. There are workarounds but they're very ugly.
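
For reference, here's a minimal sketch of that classic pattern, assuming the guarded action lives entirely in Aff (which is exactly the assumption that fails for us):

import Effect.Aff (Aff)
import Effect.Aff as Aff
import Effect.Aff.AVar (AVar)
import Effect.Aff.AVar as AVar

-- The lock is an AVar Unit created with `AVar.new unit`: `take` empties it
-- (blocking if another fiber currently holds it) and `put` refills it on
-- release; bracket guarantees the release runs even if the action throws.
withLock :: forall a. AVar Unit -> Aff a -> Aff a
withLock lock action =
  Aff.bracket (AVar.take lock) (\_ -> AVar.put unit lock) (\_ -> action)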

So that leaves us with more manual lock management, which ordinarily is a bummer but in our case is actually not too bad. The basic idea would be:

  1. Add per-repo locks to the registry env in registry.purs, each recording the process that currently holds it

The point of recording the owning process is cleanup in the case of an exception -- described later.

data Process = Scheduler | JobExecutor

type RepoLock = { lock :: AVar Unit, owner :: Ref (Maybe Process) }

type RepoLocks = Ref (Map RepoKey RepoLock)

type RegistryEnv = { ..., repoLocks :: RepoLocks }

  2. Add a withRepoLock helper function

withRepoLock :: forall r a. Process -> RepoLocks -> RepoKey -> Run (LOG + AFF + EFFECT + r) a -> Run (LOG + AFF + EFFECT + r) a
withRepoLock process locks key action = do
  repoLock <- getOrCreateLock locks key
  Run.liftAff $ AVar.take repoLock.lock
  Run.liftEffect $ Ref.write (Just process) repoLock.owner
  result <- action
  Run.liftEffect $ Ref.write Nothing repoLock.owner
  Run.liftAff $ AVar.put unit repoLock.lock
  pure result

  3. Use the withRepoLock function in the registry handler

pull key = withRepoLock process env.repoLocks key do
  -- existing logic

  4. Clean up locks on startup/job failure

The basic approach to exception safety is that an exception will kill the fiber (fibers can also be killed by the parent, e.g. on a timeout). Then, the executor/scheduler loops get restarted. At that point we can clear any held locks for the process which restarts -- this avoids the case where an exception leaves the lock held, and then it remains stale.

clearOwnLocks :: Process -> RepoLocks -> Run _ Unit
clearOwnLocks process locksRef = do
  locks <- Run.liftEffect $ Ref.read locksRef
  for_ (Map.toUnfoldable locks :: Array _) \(Tuple _key repoLock) -> do
    owner <- Run.liftEffect $ Ref.read repoLock.owner
    when (owner == Just process) do
      Log.warn "Clearing orphaned lock"
      -- clear the lock: reset the owner and refill the AVar
      Run.liftEffect $ Ref.write Nothing repoLock.owner
      Run.liftAff $ AVar.put unit repoLock.lock

When the loops start or fail, clear the locks:

runJobExecutor env = runEffects env do
  clearOwnLocks JobExecutor env.registryEnv.repoLocks
  Log.info "Starting ..."
  loop

When we kill a fiber, same deal:

jobResult <- liftAff do
  let timeout = Aff.delay (Milliseconds delay) $> Nothing
  ...

case jobResult of
  Nothing -> do
    Log.error "timed out"
    clearOwnLocks JobExecutor ...
    pure false

  Just (Left _) -> do
    Log.error "died"
    clearOwnLocks ...

This should work in each scenario:

  • Normal completion: the lock is released by withRepoLock
  • Job timeout or error: clearOwnLocks runs on the job result
  • Executor/scheduler crash: the loop gets restarted and clearOwnLocks runs on startup

Member

One last thought: clearing locks is for an exceptional scenario where something has gone wrong, and stale locks aren't the only consequence we can run into here - for example, what if we were in the middle of a clone? Then we'll end up with a corrupted repo and the registry will be borked.

So we might want a more general cleanup function that does more than clear locks. It could also run gitCLI [ "rev-parse", "--git-dir" ] to check that each repo is valid and delete it if it isn't. (On pull we already do a force-clean, so stale state is OK.)
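
As a rough sketch of what that more general cleanup could look like - assuming a gitCLI helper along the lines of the one in the app's git module, plus some rmRf-style helper for deleting a directory; the names and signatures here are illustrative, not the real API:

-- On startup (or after clearing a process's locks), check that each checkout
-- is a valid git repository; if it isn't (e.g. a clone was interrupted),
-- delete it so the next pull re-clones it from scratch.
ensureValidRepo :: FilePath -> Run _ Unit
ensureValidRepo path = do
  result <- Run.liftAff $ gitCLI [ "rev-parse", "--git-dir" ] (Just path)
  case result of
    Right _ -> pure unit
    Left _ -> do
      Log.warn $ "Checkout at " <> path <> " is not a valid git repo, deleting it so it gets re-cloned"
      Run.liftAff $ rmRf path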

Member Author

Good writeup! I think we should just add this locking functionality - introducing the scheduler pretty much guarantees race conditions with any jobs that are running.

Member Author

I have applied this suggestion, but I think it needs a bit more work

Nothing -> do
  -- Use the lowest compiler from previous version for compatibility,
  -- falling back to latest if no previous version exists
  compiler <- case Map.findMax metadata.published of
Member

This is probably fine, but it's technically possible to publish a package version that targets an older compiler (like a bug fix to an old series of the package for an earlier compiler). We do have a reliable way to determine the compiler from the resolved packages, as is done in the legacy importer (compilerFromResolutions), and we could use that here.

Member Author

Oh, thanks for the pointer, will use that instead!

Member

@thomashoneyman thomashoneyman Jan 13, 2026

Whoops, it's actually called findFirstFromResolutions and it's an inlined helper function; hopefully it's not too much trouble to move it into the app somewhere and import it back into the importer. Or it may be too tied in to the importer caches, in which case you may want to just create a new function.

findFirstFromResolutions :: Map PackageName Version -> Run _ (Either (Map Version CompilerFailure) Version)
findFirstFromResolutions resolutions = ...

Member Author

Applied this idea, but I kept the existing logic for the case where the package doesn't have any dependencies. In that case it still makes sense to try to take the same compiler as the previous version, instead of falling back to the latest one.
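
For what it's worth, the combined selection could be sketched roughly like this - just a sketch: previousOrLatestCompiler stands in for the existing fallback logic quoted above, and the handling of the Left branch is illustrative:

-- With no dependencies there are no resolutions to derive a compiler from,
-- so keep the old behaviour; otherwise ask findFirstFromResolutions.
compiler <-
  if Map.isEmpty resolutions then
    previousOrLatestCompiler metadata
  else
    findFirstFromResolutions resolutions >>= case _ of
      Right found -> pure found
      Left _failures -> Except.throw "No compiler could build the resolved dependencies"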

Member

@thomashoneyman thomashoneyman left a comment

Noticed a few things that need to be resolved before we merge. Also, I went ahead and removed the daily jobs workflow from the registry repository since this replaces it:
purescript/registry#525
