Skip to content

Conversation

@epugh
Copy link
Contributor

@epugh epugh commented Nov 30, 2025

https://issues.apache.org/jira/browse/SOLR-18008

Description

I've added some tests that simulate what happens if you have remnant core files that collide with new core creation.

The first scenario is when you call a create collection command, and on one of the cores, there already is remnant remaining.

The second scenario is when you call a add replica to a collection command, and the replica is attempting to be created where a remnant exists.

The third scenario is a test of CAN you manually delete a core remnant by using the Unload Core direct command. Turns out you can't because Solr checks for a core descriptor, and since it doesn't exist in ZK, it won't continue. I suppose you could use the setting to ignore that check and allow it to go through? However, scenarios one and two cover it.

Solution

Look at what changes to existing logic are required to deal with remnants existing. I'm not trying to figure out why we get remnants, as I don't have visibility into how it happens at this time.

Two scenarios that I've seen are adding a new collection where a core remnants already exists and adding a replica where the core remnants already exists.

Tests

Added a Junit test that demonstrates with and without the parameter. I wanted both in case at some point in the future the default behavior changes when it comes to remnants. I will delete the BATS test before merging.

@epugh epugh changed the title bats test to simulate remnant core SOLR-18008: bats test to simulate various issues with remnant core Dec 1, 2025
@epugh epugh changed the title SOLR-18008: bats test to simulate various issues with remnant core SOLR-18008: Remnant core files prevent Collections for working Dec 1, 2025
Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not see this solution...

@epugh epugh marked this pull request as ready for review December 31, 2025 15:07
@github-actions github-actions bot added documentation Improvements or additions to documentation client:solrj labels Dec 31, 2025
@epugh
Copy link
Contributor Author

epugh commented Dec 31, 2025

Okay, two things.

  1. If the JUnit test looks good, then I'll delete the bats test.

  2. I'm thinking that maybe we want to slip into the current 10x release process a update to the system property name. In this PR i removed the .startup. from the name because it's not a startup ONLY thing. However in release 10.0 we moved the name from solr.deleteUnknownCores to solr.cloud.startup.delete.unknown.cores.enabled.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses SOLR-18008, which prevents collections from working when remnant core files exist on disk. The solution introduces automatic deletion of remnant core directories when creating new cores, controlled by renaming and properly configuring the system property from solr.cloud.startup.delete.unknown.cores.enabled to solr.cloud.delete.unknown.cores.enabled.

Key Changes

  • Renamed system property to solr.cloud.delete.unknown.cores.enabled for broader applicability beyond just startup scenarios
  • Added core deletion logic in CoreContainer.create() to handle remnants during core creation
  • Comprehensive test coverage for collection creation and replica addition scenarios with remnants

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
solr/solrj/src/resources/DeprecatedSystemPropertyMappings.properties Maps the old property name to the new property name for backward compatibility, though the mapping appears to have an issue with direction
solr/solr-ref-guide/modules/configuration-guide/pages/solr-properties.adoc Updates documentation to reflect the renamed property
solr/packaging/test/test_create_collection_with_remnants.bats Adds BATS integration tests for remnant handling (marked for deletion before merge)
solr/core/src/test/org/apache/solr/cloud/DeleteInactiveReplicaTest.java Updates existing test to use the new property name
solr/core/src/test/org/apache/solr/cloud/DeleteCoreRemnantsOnCreateTest.java Adds comprehensive JUnit tests covering collection creation and replica addition with remnants
solr/core/src/java/org/apache/solr/core/CoreContainer.java Implements the core deletion logic and updates property name references throughout

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 1517 to 1522
if (deleteUnknownCores && Files.exists(cd.getInstanceDir())) {
log.warn(
"There appears to be an existing directory for core {}, now deleting it",
cd.getName());
SolrCore.deleteUnloadedCore(cd, true, true);
}
Copy link

Copilot AI Jan 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deletion logic on lines 1517-1522 deletes the remnant directory immediately before calling coresLocator.create() on line 1526. However, there's a potential race condition: if another process or thread creates files in the instance directory between the deletion (line 1521) and the coresLocator.create() call (line 1526), this could lead to unexpected behavior. Consider checking if the directory exists again right before calling coresLocator.create() or handling potential FileAlreadyExistsException from coresLocator.create().

Copilot uses AI. Check for mistakes.
@epugh epugh requested a review from dsmiley January 3, 2026 14:32
@epugh
Copy link
Contributor Author

epugh commented Jan 3, 2026

@dsmiley tests are cleaned up, and I investigated if the core admin level api would work (it doesn't!). I'd love to get this in soon if it looks okay to you.

@epugh epugh closed this Jan 3, 2026
@epugh epugh reopened this Jan 3, 2026
configureCluster(1).addConfig("conf", configset("cloud-minimal")).configure();
}

@Before
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needless

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered about that as well, and then got the "well, maybe it is requried", I should have tried removing it!

Comment on lines +75 to +77
Replica primaryReplica = getReplicaOnNode(collectionName, "shard1", primaryNode);
JettySolrRunner primaryJetty = cluster.getReplicaJetty(primaryReplica);
String originalCoreName = primaryReplica.getCoreName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead you could simply call org.apache.solr.core.CoreContainer#getAllCoreNames and grab the first.

@@ -44,7 +44,7 @@ NOTE: Properties marked with "!" indicate inverted meaning between pre Solr 10 a

|solr.cloud.prep.recovery.read.timeout.additional.ms|prepRecoveryReadTimeoutExtraWait|8000|Specifies additional milliseconds to wait during recovery read operations in SolrCloud mode.

|solr.cloud.startup.delete.unknown.cores.enabled|solr.deleteUnknownCores|false|Controls whether unknown cores are deleted at startup in SolrCloud mode.
|solr.cloud.delete.unknown.cores.enabled|solr.deleteUnknownCores|false|Determines if unknown cores should be removed at startup and during collection and replica creation in SolrCloud mode.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I view these things as separate:

  • Deleting unexpected cores (according to ZK state.json as truth) on startup.
  • Deleting unexpected preexisting core data on replica creation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave it another stab...

@epugh
Copy link
Contributor Author

epugh commented Jan 4, 2026

Responded to all the feedback, I think this is ready to go in the next few days!

NOTE: In previous version of Solr, the `min_rf` parameter had to be specified to ask Solr for the achieved replication factor.
Now it is always included in the response.

== Solr Core Lifecycle Inconsistency
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate you want to document this, and that this page isn't a terrible/obviously-wrong choice, but IMO it further degrades a very important page that should be more focused on how key aspects of SolrCloud works.

Perhaps what we should do is log a warning with this suggestion that will change. Maybe also reference the JIRA issue where an interested user can get more background / advice. After all, it's an obscure matter. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

humm... log a warning if we bump into the core remnant problem and you haven't enabled the feature, and reference that JIRA? That could work...

What do you think about a FAQ style page or just a new page about "tips and tricks with SolrCloud", would that be generic enough that this and other bits of advice could go there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until we have a number of tips & tricks, it'd be a pretty lame FAQ. And the number of such that might be applicable could be vast for Solr. And I wonder if anyone would bother to actually read the FAQ.

Another suitable page might be the collection creation or replica creation APIs.

But my preference remains for logging a warning with a reference to the setting. It's not either-or.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I moved the text to the create collection, and then I added a warning from solr when the situation is hit. Thanks!


final boolean deleteUnknownCores =
EnvUtils.getPropertyAsBool("solr.cloud.delete.unknown.cores.enabled", false);
if (deleteUnknownCores && Files.exists(cd.getInstanceDir())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this logic be adjacent to the logic that fails in CoreProperitesLocator.exists? (I assume that's the spot). Doing so would be better organized and probably lead to combining in a more clear way with coordinated logging and existence check. I forget if this was discussed already

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe... Honestly, in this PR, I was attempting to make minimal changes to CoreContainer as it's a beast already and NOT an area i have worked much in the past. I don't have a good overall understanding of it as this PR is my first time actually touching it...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If CorePropertiesLocator was the beast (not CoreContainer), that would explain why you did what you did, but this is the inverse situation, and you chose to touch CoreContainer when you could have avoided it. Let's move it then; ehh? Any way, I'm here for peer review so don't be afraid ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me two tries to understand your point, and then it clicked.... I made the change, and am running tests and then will push up...

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still these this should be a separate setting than combined but I won't push the matter any further.

SolrCore.deleteUnloadedCore(cd, true, true);
} else {
log.warn(
"Directory at [{}] for core[{}] already exists preventing create operation. Set solr.cloud.delete.unknown.cores.enabled=true to delete directory. (SOLR-18008)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change "preventing" to "may prevent". As I look at the logic, it appears an empty dir may work since the code following checkForExistingCore only looks for core.properties existing.

checkForExistingCore(cd);
Path propertiesFile = cd.getInstanceDir().resolve(PROPERTIES_FILENAME);
if (Files.exists(propertiesFile))
throw new SolrException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine but I want to clarify/remind that my last review suggested the move of the logic into here could be an opportunity to integrate with the existing logic. It means triggering the logs/errors based on core.properites instead of the parent dir (seems fine). Any way, this is fine if you wish.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cat:cloud client:solrj documentation Improvements or additions to documentation tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants