-
Notifications
You must be signed in to change notification settings - Fork 800
SOLR-18008: Remnant core files prevent Collections for working #3904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
dsmiley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not see this solution...
Replica adds previously failed too.
…res.enabled setting.
|
Okay, two things.
|
Some reorganization of tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR addresses SOLR-18008, which prevents collections from working when remnant core files exist on disk. The solution introduces automatic deletion of remnant core directories when creating new cores, controlled by renaming and properly configuring the system property from solr.cloud.startup.delete.unknown.cores.enabled to solr.cloud.delete.unknown.cores.enabled.
Key Changes
- Renamed system property to
solr.cloud.delete.unknown.cores.enabledfor broader applicability beyond just startup scenarios - Added core deletion logic in
CoreContainer.create()to handle remnants during core creation - Comprehensive test coverage for collection creation and replica addition scenarios with remnants
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| solr/solrj/src/resources/DeprecatedSystemPropertyMappings.properties | Maps the old property name to the new property name for backward compatibility, though the mapping appears to have an issue with direction |
| solr/solr-ref-guide/modules/configuration-guide/pages/solr-properties.adoc | Updates documentation to reflect the renamed property |
| solr/packaging/test/test_create_collection_with_remnants.bats | Adds BATS integration tests for remnant handling (marked for deletion before merge) |
| solr/core/src/test/org/apache/solr/cloud/DeleteInactiveReplicaTest.java | Updates existing test to use the new property name |
| solr/core/src/test/org/apache/solr/cloud/DeleteCoreRemnantsOnCreateTest.java | Adds comprehensive JUnit tests covering collection creation and replica addition with remnants |
| solr/core/src/java/org/apache/solr/core/CoreContainer.java | Implements the core deletion logic and updates property name references throughout |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
solr/core/src/test/org/apache/solr/cloud/DeleteCoreRemnantsOnCreateTest.java
Show resolved
Hide resolved
solr/solrj/src/resources/DeprecatedSystemPropertyMappings.properties
Outdated
Show resolved
Hide resolved
| if (deleteUnknownCores && Files.exists(cd.getInstanceDir())) { | ||
| log.warn( | ||
| "There appears to be an existing directory for core {}, now deleting it", | ||
| cd.getName()); | ||
| SolrCore.deleteUnloadedCore(cd, true, true); | ||
| } |
Copilot
AI
Jan 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deletion logic on lines 1517-1522 deletes the remnant directory immediately before calling coresLocator.create() on line 1526. However, there's a potential race condition: if another process or thread creates files in the instance directory between the deletion (line 1521) and the coresLocator.create() call (line 1526), this could lead to unexpected behavior. Consider checking if the directory exists again right before calling coresLocator.create() or handling potential FileAlreadyExistsException from coresLocator.create().
solr/core/src/test/org/apache/solr/cloud/DeleteCoreRemnantsOnCreateTest.java
Show resolved
Hide resolved
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…nto simulate_solr_core_remnants
|
@dsmiley tests are cleaned up, and I investigated if the core admin level api would work (it doesn't!). I'd love to get this in soon if it looks okay to you. |
| configureCluster(1).addConfig("conf", configset("cloud-minimal")).configure(); | ||
| } | ||
|
|
||
| @Before |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needless
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered about that as well, and then got the "well, maybe it is requried", I should have tried removing it!
solr/core/src/test/org/apache/solr/cloud/DeleteCoreRemnantsOnCreateTest.java
Outdated
Show resolved
Hide resolved
| Replica primaryReplica = getReplicaOnNode(collectionName, "shard1", primaryNode); | ||
| JettySolrRunner primaryJetty = cluster.getReplicaJetty(primaryReplica); | ||
| String originalCoreName = primaryReplica.getCoreName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead you could simply call org.apache.solr.core.CoreContainer#getAllCoreNames and grab the first.
solr/core/src/test/org/apache/solr/cloud/DeleteCoreRemnantsOnCreateTest.java
Outdated
Show resolved
Hide resolved
changelog/unreleased/SOLR-18008-simulate_solr_core_remnants.yml
Outdated
Show resolved
Hide resolved
| @@ -44,7 +44,7 @@ NOTE: Properties marked with "!" indicate inverted meaning between pre Solr 10 a | |||
|
|
|||
| |solr.cloud.prep.recovery.read.timeout.additional.ms|prepRecoveryReadTimeoutExtraWait|8000|Specifies additional milliseconds to wait during recovery read operations in SolrCloud mode. | |||
|
|
|||
| |solr.cloud.startup.delete.unknown.cores.enabled|solr.deleteUnknownCores|false|Controls whether unknown cores are deleted at startup in SolrCloud mode. | |||
| |solr.cloud.delete.unknown.cores.enabled|solr.deleteUnknownCores|false|Determines if unknown cores should be removed at startup and during collection and replica creation in SolrCloud mode. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I view these things as separate:
- Deleting unexpected cores (according to ZK state.json as truth) on startup.
- Deleting unexpected preexisting core data on replica creation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave it another stab...
solr/core/src/test/org/apache/solr/cloud/DeleteCoreRemnantsOnCreateTest.java
Outdated
Show resolved
Hide resolved
|
Responded to all the feedback, I think this is ready to go in the next few days! |
| NOTE: In previous version of Solr, the `min_rf` parameter had to be specified to ask Solr for the achieved replication factor. | ||
| Now it is always included in the response. | ||
|
|
||
| == Solr Core Lifecycle Inconsistency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate you want to document this, and that this page isn't a terrible/obviously-wrong choice, but IMO it further degrades a very important page that should be more focused on how key aspects of SolrCloud works.
Perhaps what we should do is log a warning with this suggestion that will change. Maybe also reference the JIRA issue where an interested user can get more background / advice. After all, it's an obscure matter. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
humm... log a warning if we bump into the core remnant problem and you haven't enabled the feature, and reference that JIRA? That could work...
What do you think about a FAQ style page or just a new page about "tips and tricks with SolrCloud", would that be generic enough that this and other bits of advice could go there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Until we have a number of tips & tricks, it'd be a pretty lame FAQ. And the number of such that might be applicable could be vast for Solr. And I wonder if anyone would bother to actually read the FAQ.
Another suitable page might be the collection creation or replica creation APIs.
But my preference remains for logging a warning with a reference to the setting. It's not either-or.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I moved the text to the create collection, and then I added a warning from solr when the situation is hit. Thanks!
|
|
||
| final boolean deleteUnknownCores = | ||
| EnvUtils.getPropertyAsBool("solr.cloud.delete.unknown.cores.enabled", false); | ||
| if (deleteUnknownCores && Files.exists(cd.getInstanceDir())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this logic be adjacent to the logic that fails in CoreProperitesLocator.exists? (I assume that's the spot). Doing so would be better organized and probably lead to combining in a more clear way with coordinated logging and existence check. I forget if this was discussed already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe... Honestly, in this PR, I was attempting to make minimal changes to CoreContainer as it's a beast already and NOT an area i have worked much in the past. I don't have a good overall understanding of it as this PR is my first time actually touching it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If CorePropertiesLocator was the beast (not CoreContainer), that would explain why you did what you did, but this is the inverse situation, and you chose to touch CoreContainer when you could have avoided it. Let's move it then; ehh? Any way, I'm here for peer review so don't be afraid ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me two tries to understand your point, and then it clicked.... I made the change, and am running tests and then will push up...
dsmiley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still these this should be a separate setting than combined but I won't push the matter any further.
| SolrCore.deleteUnloadedCore(cd, true, true); | ||
| } else { | ||
| log.warn( | ||
| "Directory at [{}] for core[{}] already exists preventing create operation. Set solr.cloud.delete.unknown.cores.enabled=true to delete directory. (SOLR-18008)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change "preventing" to "may prevent". As I look at the logic, it appears an empty dir may work since the code following checkForExistingCore only looks for core.properties existing.
| checkForExistingCore(cd); | ||
| Path propertiesFile = cd.getInstanceDir().resolve(PROPERTIES_FILENAME); | ||
| if (Files.exists(propertiesFile)) | ||
| throw new SolrException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine but I want to clarify/remind that my last review suggested the move of the logic into here could be an opportunity to integrate with the existing logic. It means triggering the logs/errors based on core.properites instead of the parent dir (seems fine). Any way, this is fine if you wish.
https://issues.apache.org/jira/browse/SOLR-18008
Description
I've added some tests that simulate what happens if you have remnant core files that collide with new core creation.
The first scenario is when you call a create collection command, and on one of the cores, there already is remnant remaining.
The second scenario is when you call a add replica to a collection command, and the replica is attempting to be created where a remnant exists.
The third scenario is a test of CAN you manually delete a core remnant by using the Unload Core direct command. Turns out you can't because Solr checks for a core descriptor, and since it doesn't exist in ZK, it won't continue. I suppose you could use the setting to ignore that check and allow it to go through? However, scenarios one and two cover it.
Solution
Look at what changes to existing logic are required to deal with remnants existing. I'm not trying to figure out why we get remnants, as I don't have visibility into how it happens at this time.
Two scenarios that I've seen are adding a new collection where a core remnants already exists and adding a replica where the core remnants already exists.
Tests
Added a Junit test that demonstrates with and without the parameter. I wanted both in case at some point in the future the default behavior changes when it comes to remnants. I will delete the BATS test before merging.