QueryDir optimization v1 #22
Conversation
Pull request overview
This PR introduces an optimized SMB2 QueryDirectory path aimed at improving readdir performance, including multi-credit/large-buffer support, better handling of encrypted large responses, and updates to directory entry caching and buffer lifetime management.
Changes:
- Add async, multi-credit QueryDirectory receive/callback path and a synchronous wrapper that can request large output buffers.
- Extend SMB2 QueryDir “first” compound sequence to issue a second minimal QueryDirectory to detect/append additional entries.
- Refactor cached directory-entry emission/caching helpers and introduce a new “dynamic buffer” tracking mode for correct freeing.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| fs/smb/client/transport.c | Adds QueryDirectory async receive/discard logic and builds a combined response buffer. |
| fs/smb/client/trace.h | Adds a new credit trace tag for query-dir completion. |
| fs/smb/client/smb2proto.h | Extends QueryDirectory init API and declares new handlers. |
| fs/smb/client/smb2pdu.h | Adds QueryDir sizing constants/macros and max buffer sizing. |
| fs/smb/client/smb2pdu.c | Implements large-buffer QueryDirectory helper and updates QueryDirectory init/free/parse paths. |
| fs/smb/client/smb2ops.c | Implements QueryDir callback/data handling, adds encrypted QueryDir support, updates query_dir_first/next logic. |
| fs/smb/client/smb2misc.c | Allows larger QueryDirectory data-area lengths in size validation. |
| fs/smb/client/smb1ops.c | Updates query_dir_next signature to match ops interface change. |
| fs/smb/client/readdir.c | Updates query_dir_next invocation and refactors cached-dir emission/caching calls. |
| fs/smb/client/inode.c | Adds cached-dir invalidation hooks on unlink/rmdir/rename paths. |
| fs/smb/client/file.c | Updates closedir buffer free logic for dynamic buffers. |
| fs/smb/client/cifsproto.h | Exposes the new QueryDirectory receive handler prototype. |
| fs/smb/client/cifsglob.h | Extends query_dir_next ops signature; adds dynamic buffer flag and QueryDir IO context struct. |
| fs/smb/client/cached_dir.h | Exposes cached-dir helper APIs moved out of readdir.c. |
| fs/smb/client/cached_dir.c | Implements cached-dir helper APIs and adds additional cached-dir open validation. |
	struct cifs_query_dir_io *qd_io = mid->callback_data;

	length = cifs_discard_remaining_data(server);
	dequeue_mid(server, mid, qd_io->result != 0);
cifs_query_dir_discard() marks the mid as malformed whenever qd_io->result != 0. For QueryDirectory, STATUS_NO_MORE_FILES maps to -ENODATA (see SMB2_query_directory() handling in smb2pdu.c) and should be treated as a normal response (MID_RESPONSE_RECEIVED), not MID_RESPONSE_MALFORMED. As-is, end-of-directory will be converted into a malformed mid state and the callback will overwrite the -ENODATA with an EIO-style error. Consider passing a separate 'malformed' flag (true only for framing/size/offset corruption) and treating protocol status errors (including -ENODATA) as non-malformed.
Suggested change:
-	dequeue_mid(server, mid, qd_io->result != 0);
+	/*
+	 * qd_io->result carries QueryDirectory completion status and must not
+	 * be used to infer transport/frame corruption. Normal protocol status
+	 * values such as -ENODATA (end of directory) should leave the MID in a
+	 * non-malformed state.
+	 */
+	dequeue_mid(server, mid, false);
	dequeue_mid(server, mid, false);
	mid->resp_buf = server->smallbuf;
	server->smallbuf = NULL;
	return length;
cifs_query_dir_receive() can return a negative value after it has already dequeued the mid (e.g., via discard_and_queue or cifs_query_dir_discard). In cifs_demultiplex_thread() a negative return causes the callback to be skipped, which means SMB2_query_directory_large() (waiting on qd_io->done) can hang indefinitely. After dequeuing the mid, ensure the receive handler returns >= 0 so the callback runs and completion is signaled (or signal completion directly before returning).
Suggested change:
-	return length;
+	/*
+	 * After dequeuing the mid, return a non-negative value so the
+	 * demultiplex thread still runs the callback. Any operation error
+	 * has already been recorded in qd_io->result for the callback path.
+	 */
+	return length < 0 ? 0 : length;
	/*
	 * Build combined header+data in qd_io->combined_iov.iov_base.
	 * Layout: [header][data] with no padding between them.
	 * We update OutputBufferOffset in the header to reflect the new layout.
	 */
	if (qd_io->iov[0].iov_len > 0 && qd_io->result == 0) {
		size_t hdr_len = qd_io->iov[0].iov_len;
		struct smb2_query_directory_rsp *combined_rsp;

		/* Copy header to beginning of combined buffer */
		memcpy(qd_io->combined_iov.iov_base, qd_io->iov[0].iov_base, hdr_len);

		/* Read directory entries directly after the header, removing any padding */
		if (data_len > 0) {
			length = cifs_read_from_socket(server,
						       qd_io->combined_iov.iov_base + hdr_len,
						       data_len);
			if (length < 0) {
				qd_io->result = length;
				goto discard_and_queue;
			}
			server->total_read += length;

			/* Update OutputBufferOffset and OutputBufferLength to reflect
			 * the new compact layout (header followed immediately by data).
			 */
			combined_rsp = (struct smb2_query_directory_rsp *)qd_io->combined_iov.iov_base;
			combined_rsp->OutputBufferOffset = cpu_to_le16(hdr_len);
			combined_rsp->OutputBufferLength = cpu_to_le32(length);

			/* Set up second iov pointing to the directory data within combined buffer */
			qd_io->iov[1].iov_base = qd_io->combined_iov.iov_base + hdr_len;
			qd_io->iov[1].iov_len = length;
The receive handler compacts the response by removing padding and then mutates OutputBufferOffset/OutputBufferLength in the copied header. This makes it impossible to verify SMB2 signing against the exact on-the-wire bytes, and also makes behavior diverge from the encrypted path which preserves the wire layout. Consider preserving the wire layout in the combined buffer (copy header+padding up to data_offset, read data at data_offset) and avoid modifying the header until after signature verification (or not at all).
	if (server->sign && !mid->decrypted) {
		int rc;
		struct smb_rqst rqst = { .rq_iov = &qd_io->iov[0], .rq_nvec = 1 };

		rc = smb2_verify_signature(&rqst, server);
		if (rc) {
			cifs_tcon_dbg(VFS, "QueryDir signature verification returned error = %d\n",
				      rc);
			qd_io->result = rc;
		}
smb2_query_dir_callback() verifies the signature using only qd_io->iov[0] with rq_nvec=1. For QueryDirectory responses with a data payload, SMB2 signing covers the full response (header + any padding + data), so verifying only the header will either fail incorrectly or provide incomplete integrity checking. Build the smb_rqst to cover the entire response (e.g., use both qd_io->iov[0..1] with rq_nvec=2, or an rq_iter like the READ path) and ensure those buffers reflect the original wire layout (including padding).
@@ -5551,12 +5736,6 @@ int SMB2_query_directory_init(const unsigned int xid,
 	req->FileNameOffset =
 		cpu_to_le16(sizeof(struct smb2_query_directory_req));
 	req->FileNameLength = cpu_to_le16(len);
-	/*
-	 * BB could be 30 bytes or so longer if we used SMB2 specific
-	 * buffer lengths, but this is safe and close enough.
-	 */
-	output_size = min_t(unsigned int, output_size, server->maxBuf);
-	output_size = min_t(unsigned int, output_size, 2 << 15);
 	req->OutputBufferLength = cpu_to_le32(output_size);
SMB2_query_directory_init() no longer caps OutputBufferLength to server->maxBuf and the historical 64KiB limit (2<<15). With the new callers passing large output_size values, this can exceed what the server negotiated/supports and may lead to STATUS_INVALID_PARAMETER or other failures on servers that don't accept large QueryDirectory buffers. Consider restoring a cap (at least min(output_size, server->maxBuf)), and ensure credit charging/output sizing stays consistent with the negotiated maximums.
Already handled in the callers. The cap uses the negotiated rsize.
	if (!open_cached_dir_by_dentry(tcon, parent, &parent_cfid)) {
		mutex_lock(&parent_cfid->dirents.de_mutex);
		parent_cfid->dirents.is_valid = false;
		parent_cfid->dirents.is_failed = true;
cifs_invalidate_cached_dir() sets parent_cfid->dirents.is_failed = true. In cached_dir.c, is_failed permanently disables caching for that directory (add_cached_dirent/update_cached_dirents_count/finished_cached_dirents_count all bail out when is_failed is set), and there is no reset path. If the intent is to invalidate and allow the cache to be rebuilt on the next scan, this should likely clear cached entries and set is_valid=false while leaving is_failed=false (or explicitly reset is_failed to false).
Suggested change:
-		parent_cfid->dirents.is_failed = true;
+		parent_cfid->dirents.is_failed = false;
	cifs_dbg(FYI, "%s: processing encrypted QueryDirectory response\n", __func__);

	if (shdr->Command != SMB2_QUERY_DIRECTORY) {
handle_query_dir_data() compares shdr->Command directly against SMB2_QUERY_DIRECTORY, but shdr->Command is little-endian on the wire (__le16). This will break on big-endian architectures. Use le16_to_cpu(shdr->Command) for the comparison (similar to the command checks added in receive_encrypted_read/smb2_decrypt_offload).
Suggested change:
-	if (shdr->Command != SMB2_QUERY_DIRECTORY) {
+	if (le16_to_cpu(shdr->Command) != SMB2_QUERY_DIRECTORY) {
This should not be a problem, as SMB2_QUERY_DIRECTORY is already of type __le16.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated no new comments.
Force-pushed from bd7b859 to 9ef390d.
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.
	if (pTcon->nocase && !open_cached_dir_by_dentry(pTcon, direntry->d_parent, &cfid)) {
		struct qstr qname = QSTR_INIT(direntry->d_name.name, direntry->d_name.len);
		int rc_wait;

		rc_wait = cifs_wait_for_pending_dcache(cfid, qname.name, qname.len);
In this new fast-path for negative dentries, cfid is used to decide whether the parent directory is fully cached. The subsequent check of cfid->dirents.is_valid should be done under cfid->dirents.de_mutex (or via READ_ONCE with a clear lockless contract), since is_valid is updated under de_mutex during cache population/invalidation. Otherwise this can race and incorrectly skip the on-the-wire lookup.
static unsigned long cifs_dircache_shrinker_scan(struct shrinker *shrink,
						 struct shrink_control *sc)
{
	unsigned long freed_bytes;

	(void)shrink;

	if (!sc->nr_to_scan)
		return 0;

	if (!atomic64_read(&cifs_dircache_bytes_used))
		return SHRINK_STOP;

	freed_bytes = cifs_drop_all_dir_caches();
	if (!freed_bytes)
		return SHRINK_STOP;

	return DIV_ROUND_UP_ULL(freed_bytes, PAGE_SIZE);
cifs_dircache_shrinker_scan() calls cifs_drop_all_dir_caches(), which flushes cfid_put_wq via flush_workqueue(). Shrinker scan callbacks can run in memory reclaim paths, and synchronously flushing a workqueue here can cause long stalls and risks reclaim/worker deadlocks (especially if the work items themselves need memory). Consider making the shrinker scan non-blocking (e.g., queue invalidation and return, or reclaim a bounded amount of cache without waiting on workqueues).
 	rc = SMB2_query_directory_init(xid, tcon, server,
 				       &rqst[1],
 				       COMPOUND_FID, COMPOUND_FID,
-				       0, srch_inf->info_level);
+				       0, srch_inf->info_level,
+				       SMB2_QD1_OUTPUT_SIZE(CIFSMaxBufSize));
In the Create+QD1+QD2 compound path, QD1’s OutputBufferLength is computed from CIFSMaxBufSize (via SMB2_QD1_OUTPUT_SIZE) and passed into SMB2_query_directory_init(). Since SMB2_query_directory_init() no longer caps OutputBufferLength (and this compound path does not set CreditCharge / wait for multi-credit), a larger CIFSMaxBufSize (module param allows >64KiB) can produce an OutputBufferLength > SMB2_MAX_BUFFER_SIZE (64KiB) and may be rejected by servers due to insufficient CreditCharge. Consider clamping the QD1 output size to SMB2_MAX_BUFFER_SIZE (or server->maxBuf) for this compound, or explicitly set CreditCharge and ensure the compound credit reservation accounts for it.
Today we skip calling change_conf for negotiates and session setup requests. This can be a problem for mchan as the immediate next call after session setup could be due to an I/O that is made on the mount point. For single channel, this is not a problem as there will be several calls after setting up session. This change enforces calling change_conf when the total credits contain enough for reservations for echoes and oplocks. We expect this to happen during the last session setup response. This way, echoes and oplocks are not disabled before the first request to the server. So if that first request is an open, it does not need to disable requesting leases. Cc: <stable@vger.kernel.org> Reviewed-by: Bharath SM <bharathsm@microsoft.com> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
It is possible that SMB2_open_init may not set lease context based on the requested oplock level. This can happen when leases have been temporarily or permanently disabled. When this happens, we will have open_cached_dir making an open without lease context and the response will anyway be rejected by open_cached_dir (thereby forcing a close to discard this open). That's unnecessary two round-trips to the server. This change adds a check before making the open request to the server to make sure that SMB2_open_init did add the expected lease context to the open in open_cached_dir. Cc: <stable@vger.kernel.org> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Force-pushed from 9ef390d to 889e9ac.
Today we do not invalidate the cached_dirent or the entire parent cfid when a dentry in a dir has been removed/moved. This change invalidates the parent cfid so that we don't serve directory contents from the cache. Cc: <stable@vger.kernel.org> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
QueryDirectory responses today are stored in one of two fixed-size buffers: smallbuf (448 bytes) or bigbuf (16KB). These are borrowed from the server struct and are not sufficient for large-sized query dir operations. With this change we will now define a new buffer type specifically for cifs_search_info to hold variable-sized responses. These will be allocated by kmalloc and freed by kfree. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
For small directories (where the entire directory contents could be read in a single QueryDir request), we currently do an extra round-trip just to get a STATUS_NO_MORE_FILES back from the server. This change avoids doing that by adding another QueryDir to the first compound to the server for readdir. i.e. first request to readdir will correspond to a compound of (OPEN+QD1+QD2). QD2 will request for a smaller size (in anticipation of STATUS_NO_MORE_FILES). Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Today QueryDirectory uses the compound_send_recv infrastructure which is limited to 16KB in size. As a result, readdir of large directories generally take several round trips. With this change, if the readdir needs a QueryDir after the first round-trip (meaning that there are more dirents to read), then the following QueryDirs will now switch to using larger buffers with MTU credits. Till now, the only command type that used this flow was SMB2_READ. In case of encrypted response, it becomes challenging to decide if the response is for SMB2_READ or SMB2_READDIR. This change reuses receive_encrypted_read and after decrypting the response decides the handling function based on the command in the resp header. That way, care has been taken to ensure that the read code path modifications on account of this change are kept to a minimum. The change also renames the discard function since both read and query_dir paths will now reuse the same function to discard remaining data in the socket and dequeue the mid. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Currently, we have helper functions for cfid and dirent caching spread across cached_dir.c and readdir.c, with de_mutex locking done inside the calling functions. This change neatly wraps them in helper functions and keeps all such functions in cached_dir.c. This code also splits the logic of dirent emit into cache and to VFS into different functions. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Today all synchronization of cfid related data structures are done using cfid_list_lock. This can serialize caching of different dirs unnecessarily. This change introduces two new locks to provide finer locking. Every cfid will now have a cfid_lock. This is designed to protect everything inside a cfid that is not related to list operations. Every cfid will now also have a cfid_open_mutex. This is designed to protect parallel open calls to the same dir. Additionally, this change will now make accesses to cfid->dentries more stringent using the de_mutex. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
When a cached_dirents population is underway but not yet fully populated, cifs_readdir does not rely on the local cache, but makes a parallel stream of QueryDir calls to the server. However, these calls are made without lease key and that ends up breaking our dir lease. This change will reuse the existing lease key for this scenario for the parallel QueryDir calls that are made. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Today the cached_dirents is a linked list with one entry per dentry. This is very inefficient from the point of view of memory allocation and memory management. This change introduces a hybrid structure. cached_dirents will start with maintaining a linked list for cached_dirents for small directories. When the size of the directory (in terms of number of dirents) exceeds a threshold (64), cached_dirents will now switch over to a folioq structure to store the cached_dirents. The idea is to reduce the number of memory allocations significantly for large directories. Additionally, this change also tries to store short names (names less than 64 bytes) in the folio itself, further reducing the memory allocation calls. If the namelen is greater than 64 bytes or if the folio does not have space to store more names, it falls back to kmalloc. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
When a directory lease is held, we do not need to invalidate the dirent cache on the cfid when new dentries are added. This change allows local adds to directory contents to be made in the cached_dirents so that we don't need to fetch again from server. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Since the cached_dirents are now backed by folioq, we do not need a timed cleanup of these cfids anymore. This change registers a shrinker with the mm layer for the dircache. If mm needs to free up memory or flush the cache, it can inform cifs about this using the shrinker interface. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Today there is no way to disable time-based eviction of dir cache. dir_cache_timeout = 0 meant immediate free up of dir cache on next laundromat scan. We already have nohandlecache to disable dir cache. This changes the meaning of dir_cache_timeout = 0 to mean unlimited timeout. Shrinker-based eviction is still possible. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Today we can control the number of cached dirs on the system using the mount option max_cached_dirs. The default value is 16. This change allows setting this option to a special value of 0, which means unlimited number of cached dirs. Shrinker can still evict the cached dirs. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Today cifs_readdir populates the dcache inline whenever it emits dentries. This introduces a perf bottleneck in readdir performance. This change attempts to make dcache population asynchronous and parallel by doing this inside workers. It introduces a new flag that lets reval and lookup wait for dentry population instead. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
This change introduces ftrace tracepoints for cached_dir operations. Intended to be used for debugging cached_dir. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
If discard functions for readv and query_dir return error, the callback functions can be skipped. This can end up with hung syscalls due to the completion functions not getting called. This change ensures that both these discard functions call the callback function when discard from socket returned error. This ensures that at least after the mid for the response is found, the callback doesn't get skipped, and we do not leave syscalls waiting. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Today tcon->cfids are maintained as a linked list. When we do not limit the number of cfids (limited by system memory), this can be a problem if the mount has a large number of active dir accesses. We soon start hitting perf bottlenecks. This change stores cfids in a rbtree instead of a linked list. The nodes of this tree are keyed based on the path inside the tcon. Additionally, for faster lookups by dentry, introducing a hashtable to allow lookups of cfids by dentry faster. This is important since open_cached_dir_by_dentry is called quite frequently. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
In order to make sure that parallel readdirs do not populate the cfid->cached_dirents, only the first readdir is given "ownership" of populating cached_dirents. However, if the next readdir on the same FD never arrives, we will always miss the dirent cache. This change introduces a 10-second timeout which will be used by laundromat thread to check if the cached_dirents can be invalidated. Ten seconds is a long enough interval between successive readdir calls. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Force-pushed from 889e9ac to 2a1f960.
No description provided.