Skip to content

Conversation

@stefanhengl
Copy link
Member

@stefanhengl stefanhengl commented Aug 20, 2025

We want Zoekt and Sourcegraph to use the same language package. In this PR we move the languages package from Sourcegraph to Zoekt, so that Zoekt can use it and Sourcegraph can import it.

Notes:

  • Zoekt doesn't need to fetch content async which is why I added a little helper func GetLanguagesFromContent to make the call sites in Zoekt less awkward.
  • Sourcegraph's languages package always classified .cls files as Apex, while Zoekt did a content based check. With this PR we follow Zoekt's approach. Specifically, I removed .cls from unsupportedByEnryExtensionToNameMap. I added an additional unit test to cover this case.

Test plan:
I appended the test cases from the old Zoekt languages packages to the tests I copied over from Sourcegraph

@stefanhengl stefanhengl changed the title languages: migrate languages package from Sourcegraph to Zoekt copy languages package from Sourcegraph to Zoekt Aug 20, 2025
@stefanhengl stefanhengl marked this pull request as ready for review August 20, 2025 12:25
Copy link
Contributor

@varungandhi-src varungandhi-src left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one minor comment.

Comment on lines 411 to 414
langs := languages.GetLanguagesFromContent(doc.Name, nil)
if len(langs) > 0 {
doc.Language = langs[0]
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify the logic a bit be something like:

var content []byte
if doc.SkipReason == SkipReasonNone {
    content = doc.Content
}
langs := languages.GetLanguagesFromContent(doc.Name, content)
if len(langs) > 0 {
    doc.Language = langs[0]
}

@keegancsmith
Copy link
Member

One minor request: can you add a package docstring. Given this is now an exported api seems useful to ensure everything looks good via go doc -all ./languages. Additionally I notice that NormalizeLanguage has no docstring.

@stefanhengl stefanhengl force-pushed the stefan-splf-1349-migrate-languages-package-from-sg-to-zoekt branch from 262c82b to 2dc96a0 Compare August 21, 2025 09:49
@stefanhengl stefanhengl merged commit fb492e2 into main Aug 21, 2025
8 checks passed
@stefanhengl stefanhengl deleted the stefan-splf-1349-migrate-languages-package-from-sg-to-zoekt branch August 21, 2025 10:02
minhna1112 added a commit to minhna1112/zoekt that referenced this pull request Oct 16, 2025
* bump go-ctags for handling non-fatal startup errors (sourcegraph#972)

Currently if you switch to universal-ctags 6.2.0 it will log a few
warnings on startup. The old version of go-ctags would error out. Now
instead it will log to its info log.

Test Plan: Added the problematic file in the linked issue to ./foo/repo
directory. Then ran "go run ./cmd/zoekt-index -require_ctags -index
./foo/index ./foo/repo" with uctags 6.2.0.

* switch to github.com/sourcegraph/cody-public-snapshot for e2e test (sourcegraph#974)

e2e tests were failing because github.com/sourcegraph/cody repository has been set to private and all tags were removed

Test plan:
CI

* sourcegraph: remove GRPC index methods (sourcegraph#977)

This was part of an effort to move the queue from Zoekt to Sourcegraph.
However, we are not going to pursue this for now and we can remove
the corresponging grpc methods.

Sourcegraph never had callers of Index and Delete, so both are save to remove.

`DeleteAllData` is still being called from Sourcegraph.

Test plan:
CI

* chore: fix grpc drift (sourcegraph#978)

In sourcegraph#962 a new field metadata
was added to zoekt.Repository, but it was not added to the grpc
types and converter methods, which is why the fuzz test occasionally
fails.

Test plan:
CI

* copy languages package from Sourcegraph to Zoekt (sourcegraph#979)

We want Zoekt and Sourcegraph to use the same language package. In this PR we move the languages package from Sourcegraph to Zoekt, so that Zoekt can use it and Sourcegraph can import it.

Notes: 
- Zoekt doesn't need to fetch content async which is why I added a little helper func `GetLanguagesFromContent` to make the call sites in Zoekt less awkward.
- Sourcegraph's languages package always classified .cls files as Apex, while Zoekt did a content based check. With this PR we follow Zoekt's approach. Specifically, I removed .cls from `unsupportedByEnryExtensionToNameMap`. I added an additional unit test to cover this case.

Test plan:
I appended the test cases from the old Zoekt languages packages to the tests I copied over from Sourcegraph

* gitindex: move package from internal (sourcegraph#981)

Co-authored-by: ARyzhikov <ARyzhikov@nota.tech>

* Cache docMatchTree results for Meta conditions (sourcegraph#982)

* go mod tidy (sourcegraph#983)

* Add support for indexing submodules without repocache (sourcegraph#985)

* Fix git.Open() fs argument for .git dir instead of worktree

The filesystem argument submitted to git.Open() is of the .git
directory, it should be for the worktree, as in go-git this logic is copied from:

https://github.com/go-git/go-git/blob/145daf2492dd86b1d7e7787555ebd9837b93bff8/repository.go#L373-L375

One side effect of this bug was that go-git would refuse to fetch repo
submodules

* Fix missing RepoURLs and LineFragments for result subrepositories

RepoURLs for subrepositories are currently filtered out when results
are streamed per-repo from shards.

* Add submodule indexing without repo cache

When not using repo cache, index git submodules recursively into

subrepos using the go-git API

* Adjust subrepo naming conventions

- Revert repo-cache behavior to unchanged
- Add support for naming repos and modules with no remote (use the dirname for the
  former and subrepopath for the latter)
- Add test for submodule no-repo-cache behavior

* Fix overallocation in repoURL and lineFragment filtering

* Revert basename treatment from 08f6759

Missed the fact that it is already done in zoekt-git-index
command, the test was improperly configured

https://github.com/sourcegraph/zoekt/blob/4e4a529c3b63c7d4c7897ba736f1cd52cc163134/cmd/zoekt-git-index/main.go#L93-L100

---------

Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Co-authored-by: Aleksander Ryzhickov <makseljoinb@gmail.com>
Co-authored-by: ARyzhikov <ARyzhikov@nota.tech>
Co-authored-by: John Mason <john@johnmason.io>
Co-authored-by: Phillip Dykman <Phil.d324@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants