diff --git a/docs/4. repository/1. sourcify-database.mdx b/docs/4. repository/1. sourcify-database.mdx
index 7baf810..6fa39e8 100644
--- a/docs/4. repository/1. sourcify-database.mdx
+++ b/docs/4. repository/1. sourcify-database.mdx
@@ -1,6 +1,6 @@
# Sourcify Database
-Sourcify Database is the main storage backend for Sourcify. It is a PostgreSQL database that follows the [Verified Alliance Schema](https://github.com/verifier-alliance/database-specs) as its base with few modifications.
+Sourcify Database is the main storage backend for Sourcify. It is a PostgreSQL database that follows the [Verifier Alliance Schema](https://github.com/verifier-alliance/database-specs) as its base with a few modifications.
On a high level, these modifications are:
@@ -65,84 +65,12 @@ Other known inconsistencies in the data below (not planned to fix) are documente
## Download
-:::warning Deprecation Notice
-The current parquet download format will be deprecated. A new `/v2` endpoint will be introduced with an updated format. Documentation for the new format will be added once it is live. Feel free to use the export in its current form, but be aware that it will be replaced.
-:::
+See [Download the Dataset](/docs/repository/download-dataset/) for instructions on downloading the database in Parquet format.
-We dump the whole database daily in [Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) format and upload it to a Cloudflare R2 storage. You can access the manifest file at https://export.sourcify.dev ( `.dev` redirects to `.app` domain, which also belongs to Sourcify). The script that does the dump is at [sourcifyeth/parquet-export](https://github.com/sourcifyeth/parquet-export).
-
-[export.sourcify.dev](https://export.sourcify.dev) will redirect to a `manifest.json` file:
-
-
-manifest.json
-
-```json
-{
- "timestamp": 1726030203254,
- "dateStr": "2024-09-11T04:50:03.254904Z",
- "files": {
- "code": [
- "code/code_0_100000.parquet",
- "code/code_100000_200000.parquet",
- ...
- "code/code_2700000_2800000.parquet"
- ],
- "contracts": [
- "contracts/contracts_0_1000000.parquet",
- ...
- "contracts/contracts_4000000_5000000.parquet"
- ],
- "contract_deployments": [
- "contract_deployments/contract_deployments_0_1000000.parquet",
- ...
- "contract_deployments/contract_deployments_5000000_6000000.parquet"
- ],
- "compiled_contracts": [
- "compiled_contracts/compiled_contracts_0_5000.parquet",
- ...
- "compiled_contracts/compiled_contracts_815000_820000.parquet"
- ],
- "verified_contracts": [
- "verified_contracts/verified_contracts_0_1000000.parquet",
- ...
- "verified_contracts/verified_contracts_5000000_6000000.parquet"
- ],
- "sourcify_matches": [
- "sourcify_matches/sourcify_matches_0_100000.parquet",
- ...
- "sourcify_matches/sourcify_matches_5300000_5400000.parquet"
- ]
- }
-}
-```
-
-
-
-You can download all the files and use a parquet client to query, inspect, or process the data.
-
-1. Download the manifest file (`-L` to follow redirects):
-
- ```bash
- curl -L -O https://export.sourcify.dev/manifest.json
- ```
-
-2. Download all the tables listed in the manifest:
- ```bash
- jq -r '.files | keys[] as $k | .[$k][]' manifest.json | xargs -I {} curl -L -O https://export.sourcify.dev/{}
- ```
-
-For example you can install the [`parquet-cli`](https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md) to do basic inspection:
-
-```bash
-brew install parquet-cli
-
-parquet meta compiled_contracts_0_5000.parquet
-```
-
-alternatively use your favorite data processing tool or import this data into a database.
-
-## BigQuery Datasets
+## BigQuery Dataset
We also provide a public BigQuery dataset for convenient querying and exploration:
-[Sourcify production dataset](https://console.cloud.google.com/bigquery/analytics-hub/exchanges/projects/1019539084286/locations/europe-west1/dataExchanges/sourcify_19a0c79ef3a/listings/sourcify_19a0c7d0be2?project=tranquil-petal-125711)
+[Sourcify BigQuery dataset](https://console.cloud.google.com/bigquery/analytics-hub/exchanges/projects/1019539084286/locations/europe-west1/dataExchanges/sourcify_19a0c79ef3a/listings/sourcify_19a0c7d0be2?project=tranquil-petal-125711)
+
+The dataset is updated continuously as new contracts are verified. You need a Google account to access it.
diff --git a/docs/4. repository/2. download-dataset.mdx b/docs/4. repository/2. download-dataset.mdx
new file mode 100644
index 0000000..ca2c56e
--- /dev/null
+++ b/docs/4. repository/2. download-dataset.mdx
@@ -0,0 +1,146 @@
+# Download the Dataset
+
+:::warning
+
+The previous Parquet export format v1 is now deprecated. See the [note](/docs/repository/download-dataset/#legacy-format-v1) below. Please follow the [instructions](/docs/repository/download-dataset/#export-v2-format) for the new v2 format.
+
+:::
+
+The entire Sourcify Database is exported continuously as [Parquet](https://github.com/apache/parquet-format) files, a modern columnar data format. Parquet files are compressed, efficient to query, and widely supported by data tools (see this [quick tutorial](https://www.datacamp.com/tutorial/apache-parquet)).
+
+The export is hosted on Google Cloud Storage and accessible via an S3-compatible API at [export.sourcify.dev](https://export.sourcify.dev/). The export is based on the structure of the [Verifier Alliance database export](https://verifieralliance.org/docs/download).
+
+## Export v2 Format
+
+The export format has undergone a redesign to make it more efficient and easier to use. The v2 format follows these principles:
+
+- New data is uploaded **daily**.
+- Each database **table** is stored as a set of Parquet files.
+- Files are partitioned by row ranges and **ordered** by `created_at` timestamps. Exception: the `sourcify_matches` table is ordered by `updated_at` timestamps; see the [note](/docs/repository/download-dataset/#note-on-sourcify_matches) below.
+- **Append-only** pattern: New data is added to new files; existing files are not modified. Only the most recent file for each table may be updated while it is not full yet.
+- **File metadata** (checksums, sizes, timestamps) is provided directly by the Google Cloud Storage API.
+- Files use **zstd compression** built into the Parquet format.
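+
+The partitioning rule can be illustrated with a minimal Python sketch. It assumes that object keys follow the pattern `v2/<table>/<table>_<start>_<end>.parquet`, which is inferred from example keys such as `v2/code/code_0_100000.parquet` and is not a documented guarantee. It also assumes that the file with the highest starting row is the most recent one. The names `parse_key` and `latest_file` are hypothetical helpers.
+
+```python
+import re
+from typing import NamedTuple
+
+# Assumed naming pattern, inferred from example keys; not a documented contract.
+KEY_PATTERN = re.compile(
+    r"^v2/(?P<table>\w+)/(?P=table)_(?P<start>\d+)_(?P<end>\d+)\.parquet$"
+)
+
+
+class ExportFile(NamedTuple):
+    table: str
+    start: int
+    end: int
+
+
+def parse_key(key: str) -> ExportFile:
+    """Split an export object key into table name and row-range bounds."""
+    match = KEY_PATTERN.match(key)
+    if match is None:
+        raise ValueError(f"unexpected key format: {key}")
+    return ExportFile(match["table"], int(match["start"]), int(match["end"]))
+
+
+def latest_file(keys: list[str]) -> str:
+    """Return the key with the highest starting row (assumed to be the newest)."""
+    return max(keys, key=lambda key: parse_key(key).start)
+```
+
+Under the append-only pattern, the key returned by `latest_file` would be the only file of a table worth re-checking for changes.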
+
+The dataset is available at [export.sourcify.dev](https://export.sourcify.dev/). All files of the v2 format are stored under the `v2/` prefix.
+
+### Downloading and Syncing the Dataset
+
+To download the entire dataset, you can run this command:
+
+```bash
+curl -s 'https://export.sourcify.dev/?prefix=v2/' | \
+  grep -oP '(?<=<Key>)[^<]+' | \
+  xargs -I {} curl -L -O https://export.sourcify.dev/{}
+```
+
+Alternatively, the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#getting-started-install-instructions) makes it easy to download and keep the dataset in sync. The following command downloads the entire dataset on the first run, and on subsequent runs only downloads new or modified files:
+
+```bash
+aws s3 sync s3://sourcify-parquet-export/v2/ ./sourcify-dataset --endpoint-url https://storage.googleapis.com --no-sign-request
+```
+
+### Note on `sourcify_matches`
+
+The `sourcify_matches` table is the only table that is not append-only and can be updated in the underlying Sourcify Database.
+Therefore, its rows are ordered by `updated_at` timestamps when exported.
+This means that rows with the same `id` may appear multiple times in the export files.
+
+When working with the `sourcify_matches` table from the export, please only consider the row with the most recent `updated_at` for each `id` as the current state. For importing the `sourcify_matches` parquet files into a database, an **upsert** operation should be used.
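+
+The "latest row wins" rule can be sketched in plain Python. This is only an illustration: it assumes the rows have already been read from the Parquet files into dictionaries with `id` and `updated_at` fields, and the `note` column is a made-up placeholder for the remaining columns.
+
+```python
+from datetime import datetime
+
+
+def current_matches(rows: list[dict]) -> dict:
+    """Keep, for each id, only the row with the most recent updated_at."""
+    latest: dict = {}
+    for row in rows:
+        seen = latest.get(row["id"])
+        if seen is None or row["updated_at"] > seen["updated_at"]:
+            latest[row["id"]] = row
+    return latest
+
+
+# Hypothetical sample: id 1 appears twice because it was updated after export.
+sample = [
+    {"id": 1, "updated_at": datetime(2025, 1, 1), "note": "old"},
+    {"id": 2, "updated_at": datetime(2025, 1, 2), "note": "only"},
+    {"id": 1, "updated_at": datetime(2025, 1, 3), "note": "new"},
+]
+state = current_matches(sample)
+print(state[1]["note"])  # prints "new"
+```
+
+A database import would achieve the same effect with an upsert keyed on `id`.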
+
+
+### Working with Parquet Files
+
+Once downloaded, you can query and analyze Parquet files using various tools and libraries. Here are some popular options to give you a head start:
+
+- [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html): Read data from Parquet files in Python
+- [DuckDB](https://duckdb.org/docs/data/parquet): SQL queries on Parquet files
+- [pg_parquet](https://github.com/CrunchyData/pg_parquet): PostgreSQL extension for copying Parquet data into a Postgres database
+
+### API
+
+For more fine-grained control, you can browse and download files directly using the S3-compatible Google Cloud Storage API:
+
+**List all v2 files:**
+
+```
+https://export.sourcify.dev/?prefix=v2/
+```
+
+**List files for a specific table:**
+
+```
+https://export.sourcify.dev/?prefix=v2/verified_contracts/
+```
+
+**Download a specific file:**
+
+```
+https://export.sourcify.dev/v2/verified_contracts/verified_contracts_0_1000000.parquet
+```
+
+The API returns XML responses following the [Google Cloud Storage XML API specification](https://cloud.google.com/storage/docs/xml-api/get-bucket-list).
+
+#### Available Tables
+
+The Parquet export is available for all Sourcify Database tables: `sourcify_matches`, `verified_contracts`, `sources`, `compiled_contracts_sources`, `compiled_contracts`, `contract_deployments`, `contracts`, `code`, `compiled_contracts_signatures`, and `signatures`.
+
+#### API Parameters
+
+The most important parameters of the listing API are the following:
+
+- **prefix**: Filter results to objects whose names begin with this prefix (e.g., `?prefix=v2/verified_contracts/`)
+- **marker**: Start listing after this object name (for pagination)
+- **max-keys**: Maximum number of objects to return in one response
+
+The response from the listing API might be truncated, which is indicated by the `IsTruncated` field of the result. The `marker` parameter can be used to paginate through results by setting it to the `NextMarker` of the previous response.
+
+Example with pagination:
+
+```
+https://export.sourcify.dev/?prefix=v2/verified_contracts/&max-keys=2&marker=v2/verified_contracts/verified_contracts_1000000_2000000.parquet
+```
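+
+The pagination logic can be sketched with the Python standard library. This is a minimal sketch, not an official client: `parse_listing` and `listing_url` are hypothetical helper names, and the code assumes the response contains `Key`, `IsTruncated`, and `NextMarker` elements as described above. XML namespaces are stripped so that the sketch works whether or not the response declares one.
+
+```python
+import xml.etree.ElementTree as ET
+from typing import Optional
+from urllib.parse import urlencode
+
+BASE_URL = "https://export.sourcify.dev/"
+
+
+def local_name(tag: str) -> str:
+    """Drop an XML namespace prefix, e.g. '{uri}Key' -> 'Key'."""
+    return tag.rsplit("}", 1)[-1]
+
+
+def parse_listing(xml_text: str) -> tuple[list[str], Optional[str]]:
+    """Return the keys of one page and the marker for the next page, if any."""
+    keys: list[str] = []
+    truncated = False
+    next_marker = None
+    for element in ET.fromstring(xml_text).iter():
+        name = local_name(element.tag)
+        if name == "Key":
+            keys.append(element.text or "")
+        elif name == "IsTruncated":
+            truncated = (element.text or "").strip().lower() == "true"
+        elif name == "NextMarker":
+            next_marker = element.text
+    # Only follow the marker when the server reports a truncated listing.
+    return keys, (next_marker if truncated else None)
+
+
+def listing_url(prefix: str, marker: Optional[str] = None) -> str:
+    """Build a listing URL, adding the marker parameter when paginating."""
+    params = {"prefix": prefix}
+    if marker:
+        params["marker"] = marker
+    return BASE_URL + "?" + urlencode(params, safe="/")
+```
+
+A full listing would fetch `listing_url(prefix)`, pass the response body to `parse_listing`, and repeat with the returned marker until it is `None`.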
+
+#### Metadata
+
+The listing API provides detailed metadata for each of the Parquet files:
+
+```xml
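+<ListBucketResult>
+  <Name>sourcify-parquet-export</Name>
+  <Prefix>v2/</Prefix>
+  <Marker></Marker>
+  <IsTruncated>false</IsTruncated>
+  <Contents>
+    <Key>v2/code/code_0_100000.parquet</Key>
+    <Generation>1766065018286394</Generation>
+    <MetaGeneration>1</MetaGeneration>
+    <LastModified>2025-12-18T13:36:58.292Z</LastModified>
+    <ETag>"ba687acd0afab85ed203a593479f0ce3"</ETag>
+    <Size>101591414</Size>
+  </Contents>
+  <!-- ... -->
+</ListBucketResult>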
+```
+
+Most important fields:
+
+- **Key**: The file path (download at `https://export.sourcify.dev/{Key}`)
+- **LastModified**: When the file was last uploaded/modified
+- **ETag**: MD5 hash of the file contents (use this to detect changes)
+- **Size**: File size in bytes
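+
+These fields are enough to decide whether a local copy is still current. The following is a minimal sketch that takes the statement above at face value, namely that the `ETag` is the hex MD5 of the file contents; if an object's ETag is not an MD5 hash, the comparison simply reports a mismatch. `md5_hex` and `is_up_to_date` are hypothetical helper names.
+
+```python
+import hashlib
+from pathlib import Path
+
+
+def md5_hex(path: Path, chunk_size: int = 1 << 20) -> str:
+    """Compute the hex MD5 of a file, reading it in chunks."""
+    digest = hashlib.md5()
+    with path.open("rb") as handle:
+        for chunk in iter(lambda: handle.read(chunk_size), b""):
+            digest.update(chunk)
+    return digest.hexdigest()
+
+
+def is_up_to_date(path: Path, etag: str, size: int) -> bool:
+    """True if the local file matches the Size and ETag from the listing."""
+    # Compare sizes first, so that hashing is skipped for obvious mismatches.
+    if not path.is_file() or path.stat().st_size != size:
+        return False
+    # The listing wraps the ETag in double quotes; strip them before comparing.
+    return md5_hex(path) == etag.strip('"').lower()
+```
+
+A sync script could call `is_up_to_date` for each listed `Key` and download only the files for which it returns `False`.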
+
+## Legacy Format (v1)
+
+:::warning Deprecation Notice
+
+The v1 Parquet export format is **no longer updated**. All new data is only available in the v2 format. Please migrate to v2 for access to current data.
+
+:::
+
+The legacy v1 format files can still be accessed via non-prefixed paths in the bucket (e.g., `https://export.sourcify.dev/verified_contracts/verified_contracts_0_1000000.parquet`).
+
+The v1 format used a JSON manifest file at [https://export.sourcify.dev/manifest.json](https://export.sourcify.dev/manifest.json) listing all available Parquet files. However, this format was not append-only. Each daily export regenerated all files, requiring users to download the entire dataset again after every update. The manifest also did not include checksums or modification timestamps, making it difficult to determine what changed between exports.
+
+## Export Script
+
+The source code of the export script is available at [https://github.com/sourcifyeth/parquet-export](https://github.com/sourcifyeth/parquet-export).
diff --git a/docs/4. repository/2. file-repositories.mdx b/docs/4. repository/3. file-repositories.mdx
similarity index 83%
rename from docs/4. repository/2. file-repositories.mdx
rename to docs/4. repository/3. file-repositories.mdx
index 1fdc125..1d2c9a4 100644
--- a/docs/4. repository/2. file-repositories.mdx
+++ b/docs/4. repository/3. file-repositories.mdx
@@ -2,14 +2,15 @@ import TotalRepoSize from "./TotalRepoSize"
# File Repositories
-This page describes the `RepositoryV1` and `RepositoryV2`, which are file systems (deprecated).
+:::danger Deprecation Notice
+The file repositories are **deprecated** and no longer supported. The [Sourcify Database](/docs/repository/sourcify-database/) serves as the source of truth now.
-:::warning
-The file repositories are used by the legacy API that is deprecated. Please use APIv2 and the [Database](/docs/repository/sourcify-database/) as the main storage backend.
+Only the IPFS pinning service still uses the logical structure of RepositoryV2 for uploading files to IPFS.
-You can still use RepositoryV2 just to save files to be pinned on IPFS.
+For custom Sourcify instances, we recommend migrating to APIv2 and the [Database](/docs/repository/sourcify-database/) as the main storage backend. See the [migration guide](/docs/database-migration).
:::
+This page describes the `RepositoryV1` and `RepositoryV2`, which are file systems (deprecated).
## Table of Contents
@@ -74,16 +75,15 @@ The files are exactly the same so their IPFS hashes will not change, and you can
## IPFS
-Unfortunatelly publishing under IPNS is temporarily disabled. This is because of the difficulty of managing the whole filesystem over IPFS (with MFS etc.) and updating the IPNS regularly.
+The sources of all verified contracts are pinned on IPFS. The logical structure of RepositoryV2 serves as the basis for uploading to IPFS. Files can be accessed via their individual CIDs (e.g. [`QmVij3h9z536ZG5cRpUmTfdoN9KR1Xp4ix2P7to9dPHgE5`](https://ipfs.io/ipfs/QmVij3h9z536ZG5cRpUmTfdoN9KR1Xp4ix2P7to9dPHgE5)).
-We still pin all the files on IPFS so you can access them over their individual CIDs (e.g. [`QmVij3h9z536ZG5cRpUmTfdoN9KR1Xp4ix2P7to9dPHgE5`](https://ipfs.io/ipfs/QmVij3h9z536ZG5cRpUmTfdoN9KR1Xp4ix2P7to9dPHgE5)).
+Unfortunately, publishing under IPNS is temporarily disabled. This is because of the difficulty of managing the whole filesystem over IPFS (with MFS etc.) and updating the IPNS regularly.
-Look at the [Download section](#download) to learn how to download the whole repository.
## Download
:::danger No New Exports
-Following deprecating the filesystem based repositories, **we no longer publish new exports**. We recommend resorting to the [Parquet exports](/docs/repository/sourcify-database/#download) instead.
+Following the deprecation of the filesystem-based repositories, **we no longer publish new exports**. We recommend using the [Parquet exports](/docs/repository/download-dataset/) instead.
You can still download the existing export for a while. Double check the date of the export in the manifest file. If you need these exports please reach out to us.
:::
diff --git a/docs/4. repository/3. signature-database.mdx b/docs/4. repository/4. signature-database.mdx
similarity index 88%
rename from docs/4. repository/3. signature-database.mdx
rename to docs/4. repository/4. signature-database.mdx
index 6a3ac37..3c456ea 100644
--- a/docs/4. repository/3. signature-database.mdx
+++ b/docs/4. repository/4. signature-database.mdx
@@ -6,5 +6,5 @@ The data is stored in the same database as the verified contracts. The `signatur
- **Schema**: Check [docs/repository/sourcify-database/#schema](/docs/repository/sourcify-database/#schema) for the schema.
- **API**: Check [docs/api/](/docs/api/) for the API.
-- **Download**: You can download the related tables in Parquet format from [export.sourcify.dev](https://export.sourcify.dev). See [/docs/repository/sourcify-database/#download](/docs/repository/sourcify-database/#download) for more details.
+- **Download**: You can download the related tables in Parquet format from [export.sourcify.dev](https://export.sourcify.dev). See [Download the Dataset](/docs/repository/download-dataset/) for more details.
- **Playground**: Visit [4byte.sourcify.dev](https://4byte.sourcify.dev) to search for signatures.
\ No newline at end of file
diff --git a/docs/4. repository/index.mdx b/docs/4. repository/index.mdx
index 7b6e94c..8599d19 100644
--- a/docs/4. repository/index.mdx
+++ b/docs/4. repository/index.mdx
@@ -1,16 +1,13 @@
-# Contract Repository
+# Contract Dataset
-Sourcify stores the contracts in multiple storage backends and gives the option to choose which one to use. In short there are the following options:
+Sourcify stores all verified contracts in multiple storage backends and offers several ways to access the dataset. In short, there are the following options:
-- `RepositoryV1`
-- `RepositoryV2`
-- `SourcifyDatabase`
-- `AllianceDatabase`
+- **Sourcify Database**: Sourcify's source of truth, a [PostgreSQL database](/docs/repository/sourcify-database). Accessible via the [API](/docs/api/) and the [Repo UI](https://repo.sourcify.dev/).
+- **Verifier Alliance Database**: Shared database with other verification services. See the [Verifier Alliance](https://verifieralliance.org/) website for more info.
+- **BigQuery**: For convenience, the Sourcify dataset is uploaded to [BigQuery](/docs/repository/sourcify-database/#bigquery-dataset).
+- **IPFS**: The sources of all verified contracts are pinned on [IPFS](/docs/repository/file-repositories/#ipfs).
-For details see [Choosing the storage backend](https://github.com/argotorg/sourcify/tree/staging/services/server#choosing-the-storage-backend).
## Download
-You can download the whole contract file repository in zips or the Sourcify database in Parquet format. Follow the guides in each page:
-- [Download RepositoryV2](/docs/repository/file-repositories/#download)
-- [Download SourcifyDatabase](/docs/repository/sourcify-database/#download)
+You can download the Sourcify database in Parquet format. Follow this guide: [Download the Dataset](/docs/repository/download-dataset/).
diff --git a/src/css/custom.css b/src/css/custom.css
index e194582..2524998 100644
--- a/src/css/custom.css
+++ b/src/css/custom.css
@@ -48,6 +48,10 @@ h4 {
font-family: "VT323";
}
+h4 {
+ font-size: 1.3rem;
+}
+
.navbar__logo > img {
border-radius: 9999px;
}