10 changes: 10 additions & 0 deletions .env.template
@@ -25,3 +25,13 @@ USER_AUTH_TOKEN_SECRET=user-auth-token-secret
USER_2FA_SECRET=user-2fa-secret
# Experimental features
EXPERIMENTAL_FEATURES=false

# DOCX/PDF conversion (optional)
# Maximum number of DOCX->PDF conversions running in parallel per process
DOCX_PDF_MAX_CONCURRENCY=2
# Set to true only if the Chromium sandbox cannot be enabled in your runtime
PUPPETEER_NO_SANDBOX=false
# Optional Puppeteer settings for controlled environments
# PUPPETEER_CACHE_DIR=/var/cache/puppeteer
# PUPPETEER_SKIP_DOWNLOAD=true
# PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser
Binary file modified .yarn/install-state.gz
Binary file not shown.
54 changes: 54 additions & 0 deletions README.md
@@ -104,3 +104,57 @@ By default, migrations are applied to the `public` schema; if you need to update
```shell
yarn dbmigrate:create --name=add-table-to-survey-schema-db-table --schema=survey
```

### DOCX to PDF conversion runtime notes

The DOCX to PDF helper converts the document to HTML with `mammoth` and prints that HTML to PDF in a headless Chromium driven by `puppeteer`. This has deployment implications:

- Installing `puppeteer` runs a postinstall step that downloads a Chromium build.
- Install time and artifact size increase significantly compared with typical Node.js dependencies.
- In restricted CI/CD or production networks, the browser download can fail unless proxy or mirror settings are configured.

#### Linux runtime requirements

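When using the bundled Chromium, make sure your runtime image or host includes the shared libraries Chromium depends on.
Typical Debian/Ubuntu packages are listed below. The `t64` suffixes apply to newer releases such as Ubuntu 24.04; on older releases, use `libasound2`, `libcups2`, and `libgtk-3-0` instead: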

```shell
sudo apt-get update && sudo apt-get install -y \
ca-certificates fonts-liberation libasound2t64 libatk-bridge2.0-0 libatk1.0-0 \
libc6 libcairo2 libcups2t64 libdbus-1-3 libexpat1 libfontconfig1 libgbm1 \
libglib2.0-0 libgtk-3-0t64 libnspr4 libnss3 libpango-1.0-0 libx11-6 \
libx11-xcb1 libxcb1 libxcomposite1 libxdamage1 libxext6 libxfixes3 libxrandr2 \
xdg-utils
```

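If your distro enforces sandbox restrictions (for example, Ubuntu with AppArmor user-namespace restrictions), configure the host sandbox appropriately. The converter retries the launch without the sandbox when Chromium reports a sandbox error at startup. To skip the sandboxed attempt entirely, as a last resort, set: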

```shell
PUPPETEER_NO_SANDBOX=true
```

This is less secure and should be used only in trusted environments.
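
As a rough illustration of how the flag can be turned into launch options, the sketch below maps the environment to Chromium arguments. The helper name `resolveSandboxArgs` and the `Env` type are hypothetical and exist only for this example; the two flags are the ones the converter passes when the sandbox is disabled.

```typescript
// Hypothetical helper: maps PUPPETEER_NO_SANDBOX to Chromium launch arguments.
type Env = Record<string, string | undefined>

const resolveSandboxArgs = (env: Env): string[] =>
  // Only the exact string 'true' disables the sandbox; any other value keeps it enabled.
  env.PUPPETEER_NO_SANDBOX === 'true' ? ['--no-sandbox', '--disable-setuid-sandbox'] : []

console.log(resolveSandboxArgs({ PUPPETEER_NO_SANDBOX: 'true' })) // both no-sandbox flags
console.log(resolveSandboxArgs({ PUPPETEER_NO_SANDBOX: 'TRUE' })) // empty: the comparison is case-sensitive
```

The resulting array would be passed as the `args` option of `puppeteer.launch`.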

#### Cache, proxies, and download controls

Useful environment variables:

- `PUPPETEER_CACHE_DIR`: where Chromium binaries are cached.
- `HTTP_PROXY` / `HTTPS_PROXY` / `NO_PROXY`: proxy configuration for the Chromium download and for runtime network access.
- `PUPPETEER_SKIP_DOWNLOAD=true`: skip the bundled Chromium download (requires a system Chromium and `PUPPETEER_EXECUTABLE_PATH`).
- `PUPPETEER_EXECUTABLE_PATH`: path to a managed system Chromium/Chrome executable.

#### Concurrency control

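DOCX to PDF conversion is CPU and memory intensive. The converter runs at most a fixed number of conversions in parallel; additional requests wait in a first-in, first-out queue until a slot is free.
The default limit is 2, and non-numeric or non-positive values fall back to 2. Tune it with: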

```shell
DOCX_PDF_MAX_CONCURRENCY=2
```
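
A minimal sketch of how such a value can be interpreted, along the lines of the parsing in `src/utils/docxConverter.ts`. The helper name `parseMaxConcurrency` is hypothetical; in this sketch, anything that is not a number of at least 1 falls back to the default, and fractional values are rounded down.

```typescript
// Hypothetical helper illustrating how DOCX_PDF_MAX_CONCURRENCY could be read.
const defaultMaxConcurrency = 2

const parseMaxConcurrency = (rawValue: string | undefined): number => {
  const parsed = Number(rawValue ?? defaultMaxConcurrency)
  // Reject NaN, Infinity, and anything below 1; round fractions down.
  return Number.isFinite(parsed) && parsed >= 1 ? Math.floor(parsed) : defaultMaxConcurrency
}

console.log(parseMaxConcurrency('4')) // 4
console.log(parseMaxConcurrency('abc')) // 2
console.log(parseMaxConcurrency('0.5')) // 2
```

Requiring at least 1 matters because a limit of 0 would leave every conversion waiting in the queue.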

#### Should you use puppeteer-core instead?

- Use `puppeteer` when you want dependency-managed Chromium and simpler setup.
- Use `puppeteer-core` when production images already provide Chromium/Chrome and you want smaller/faster installs.
- If you switch to `puppeteer-core`, make `PUPPETEER_EXECUTABLE_PATH` (or an equivalent app-specific env var) mandatory in deployment.
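
One way to make the executable path mandatory is a startup check along these lines. This is a sketch under stated assumptions: the function name `requireExecutablePath` is hypothetical and not part of this package, and the check only verifies that the path is set and exists on disk.

```typescript
import { existsSync } from 'node:fs'

// Hypothetical startup check: fail fast when no usable Chromium path is configured.
const requireExecutablePath = (env: Record<string, string | undefined>): string => {
  const executablePath = env.PUPPETEER_EXECUTABLE_PATH
  if (!executablePath) {
    throw new Error('PUPPETEER_EXECUTABLE_PATH must be set when Chromium is not bundled')
  }
  if (!existsSync(executablePath)) {
    throw new Error(`Chromium executable not found at ${executablePath}`)
  }
  return executablePath
}
```

Calling `requireExecutablePath(process.env)` during application startup surfaces a misconfigured image before the first conversion request arrives.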
2 changes: 2 additions & 0 deletions package.json
@@ -58,11 +58,13 @@
    "jsonwebtoken": "^9.0.3",
    "lodash.throttle": "^4.1.1",
    "log4js": "^6.9.1",
    "mammoth": "^1.12.0",
    "otplib": "^13.4.0",
    "passport": "^0.7.0",
    "passport-jwt": "^4.0.1",
    "passport-local": "^1.0.0",
    "pg-promise": "^12.6.2",
    "puppeteer": "^24.43.1",
    "socket.io": "^4.8.3"
  },
  "scripts": {
2 changes: 1 addition & 1 deletion src/index.ts
@@ -43,4 +43,4 @@ export type { WorkerErrorMessage, WorkerMessage } from './thread'

export { WebSocketEvent, WebSocketServer } from './webSocket'

-export { Requests, Responses } from './utils'
+export { DocxConverter, Requests, Responses } from './utils'
184 changes: 184 additions & 0 deletions src/utils/docxConverter.ts
@@ -0,0 +1,184 @@
import path from 'node:path'
import fs from 'node:fs'
import { randomUUID } from 'node:crypto'
import mammoth from 'mammoth'
import puppeteer from 'puppeteer'

import { ProcessEnv } from '../processEnv'

const pdfPageFormat = 'A4'
const pdfPageMargin = { top: '20mm', right: '20mm', bottom: '20mm', left: '20mm' }
const pageLoadTimeoutMs = 15000
const parsedMaxConcurrency = Number(process.env.DOCX_PDF_MAX_CONCURRENCY ?? 2)
// Require at least 1: a fractional value such as 0.5 would otherwise floor to 0 and stall the queue.
const maxPdfConversionsInParallel =
  Number.isFinite(parsedMaxConcurrency) && parsedMaxConcurrency >= 1 ? Math.floor(parsedMaxConcurrency) : 2

let runningPdfConversions = 0
const pendingPdfConversions: Array<() => void> = []
let sharedBrowser: Awaited<ReturnType<typeof puppeteer.launch>> | null = null
let sharedBrowserPromise: Promise<Awaited<ReturnType<typeof puppeteer.launch>>> | null = null

const acquireConversionSlot = async (): Promise<() => void> =>
  new Promise((resolve) => {
    const grantSlot = () => {
      runningPdfConversions += 1
      resolve(() => {
        runningPdfConversions -= 1
        const next = pendingPdfConversions.shift()
        if (next) {
          next()
        }
      })
    }

    if (runningPdfConversions < maxPdfConversionsInParallel) {
      grantSlot()
    } else {
      pendingPdfConversions.push(grantSlot)
    }
  })

const isAllowedRequestUrl = (url: string): boolean =>
  url.startsWith('data:') || url === 'about:blank' || url.startsWith('blob:')

const isSandboxLaunchError = (error: unknown): boolean => {
  const message = error instanceof Error ? error.message : String(error)
  return message.includes('No usable sandbox') || message.includes('zygote_host_impl_linux')
}

const launchBrowser = async (): Promise<Awaited<ReturnType<typeof puppeteer.launch>>> => {
  const forceNoSandbox = process.env.PUPPETEER_NO_SANDBOX === 'true'

  if (forceNoSandbox) {
    return puppeteer.launch({
      headless: true,
      args: ['--no-sandbox', '--disable-setuid-sandbox'],
    })
  }

  try {
    return await puppeteer.launch({ headless: true })
  } catch (error: unknown) {
    if (!isSandboxLaunchError(error)) {
      throw error
    }

    // The sandbox is unavailable on this host: retry without it (less secure).
    return puppeteer.launch({
      headless: true,
      args: ['--no-sandbox', '--disable-setuid-sandbox'],
    })
  }
}

const getSharedBrowser = async (): Promise<Awaited<ReturnType<typeof puppeteer.launch>>> => {
  if (sharedBrowser) {
    return sharedBrowser
  }

  if (sharedBrowserPromise) {
    return sharedBrowserPromise
  }

  sharedBrowserPromise = launchBrowser()
    .then((browser) => {
      sharedBrowser = browser
      sharedBrowserPromise = null
      browser.on('disconnected', () => {
        sharedBrowser = null
      })
      return browser
    })
    .catch((error) => {
      sharedBrowserPromise = null
      throw error
    })

  return sharedBrowserPromise
}

const toPrintableHtml = (bodyHtml: string): string => `
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style>
@page { size: A4; margin: 20mm; }
html, body { font-family: Calibri, Arial, sans-serif; font-size: 12pt; line-height: 1.4; }
img { max-width: 100%; height: auto; }
table { border-collapse: collapse; width: 100%; }
th, td { vertical-align: top; }
p { margin: 0 0 8pt 0; }
</style>
</head>
<body>${bodyHtml}</body>
</html>`

/**
* Converts a DOCX file (provided as a Buffer) to PDF.
* @param inputBuffer - The DOCX file as a Buffer.
* @param outputPath - The path where the PDF file will be saved. If omitted, a uniquely named file is created in the temp folder.
* @returns The path to the generated PDF file.
*/
const convertDocxToPdf = async (inputBuffer: Buffer, outputPath?: string): Promise<string> => {
  const tempDir = ProcessEnv.tempFolder
  const id = randomUUID()
  fs.mkdirSync(tempDir, { recursive: true })
  const resolvedOutputPath = outputPath || path.join(tempDir, `temp-${id}.pdf`)
  const outputDir = path.dirname(resolvedOutputPath)
  fs.mkdirSync(outputDir, { recursive: true })

  const releaseConversionSlot = await acquireConversionSlot()
  let page: Awaited<ReturnType<Awaited<ReturnType<typeof puppeteer.launch>>['newPage']>> | null = null

  try {
    const conversion = await mammoth.convertToHtml({ buffer: inputBuffer })
    const printableHtml = toPrintableHtml(conversion.value)

    const browser = await getSharedBrowser()
    page = await browser.newPage()
    // Hardening: abort non-local requests to avoid SSRF/network egress and flaky remote fetches.
    await page.setRequestInterception(true)
    page.on('request', (request) => {
      if (isAllowedRequestUrl(request.url())) {
        request.continue().catch(() => undefined)
      } else {
        request.abort().catch(() => undefined)
      }
    })

    page.setDefaultNavigationTimeout(pageLoadTimeoutMs)
    page.setDefaultTimeout(pageLoadTimeoutMs)
    await page.setContent(printableHtml, { waitUntil: 'domcontentloaded', timeout: pageLoadTimeoutMs })
    await page.pdf({
      path: resolvedOutputPath,
      format: pdfPageFormat,
      printBackground: true,
      margin: pdfPageMargin,
      timeout: pageLoadTimeoutMs,
    })

    return resolvedOutputPath
  } catch (error: unknown) {
    // Show the sandbox hint only for sandbox failures; suggesting PUPPETEER_NO_SANDBOX=true
    // for unrelated errors when the flag is already set would be misleading.
    const fallbackHint = isSandboxLaunchError(error)
      ? ' Chromium sandbox issue detected. You can set PUPPETEER_NO_SANDBOX=true (less secure) or configure a usable sandbox in the host OS.'
      : ''

    throw new Error(
      `Failed to convert DOCX to PDF: ${error instanceof Error ? error.message : String(error)}${fallbackHint}`,
      {
        cause: error,
      }
    )
  } finally {
    if (page) {
      await page.close().catch(() => undefined)
    }
    releaseConversionSlot()
  }
}

export const DocxConverter = {
  convertDocxToPdf,
}
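
The slot queue used by `acquireConversionSlot` can be exercised in isolation. The sketch below is a stand-alone variant rather than the code from this file: `createSlotQueue` and `runDemo` are hypothetical names, the limit is passed as a parameter instead of being read from the environment, and a guard against releasing the same slot twice is added for illustration. A timer stands in for the actual conversion work.

```typescript
// Hypothetical stand-alone version of the slot queue, parameterised by the limit.
const createSlotQueue = (maxParallel: number) => {
  let running = 0
  const pending: Array<() => void> = []

  const acquire = (): Promise<() => void> =>
    new Promise((resolve) => {
      const grantSlot = () => {
        running += 1
        let released = false
        resolve(() => {
          if (released) return // ignore a second release of the same slot
          released = true
          running -= 1
          pending.shift()?.() // hand the freed slot to the oldest waiter, if any
        })
      }
      if (running < maxParallel) grantSlot()
      else pending.push(grantSlot)
    })

  return { acquire }
}

// Simulated workload: returns the highest number of tasks that ran at the same time.
const runDemo = async (taskCount: number, maxParallel: number): Promise<number> => {
  const queue = createSlotQueue(maxParallel)
  let active = 0
  let peak = 0

  const task = async (): Promise<void> => {
    const release = await queue.acquire()
    try {
      active += 1
      peak = Math.max(peak, active)
      await new Promise((resolve) => setTimeout(resolve, 10)) // stand-in for a conversion
    } finally {
      active -= 1
      release()
    }
  }

  await Promise.all(Array.from({ length: taskCount }, () => task()))
  return peak
}
```

With six tasks and a limit of two, `runDemo(6, 2)` resolves to 2: the remaining four tasks wait until an earlier one releases its slot in the `finally` block, which is the same pattern `convertDocxToPdf` follows.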
2 changes: 1 addition & 1 deletion src/utils/index.ts
@@ -1,3 +1,3 @@
export { DocxConverter } from './docxConverter'
export { Requests } from './requests'

export { Responses } from './responses'