Because I’m already sitting on a cluster, and the knowledge is heavy (I keep paying for it because so much time was invested in learning it). The cost is never sunk. ☸️☸️☸️☸️
Crawlee
Amazing tool.
Sadly, I had to use CheerioCrawler instead of PuppeteerCrawler.
BLUF: Running browsers in GKE pods is really hard. They don’t want to connect to the internet.
After exhaustively testing Puppeteer, only vanilla Playwright works. Crawlee does not work. See the section on Playwright below.
Why? The Chrome running inside the pod on my GKE cluster cannot access the internet. Did I misconfigure the launcher args? Is one of these breaking the access? No clue. I searched for a whole day.
[
'--disable-background-timer-throttling',
'--disable-extensions',
'--disable-backgrounding-occluded-windows',
'--disable-ipc-flooding-protection',
'--disable-renderer-backgrounding',
'--no-sandbox',
'--disable-gpu',
'--single-process',
'--disable-setuid-sandbox',
'--no-zygote',
// This will write shared memory files into /tmp instead of /dev/shm,
// because Docker’s default for /dev/shm is 64MB
'--disable-dev-shm-usage',
]
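For context, this is roughly how that flag list gets handed to Puppeteer. A minimal sketch, not the project’s actual launcher code; the URL is just a placeholder.
import puppeteer from 'puppeteer'

// Minimal sketch: pass the flag list above to Puppeteer's launcher.
const browser = await puppeteer.launch({
  headless: true,
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
  ],
})
const page = await browser.newPage()
// In the GKE pod, this navigation is where things hang or time out.
await page.goto('https://example.com', { timeout: 30000 })
await browser.close()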
Even after configuring the pods correctly, Puppeteer errored.
Inside GKE, the following is important in any case.
Pods by default cannot access the internet. They need an egress rule applied.
Cloud NAT is very important. It allows pods on a configured node pool to connect to the internet.
You could create a whole new VPC, subnet, and firewall rules, or you can just change the default one here (the link needs your projectId filled in).
Adding an egress rule on ports 80 and 443 will do the trick.
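Roughly, the gcloud steps look like this. A sketch assuming the default network, the us-central1 region, and made-up resource names (crawler-router, crawler-nat, allow-egress-web); adjust to your project.
# Cloud Router + Cloud NAT so pods on the node pool can reach the internet
gcloud compute routers create crawler-router --network=default --region=us-central1
gcloud compute routers nats create crawler-nat --router=crawler-router --region=us-central1 \
  --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges

# Egress firewall rule on 80 and 443
gcloud compute firewall-rules create allow-egress-web --network=default \
  --direction=EGRESS --action=ALLOW --rules=tcp:80,tcp:443 --destination-ranges=0.0.0.0/0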
Create node-pool
This is a standard one that will have the default VPC network and subnet applied, but --network and --subnet are also options.
gcloud container node-pools create headless --cluster rad1 --disk-size=30GB --num-nodes=1 --enable-autoscaling --max-nodes=1 --min-nodes=1 --enable-autorepair --max-surge-upgrade=1 --max-unavailable-upgrade=0 --machine-type=e2-small --node-taints crawler=true:NoSchedule --node-labels app=crawler --zone=us-central1-c
This node pool has been tainted, so only pods with the matching affinity and tolerations will be scheduled on it.
I only write Tanka.
Here is a complete CronJob library:
{
_cronJob+:: {
new(config): {
local localConfig = std.mergePatch({
timeZone: 'Etc/UTC',
restartPolicy: 'Never',
concurrencyPolicy: 'Replace',
}, config),
my_namespace: {
apiVersion: 'batch/v1',
kind: 'CronJob',
metadata: {
name: localConfig.name,
},
spec: {
jobTemplate: {
spec: {
template: {
spec: {
volumes: [
{
secret: {
secretName: localConfig.gcpSecret,
},
name: localConfig.gcpSecret,
},
],
tolerations: if 'tolerations' in config then config.tolerations else [],
affinity: {
nodeAffinity: {
requiredDuringSchedulingIgnoredDuringExecution: {
nodeSelectorTerms: [
{
matchExpressions: if 'nodeAffinityMatchExpressions' in config then config.nodeAffinityMatchExpressions else [],
},
],
},
},
},
containers: [
{
command: localConfig.command,
env: localConfig.env,
volumeMounts: [
{
mountPath: '/etc/gcp',
name: localConfig.gcpSecret,
readOnly: true,
},
],
name: localConfig.name,
image: localConfig.image,
imagePullPolicy: 'Always',
resources: if 'resources' in config then config.resources else {
limits: {
cpu: '40m',
memory: '2Gi',
},
requests: {
cpu: '20m',
memory: '100Mi',
},
},
},
],
restartPolicy: localConfig.restartPolicy,
},
},
backoffLimit: localConfig.backoffLimit,
activeDeadlineSeconds: localConfig.activeDeadlineSeconds,
},
},
timeZone: localConfig.timeZone,
schedule: localConfig.schedule,
concurrencyPolicy: localConfig.concurrencyPolicy,
},
},
},
},
}
(import 'redis/index.libsonnet') +
(import 'cron-job/index.libsonnet') +
{
_tickle_crawler+:: {
new(): {
local redisConfig = std.mergePatch({
env: $._config.environment,
resources: {
limits: {
cpu: '33m',
memory: '200Mi',
},
requests: {
cpu: '24m',
memory: '50Mi',
},
},
tolerations: [{ key: 'crawler', operator: 'Equal', value: 'true' }],
nodeAffinityMatchExpressions: [{ key: 'app', operator: 'In', values: ['crawler'] }],
size: '50Mi',
args: [
'/etc/redis/redis.conf',
'--protected-mode',
'no',
'--save',
'60 2',
],
name: 'tkl-crwlr',
storageClass: 'rds-tkle-crwlr',
}, {},),
'redis-db': $._redis.new(redisConfig),
cron: $._cronJob.new({
gcpSecret: '',
backoffLimit: 0,
activeDeadlineSeconds: 9000,
restartPolicy: 'Never',
env: [
],
tolerations: [{ key: 'crawler', operator: 'Equal', value: 'true' }],
nodeAffinityMatchExpressions: [{ key: 'app', operator: 'In', values: ['crawler'] }],
command: [
'yarn',
'start',
'--redisHost',
'tkl-crwlr-svc-prod.tkle-crwlr.svc.cluster.local',
'--redisPort',
'6379',
],
image: 'IMAGE',
schedule: '*/50 * * * *',
// schedule: '0 */12 * * *',
name: 'tkle-crwlr',
resources: {
limits: {
cpu: '100m',
memory: '2Gi',
},
requests: {
cpu: '50m',
memory: '900Mi',
},
},
}),
},
},
}
With this applied the Tanka way, the cron jobs run on a node pool reserved just for them.
Playwright
Dockerfile
# syntax = docker/dockerfile:1
# Adjust NODE_VERSION as desired
ARG NODE_VERSION=18.19
ARG TEST_USER="root"
FROM mcr.microsoft.com/playwright:focal as base
# Install Node.js (pick the version you need)
# RUN apt-get update && apt-get install -y curl software-properties-common \
# curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash - \
# && apt-get install -y nodejs
# ENV EXECUTABLE_PATH=/usr/bin/google-chrome
# ENV EXECUTABLE_PATH=/usr/bin/chromium
# Create user
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 playwright
# Setup user (re-declare the ARG inside this stage so the pre-FROM value is in scope here)
ARG TEST_USER
USER ${TEST_USER}
RUN echo "test user: $TEST_USER"
# Node.js app lives here
WORKDIR /app
# https://crawlee.dev/api/core/interface/ConfigurationOptions#availableMemoryRatio
# https://betterprogramming.pub/web-crawler-with-crawlee-and-aws-lambda-223582bdca3e
ENV CRAWLEE_AVAILABLE_MEMORY_RATIO=0.9
# Set production environment
ENV NODE_ENV=production
ENV PUPPETEER_CACHE_DIR=/app/.cache
# Throw-away build stage to reduce size of final image
FROM base as build
# Install packages needed to build node modules
RUN apt-get update -qq && \
apt-get install -y python-is-python3 pkg-config build-essential
# Install node modules
RUN echo $(node -v)
RUN yarn set version berry && touch yarn.lock && touch .yarnrc.yml
COPY ./package.json ./package.json
COPY ./esbuild.ts ./esbuild.ts
RUN yarn config set nodeLinker "node-modules"
# Silent? >/dev/null 2>&1
RUN yarn install
COPY ./entrypoint.sh ./entrypoint.sh
COPY ./tsconfig.json ./tsconfig.json
COPY ./tsconfig.build.json ./tsconfig.build.json
COPY ./configs ./configs
# Copy application code
COPY --link ./src ./src
# Build application
# RUN yarn compile-puppet
# RUN npx @puppeteer/browsers install chrome
# Final stage for app image
FROM base
# Copy built application
COPY --from=build /app /app
# Start the server by default, this can be overwritten at runtime
EXPOSE 3000 80 443
CMD ["yarn"]
Inside a Node script, this works in GKE:
import { chromium } from 'playwright'

// chromeArgs is assumed here to be the same flag list shown earlier;
// at minimum '--no-sandbox' and '--disable-dev-shm-usage' for running in a pod.
const chromeArgs = ['--no-sandbox', '--disable-dev-shm-usage']

// Launch the browser
const browser = await chromium.launch({
headless: true,
args: [...chromeArgs],
})
console.log('Launched!')
// Create a new page
const page = await browser.newPage()
console.log('New page')
// Navigate to Google
await page.goto('https://www.google.com')
// Wait for the title to ensure the page is loaded
console.log(await page.title()) // Should log "Google"
const html = await page.content()
console.log(html)
// Close the browser
await browser.close()
Gotchas
- CRAWLEE_AVAILABLE_MEMORY_RATIO was important so the kube pod’s memory would actually get used. It doesn’t matter so much here, though, because Crawlee doesn’t work in GKE with either Puppeteer or Playwright.
- It seems the instances of CheerioCrawler are not totally isolated: maxRequestsPerCrawl seems to apply to all instances that get spawned in the same Node process…
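If the cross-instance bleed comes from Crawlee’s shared global Configuration and default storage, one thing worth trying (I have not verified it fixes the issue) is giving each crawler its own Configuration instance instead of relying on the global one; a minimal sketch:
import { CheerioCrawler, Configuration } from 'crawlee'

// Sketch: this crawler gets its own Configuration, so options like
// availableMemoryRatio are set explicitly rather than inherited globally.
const crawler = new CheerioCrawler(
  {
    maxRequestsPerCrawl: 10,
    async requestHandler({ request, $ }) {
      console.log(request.url, $('title').text())
    },
  },
  new Configuration({ availableMemoryRatio: 0.9 }),
)

await crawler.run(['https://example.com'])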