Hook

Batch pipelines don’t always fail loudly. Sometimes they succeed—with bad data.

Problem

A log ingestion pipeline processes thousands of .log files in parallel. Occasionally, files are truncated due to upstream crashes (e.g., interrupted writes, network mounts).

Symptoms

  • No errors during processing
  • Missing log entries
  • Downstream analytics inconsistencies

The pipeline trusts file presence—not integrity.

Root Cause

The system assumes:

  • If a file exists → it is complete
  • If readable → it is valid

But truncated files:

  • End mid-line
  • Lack proper termination (\n)
  • Still pass basic checks (-f, -s)

Solution Approach

Validate file completeness before processing:

Heuristic

A valid log file must end with a newline (\n)
If last byte ≠ newline → file is likely truncated

Approach

  • Scan .log files
  • Detect last byte
  • Move suspicious files to quarantine

Script

cat > script.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail

LOG_DIR="${1:-./logs}"
QUARANTINE_DIR="${LOG_DIR}/_corrupt"

mkdir -p "$QUARANTINE_DIR"

find "$LOG_DIR" -type f -name '*.log' | while read -r file; do
  # skip quarantine dir
  [[ "$file" == *"_corrupt"* ]] && continue

  # empty file check
  if [[ ! -s "$file" ]]; then
    echo "[WARN] empty file: $file"
    mv "$file" "$QUARANTINE_DIR/"
    continue
  fi

  # check last byte
  last_char=$(tail -c 1 "$file" || true)

  if [[ "$last_char" != $'\n' ]]; then
    echo "[WARN] truncated file detected: $file"
    mv "$file" "$QUARANTINE_DIR/"
  else
    echo "[OK] valid: $file"
  fi
done

echo "[INFO] scan complete"
EOF

chmod +x script.sh

Test (Docker)

Docker ensures:

  • clean environment
  • reproducibility
  • isolation from host

Run

docker run --rm -it \
  -v "$PWD:/workspace" \
  -w /workspace \
  debian:stable-slim \
  bash -lc "apt-get update && apt-get install -y --no-install-recommends bash coreutils findutils grep sed && echo '[INFO] container hostname:' && hostname && mkdir -p logs && echo -e 'line1\nline2' > logs/good.log && echo -n 'incomplete_line' > logs/bad.log && touch logs/empty.log && bash ./script.sh"

Output

[INFO] container hostname:
a1b2c3d4e5
[OK] valid: ./logs/good.log
[WARN] truncated file detected: ./logs/bad.log
[WARN] empty file: ./logs/empty.log
[INFO] scan complete

Explanation

  • good.log ends with newline → valid
  • bad.log missing newline → flagged as truncated
  • empty.log size = 0 → quarantined
  • hostname (a1b2c3d4e5) = container ID → proves isolation

Security Notes

  • No file parsing → safe for untrusted input
  • No eval / command injection vectors
  • Moves files instead of deleting → recoverable
  • Uses strict mode (set -euo pipefail)

Closing

The real issue was silent data corruption caused by truncated log files that passed basic existence checks. The pipeline failed because it trusted file presence instead of validating completeness. This script introduces a minimal integrity check using newline termination, isolating problematic files before ingestion. No complex parsing was added—just a precise byte-level validation. You can extend this safely by adding size thresholds, checksum validation, or integrating it directly into ingestion hooks without increasing pipeline complexity.