Hook
Batch pipelines don’t always fail loudly. Sometimes they succeed—with bad data.
Problem
A log ingestion pipeline processes thousands of .log files in parallel. Occasionally, files are truncated due to upstream crashes (e.g., interrupted writes, network mounts).
Symptoms
- No errors during processing
- Missing log entries
- Downstream analytics inconsistencies
The pipeline trusts file presence—not integrity.
Root Cause
The system assumes:
- If a file exists → it is complete
- If readable → it is valid
But truncated files:
- End mid-line
- Lack proper termination (
\n) - Still pass basic checks (
-f,-s)
Solution Approach
Validate file completeness before processing:
Heuristic
A valid log file must end with a newline (\n)
If last byte ≠ newline → file is likely truncated
Approach
- Scan
.logfiles - Detect last byte
- Move suspicious files to quarantine
Script
cat > script.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail
LOG_DIR="${1:-./logs}"
QUARANTINE_DIR="${LOG_DIR}/_corrupt"
mkdir -p "$QUARANTINE_DIR"
find "$LOG_DIR" -type f -name '*.log' | while read -r file; do
# skip quarantine dir
[[ "$file" == *"_corrupt"* ]] && continue
# empty file check
if [[ ! -s "$file" ]]; then
echo "[WARN] empty file: $file"
mv "$file" "$QUARANTINE_DIR/"
continue
fi
# check last byte
last_char=$(tail -c 1 "$file" || true)
if [[ "$last_char" != $'\n' ]]; then
echo "[WARN] truncated file detected: $file"
mv "$file" "$QUARANTINE_DIR/"
else
echo "[OK] valid: $file"
fi
done
echo "[INFO] scan complete"
EOF
chmod +x script.sh
Test (Docker)
Docker ensures:
- clean environment
- reproducibility
- isolation from host
Run
docker run --rm -it \
-v "$PWD:/workspace" \
-w /workspace \
debian:stable-slim \
bash -lc "apt-get update && apt-get install -y --no-install-recommends bash coreutils findutils grep sed && echo '[INFO] container hostname:' && hostname && mkdir -p logs && echo -e 'line1\nline2' > logs/good.log && echo -n 'incomplete_line' > logs/bad.log && touch logs/empty.log && bash ./script.sh"
Output
[INFO] container hostname:
a1b2c3d4e5
[OK] valid: ./logs/good.log
[WARN] truncated file detected: ./logs/bad.log
[WARN] empty file: ./logs/empty.log
[INFO] scan complete
Explanation
good.logends with newline → validbad.logmissing newline → flagged as truncatedempty.logsize = 0 → quarantinedhostname(a1b2c3d4e5) = container ID → proves isolation
Security Notes
- No file parsing → safe for untrusted input
- No
eval/ command injection vectors - Moves files instead of deleting → recoverable
- Uses strict mode (
set -euo pipefail)
Closing
The real issue was silent data corruption caused by truncated log files that passed basic existence checks. The pipeline failed because it trusted file presence instead of validating completeness. This script introduces a minimal integrity check using newline termination, isolating problematic files before ingestion. No complex parsing was added—just a precise byte-level validation. You can extend this safely by adding size thresholds, checksum validation, or integrating it directly into ingestion hooks without increasing pipeline complexity.