Diagnosing Java + Oracle Performance with Dumps and Logs

Performance issues in Java applications backed by Oracle DB often present as generic "slowness." In this post, we walk through a practical workflow: capturing thread dumps, analyzing GC logs, taking heap histograms/dumps, correlating with Oracle AWR/ASH, and mapping symptoms to root causes.


0) Quick Triage Flow (what to grab first)

1. High CPU or “app is hung” right now?
   - Capture 3 thread dumps 10 s apart → analyze for deadlocks, hotspots, blocked JDBC.
   - Grab the GC log segment covering the slowdown.
   - If memory pressure is suspected, trigger a heap histogram (fast), then a heap dump (slower; do it off-peak if the heap is huge).

2. Spikes when calling Oracle?
   - Capture thread dumps (look for oracle.jdbc.* frames waiting on socket I/O).
   - From the DB side, pull AWR/ASH for the same time window.

3. Latency every few minutes?
   - Check the GC log for STW pauses, mixed cycles, promotion failures, and humongous allocations (G1).

4. Steady memory climb?
   - Heap dumps (before/after) → MAT dominator tree → paths to GC roots.

---

1) Safe & repeatable data capture

1.1 Identify the JVM

```bash
jcmd                   # lists running JVMs
ps -ef | grep java     # find the target PID
export PID=<java-pid>
```

1.2 Thread dumps (low overhead)

```bash
jcmd $PID Thread.print > tdump_$(date +%F_%T).txt
```

or

```bash
jstack -l $PID > tdump_$(date +%F_%T).txt
```

Best practice: take 3 dumps 10 seconds apart to distinguish transient vs. persistent states.

1.3 Heap snapshot options

Fast first look (histogram):

```bash
jcmd $PID GC.class_histogram > histo_$(date +%F_%T).txt
```

or

```bash
jmap -histo $PID > histo_$(date +%F_%T).txt
```

Full heap dump (can pause briefly, large files):

```bash
jcmd $PID GC.heap_dump filename=/tmp/heap_$(date +%F_%T).hprof
```

or (Java 8):

```bash
jmap -dump:format=b,file=/tmp/heap_$(date +%F_%T).hprof $PID
```

1.4 GC logs

Java 9+:

```bash
-Xlog:gc,gc+age=trace:file=/var/log/app/gc.log:time,uptime,level,tags
```

Java 8:

```bash
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/app/gc.log -XX:+PrintTenuringDistribution
```

1.5 JFR (Java Flight Recorder)

```bash
jcmd $PID JFR.start name=diag settings=profile filename=/tmp/jfr_$(date +%F_%T).jfr duration=120s
```

1.6 App & OS event logs

- App logs (ERROR/WARN around the incident).
- Container/host logs: dmesg, journalctl, sar, vmstat, iostat, top.
- Connection pool debug logging (timeouts, wait times).

1.7 Oracle DB

- AWR report for the incident window.
- ASH report (fine-grained timeline).
- Top SQL by elapsed time / buffer gets / executions.
- Wait event breakdown.
- Optionally SQL Trace (event 10046) + tkprof.

---

2) Analyzing thread dumps

What to look for

- Thread states: RUNNABLE, BLOCKED, WAITING.
- Hot methods at the top of many stacks.
- Locks/monitors:
  - One monitor owner with many blocked threads → lock contention.
  - Deadlock section (jstack prints it automatically).
- JDBC: oracle.jdbc.driver.* frames or socket reads = DB latency.
- Many threads stuck in getConnection = connection pool starvation.

Rapid checklist

- GC threads dominating? (GC Thread, G1 Conc)
- Many threads in synchronized blocks?
- Workers stuck on CompletableFuture.join() / CountDownLatch.await()?
- N+1 query patterns?
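The deadlock detection that jstack performs is also available in-process through the standard java.lang.management API, which is handy for a health endpoint when shell access is slow to arrange. A minimal sketch (the class name is mine; the API calls are standard):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Finds cycles of threads blocked on monitors or ownable synchronizers,
        // the same cycles jstack reports under "Found one Java-level deadlock".
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            System.out.println("No deadlocked threads.");
            return;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
            System.out.printf("%s waits for %s held by %s%n",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName());
        }
    }
}
```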

---

3) Heap histogram & dump

Histogram first

- Instance counts and shallow sizes of byte[], char[], HashMap, ArrayList, and cache classes (a histogram does not show retained size; take a full dump for that).
- Classloader leaks: many Class / ClassLoader instances.
- Lots of duplicate char[] (consider enabling string deduplication).

Full dump (MAT/VisualVM)

1. Open the .hprof in Eclipse MAT → run Leak Suspects.
2. Dominator Tree: sort by retained heap.
3. Common leak patterns (a sketch follows this list):
   - Unbounded caches.
   - Static maps holding sessions.
   - ThreadLocals never cleared.
   - Large arrays never freed.
   - Classloader leaks after redeploys.
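To make the first three patterns concrete, here is a deliberately leaky sketch (all names hypothetical) of the shapes MAT's dominator tree will surface:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LeakPatterns {
    // Unbounded cache: the static map is reachable from a GC root forever,
    // so every entry it holds shows up under it in the dominator tree.
    static final Map<String, byte[]> CACHE = new ConcurrentHashMap<>();

    // ThreadLocal: the value lives as long as the pooled worker thread does,
    // unless remove() is called after each task.
    static final ThreadLocal<byte[]> BUFFER =
            ThreadLocal.withInitial(() -> new byte[1 << 20]);

    static void handleRequest(String key) {
        CACHE.put(key, new byte[4096]); // never evicted -> steady heap growth
        byte[] buf = BUFFER.get();
        try {
            // ... use buf ...
        } finally {
            BUFFER.remove(); // without this line, the 1 MB buffer leaks per worker thread
        }
    }
}
```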

Memory pressure patterns

- High allocation churn → optimize hot paths.
- G1 humongous objects → tune -XX:G1HeapRegionSize (see the sketch below).
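For context on the humongous case: in G1, a single object of at least half a region size is allocated as humongous, straight into contiguous old-gen regions. A toy sketch that provokes this (run with something like -XX:+UseG1GC -XX:G1HeapRegionSize=4m -Xlog:gc* and look for humongous entries in the log):

```java
public class HumongousDemo {
    public static void main(String[] args) {
        // With 4 MB regions, any array of 2 MB or more is humongous
        // and bypasses the young generation entirely.
        byte[][] keep = new byte[64][];
        for (int i = 0; i < keep.length; i++) {
            keep[i] = new byte[3 * 1024 * 1024]; // 3 MB each
        }
        System.out.println("Holding " + keep.length + " humongous arrays");
    }
}
```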

---

4) GC log analysis

Tools

GCViewer, GCeasy.

Key signals

- STW pause duration & frequency.
- Allocation rate.
- Promotion failures.
- CMS concurrent mode failure / G1 mixed GC storms.
- Humongous allocations.

G1 quick mapping

- Old gen rising across mixed GC cycles → leak or large retained set.
- Long Remark/Cleanup phases → huge live set.
- Frequent Full GCs → leak or undersized heap.

---

5) Application stack traces

- Timeout storms (JDBC/socket).
- OutOfMemoryError or StackOverflowError.
- Retry loops, circuit-breaker opens.
- Profile hot methods with JFR.

---

6) Oracle DB angle

DB-side

- AWR Top Events: db file scattered read, log file sync, row lock contention, cursor: mutex S.
- Top SQL by elapsed time and buffer gets.
- ASH timeline: correlate with app pauses.
- Temp/sort usage: spills to disk.
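If pulling a full ASH report is slow to arrange, a quick look at the top wait events for the incident window can come straight over JDBC. A hedged sketch, assuming you have SELECT access to V$ACTIVE_SESSION_HISTORY (Diagnostics Pack licensed) and Oracle 12c+ for FETCH FIRST; the connection details are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AshTopEvents {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"; // placeholder
        String sql =
                "SELECT event, COUNT(*) AS samples " +
                "FROM v$active_session_history " +
                "WHERE sample_time > SYSTIMESTAMP - INTERVAL '30' MINUTE " +
                "  AND event IS NOT NULL " +
                "GROUP BY event ORDER BY samples DESC " +
                "FETCH FIRST 10 ROWS ONLY";
        try (Connection c = DriverManager.getConnection(url, "diag_user", "***");
             PreparedStatement ps = c.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // ASH samples active sessions roughly once per second,
                // so counts approximate seconds spent in each wait event.
                System.out.printf("%-40s %d%n", rs.getString("event"), rs.getLong("samples"));
            }
        }
    }
}
```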

JVM-side

- Threads waiting in JDBC + high DB waits = the DB is the bottleneck.
- Pool metrics: wait time, active vs. max connections.

Fixes:

- Add indexes, avoid full scans.
- Batch ops, fewer commits (see the sketch below).
- Resize the pool only if the DB has headroom.
- Backpressure and timeouts.
- Use SQL tracing + EXPLAIN PLAN.
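On the batching point, the standard JDBC pattern turns N network round trips into one (table and column are hypothetical):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class BatchInsert {
    static void insertEvents(DataSource ds, List<String> payloads) throws SQLException {
        String sql = "INSERT INTO events (payload) VALUES (?)"; // hypothetical table
        try (Connection c = ds.getConnection();
             PreparedStatement ps = c.prepareStatement(sql)) {
            c.setAutoCommit(false);   // one commit per batch, not per row
            for (String p : payloads) {
                ps.setString(1, p);
                ps.addBatch();
            }
            ps.executeBatch();        // single round trip for the whole batch
            c.commit();
            c.setAutoCommit(true);    // restore before returning the connection to the pool
        }
    }
}
```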

---

7) Symptom → Cause map

| Symptom | Likely Cause | Confirm | Fix |
| --- | --- | --- | --- |
| Long GC pauses | Heap undersized / high allocation | GC log | Right-size heap, tune G1 |
| RSS growth → OOM | Memory leak | MAT dominator tree | Fix leak, add eviction |
| CPU pegged | Hot loop | JFR hotspot | Optimize/caching |
| BLOCKED threads | Lock contention | Thread dumps | Refactor locking |
| JDBC waits | DB bottleneck | Thread dumps + AWR | Index/tune SQL, size pool |
| Frequent Full GC | Old gen pressure | GC logs | Increase heap, reduce long-lived allocation |
| Metaspace OOM | Classloader leak | Heap dump | Fix redeploy pattern |

---

8) Minimal capture script

```bash
#!/usr/bin/env bash
set -euo pipefail

PID="${1:-$(jcmd | awk '/java/{print $1; exit}')}"
STAMP=$(date +%F_%H%M%S)
OUT="diag_$STAMP"; mkdir -p "$OUT"

echo "[*] Thread dumps..."
for i in 1 2 3; do
  jcmd "$PID" Thread.print > "$OUT/tdump_${i}.txt"
  sleep 10
done

echo "[*] Heap histogram..."
jcmd "$PID" GC.class_histogram > "$OUT/histo.txt"

echo "[*] 120s JFR..."
# Use an absolute path: the target JVM resolves relative paths against its own cwd.
jcmd "$PID" JFR.start name=diag settings=profile filename="$PWD/$OUT/prof.jfr" duration=120s

echo "[*] JVM & OS stats..."
jcmd "$PID" VM.flags > "$OUT/vm_flags.txt"
jcmd "$PID" VM.system_properties > "$OUT/sysprops.txt"
(top -b -n 1; free -m; vmstat 1 5; iostat -xz 1 5) > "$OUT/os_snapshot.txt"

echo "Copy GC + app logs into $OUT/"
echo "Done: $OUT"
```

---

9) Quick tuning

JVM (G1 starter flags)

```bash
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=30
-XX:+ParallelRefProcEnabled
-XX:+UseStringDeduplication
```

Connection pools (HikariCP)

```ini
; guidance ranges -- HikariCP itself takes values in milliseconds
maximumPoolSize   = match DB cores
connectionTimeout = 3s-10s
idleTimeout       = 60s-300s
maxLifetime       = 30m
```
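The same settings expressed in code, for apps that build the pool programmatically (values illustrative):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSetup {
    static HikariDataSource newPool(String jdbcUrl, String user, String pass) {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl(jdbcUrl);
        cfg.setUsername(user);
        cfg.setPassword(pass);
        cfg.setMaximumPoolSize(16);            // roughly match DB CPU headroom
        cfg.setConnectionTimeout(5_000);       // ms; fail fast instead of queueing forever
        cfg.setIdleTimeout(120_000);           // ms
        cfg.setMaxLifetime(30 * 60 * 1_000L);  // recycle before the network/DB kills the socket
        return new HikariDataSource(cfg);
    }
}
```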

Resilience

- Per-call timeouts (see the sketch below).
- Bulkheads & circuit breakers.
- Backpressure.
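For per-call timeouts, plain JDBC already gives you the cheapest defense: Statement.setQueryTimeout makes the driver cancel a runaway call instead of pinning a worker thread. A minimal sketch (the query is hypothetical):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class QueryWithTimeout {
    static long countOrders(Connection c) throws SQLException {
        try (PreparedStatement ps = c.prepareStatement("SELECT COUNT(*) FROM orders")) {
            ps.setQueryTimeout(5); // seconds; the driver typically throws SQLTimeoutException on expiry
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}
```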

---

10) Example: 2–4s latency spikes

1. GC log: mixed GC every ~2 min, 1.8 s STW.
2. Histogram: many byte[] from JSON; allocation rate ~1–2 GB/s.
3. JFR hotspot: Jackson serialization.
4. Fix: cache/precompute the JSON → GC pauses < 200 ms, p99 back to normal.

---

11) Example: Throughput tanks at peak

1. Thread dumps: many threads in oracle.jdbc.* socket reads; others waiting on the pool.
2. AWR: db file sequential read, one bad SQL_ID.
3. Plan: full scan due to a missing index.
4. Fix: composite index + gathered stats → pool waits gone, CPU back to normal.

---

12) Tooling shortlist

- JDK: jcmd, jstack, jmap, jstat, JFR/JMC
- Heap: Eclipse MAT, VisualVM
- GC: GCViewer, GCeasy
- Profiling: JFR, async-profiler
- DB: AWR, ASH, ADDM, tkprof, SQL Monitor, EXPLAIN PLAN

---

Wrap-up

Follow the flow: thread dumps → GC logs → heap histograms/dumps → JFR → AWR/ASH. Correlate timestamps ruthlessly. Section 7’s table maps symptoms to root causes and fixes.
