Tabular Data Explorer

NHANES Mortality Covariate Exploration (All Features)

Compact HTML exploration of nhanes_mortality_covariates_all_features.csv with embedded plots, stable layout, and grounded narrative.

Source

nhanes_mortality_covariates_all_features.csv

Format

csv

Auto-excluded

None

Generated

2026-03-29T23:58:33-07:00

Rows

11,171

Total observations loaded.

Analyzed Columns

80

After user ignores and auto-excluded index fields.

Missing Rate

9.2%

Across analyzed cells.

Targets

mortstat

Columns treated as outcomes.

Narrative

Grounded interpretation

NHANES Mortality Covariate Analysis

Cohort framing

This run analyzes 11,171 rows and 80 covariates after excluding SEQN. The target mortstat is imbalanced but not extreme: 8,724 rows (78.1%) are 0 and 2,447 rows (21.9%) are 1. Overall data quality is reasonable at the table level, with 26 duplicate rows (0.23%) and an overall missing-cell rate of 9.2%, but that average hides a sharp difference between the covariate bundles: l2 is complete, l1 averages 7.0% missing, and l3 averages 10.4%.
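The shares quoted above follow directly from the reported counts. A minimal arithmetic check, where the variable names and the helper `pct` are illustrative and not part of the report tooling:

```python
# Counts quoted in the cohort framing; names are illustrative only.
n_rows = 11_171
n_class0 = 8_724     # rows with mortstat == 0
n_class1 = 2_447     # rows with mortstat == 1
n_duplicates = 26

# The two classes should account for every labeled row.
assert n_class0 + n_class1 == n_rows


def pct(part, whole, digits=1):
    """Return part / whole as a percentage rounded to `digits` places."""
    return round(100 * part / whole, digits)


print(pct(n_class0, n_rows))         # 78.1
print(pct(n_class1, n_rows))         # 21.9
print(pct(n_duplicates, n_rows, 2))  # 0.23
```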

This file should also be read as a labeled subset rather than the full covariate source table. The export metadata for this dataset indicates that the wider feature source had 20,470 rows, while only 11,171 rows had valid mortality labels. That matters because the associations below describe the mortality-observed cohort, not everyone in the upstream NHANES feature export.

RIDAGEYR is the strongest single mortality signal in the run with eta-squared 0.381. It is not part of the l1 / l2 / l3 workbook bundles; it is the separately added age feature for the mortality-ready export. In practice it behaves like the anchor covariate against which the staged bundles should be judged.

Level hierarchy

`l1`: best baseline bundle

l1 is the cleanest starting point for staged modeling. Its six features have lower average missingness than l3 (7.0% vs 10.4%), and the strongest l1 mortality signal is BPXSY at eta-squared 0.124. That is materially stronger than the anthropometric features in l1 and makes systolic blood pressure the clearest l1 candidate to carry incremental mortality information beyond age.

The main l1 structural issue is redundancy, not sparsity. BMXBMI and BMXWAIST are strongly correlated at 0.885, so they are likely to compete for similar body-composition signal in an interpretable model. Their distributions also show right tails (BMXBMI skew 1.50, max 130.21; BMXWAIST median 96.0 cm, max 175.0 cm), so robust scaling or outlier checks would matter even in the baseline bundle. BPXSY itself is right-skewed as well (skew 1.26, max 270), which is another reason to inspect its extreme tail before using coefficient-based models.
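One way to read "robust scaling" here is to center each feature on its median and divide by its interquartile range, so that a handful of extreme values does not set the scale. The sketch below assumes that interpretation. The function name `robust_scale` and the sample values are hypothetical and are not taken from the dataset.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined")
    return [(v - median) / iqr for v in values]


# Hypothetical values with one extreme point, loosely shaped like a right tail.
sample = [20, 25, 30, 35, 130]
print(robust_scale(sample))  # [-1.0, -0.5, 0.0, 0.5, 10.0]
```

The extreme value stays visible after scaling (10 IQRs above the median) but no longer compresses the rest of the distribution, which is the property that matters before fitting coefficient-based models.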

If the goal is a first-pass covariate model, l1 looks like the right place to start because it is small, comparatively clean, and clinically interpretable. The main caution is to avoid overcounting body-size information by treating BMXBMI and BMXWAIST as independent wins.

`l2`: strongest second-stage expansion

l2 is the best next layer after l1 because it is complete in this table: all five questionnaire / disease-history variables have 0% missingness. The activity-history block is the clearest l2 contributor to mortality separation: PAD200 scores 0.101, PAD320 0.0415, and PAD020 0.0328. DIQ010 also clears the report's weak-signal floor at 0.0217. At the threshold of eta-squared >= 0.02, l2 contributes four signals, which is much broader coverage than l1.

The caveat is that the workflow inferred these questionnaire variables as numeric because they are stored as integer-coded values in the CSV. That is visible in the profile overview, where 79 of 80 analyzed variables were treated as numeric. The effect sizes are still useful as a screening signal, but DIQ010, PAD020, PAD200, and PAD320 should be recoded as categorical before any final model comparison or coefficient interpretation. In other words, l2 looks genuinely useful, but treating the codes as ordered numbers likely understates any signal that depends on category membership rather than code order.
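Recoding can be as simple as expanding each integer code into its own indicator column. The following is a minimal sketch of that idea. The helper `one_hot` is hypothetical, and the example codes are placeholders rather than values from the NHANES codebook, which should be consulted for the real response levels.

```python
def one_hot(name, codes):
    """Expand integer category codes into one 0/1 indicator list per level.

    None marks a missing response and gets its own indicator column.
    """
    levels = sorted({c for c in codes if c is not None})
    encoded = {
        f"{name}_{level}": [int(c == level) for c in codes]
        for level in levels
    }
    if any(c is None for c in codes):
        encoded[f"{name}_missing"] = [int(c is None) for c in codes]
    return encoded


# Placeholder codes for illustration only.
indicators = one_hot("DIQ010", [1, 2, 2, 3])
for column, values in indicators.items():
    print(column, values)
```

With indicators in place, a model no longer assumes that code 3 is "more" than code 1, which is the assumption the numeric treatment silently makes.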

Given your priority order, l2 is the most defensible second-stage addition: it adds multiple mortality-relevant signals without adding the missingness and multicollinearity burden that comes with l3.

`l3`: highest upside, highest risk

l3 dominates the table numerically and statistically, but it is the hardest bundle to trust naively. It contains 67 laboratory, CBC, and dietary features, contributes 14 of the features with eta-squared >= 0.02, and includes the strongest non-age signal in the run: LBDSBUSI at 0.134. Other notable l3 mortality signals are LBDSCRSI (0.0566), LBXGH (0.0485), LBXRDW (0.0470), LBXSKSI (0.0452), and LBDSGLSI / DRXTKCAL (both 0.0375).

The problem is that l3 also carries almost all of the data-quality and redundancy burden. Its average missing rate is 10.4%, compared with 7.0% for l1 and 0% for l2. The worst missing columns are all l3: LBXSASSI (11.8%), LBXSATSI (11.8%), LBXSC3SI (11.8%), LBDSTPSI (11.2%), and LBDSGBSI (11.2%). The co-missing structure is nearly deterministic for several serum chemistry pairs, including LBXSASSI with LBXSATSI (joint missing rate 11.77%, Jaccard 0.999), LBXSATSI with LBXSC3SI (11.74%, 0.997), and LBDSTPSI with LBDSGBSI (11.19%, 1.000). That pattern looks much more like module-level collection availability than independent feature dropout.

l3 is also where almost all of the strongest linear correlations live. Among pairs with absolute correlation >= 0.75, 14 of 15 are l3-to-l3; the only non-l3 pair is BMXBMI vs BMXWAIST. The strongest examples are DRXTMFAT vs DRXTTFAT (+0.978), LBXHCT vs LBXHGB (+0.974), LBXLYPCT vs LBXNEPCT (-0.938), DRXTSFAT vs DRXTTFAT (+0.937), and DRXTCARB vs DRXTKCAL (+0.890). This is exactly the kind of multicollinearity that can make coefficients in a lab-heavy model unstable unless the bundle is regularized or grouped before interpretation.

Some l3 distributions are also extremely heavy-tailed. LBDSCRSI has median 79.56, max 1573.52, and skew 16.0; LBDSBUSI has median 4.28, max 34.99, and skew 2.76. So even before feature selection, l3 is signaling a need for robust transforms, careful winsorization rules, or at least sensitivity checks on outlier handling.
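Winsorization clips values beyond chosen percentiles instead of dropping rows. The sketch below assumes 1st and 99th percentile cut points, which are an illustrative choice and not a rule stated in this report. The function name `winsorize` and the sample data are likewise hypothetical.

```python
import statistics


def winsorize(values, lower_pct=1, upper_pct=99):
    """Clip values to the given lower and upper percentiles (1 to 99)."""
    # quantiles(..., n=100) returns 99 cut points; index k - 1 is percentile k.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


# Hypothetical data: the integers 1..101, so the percentiles are easy to check.
clipped = winsorize(list(range(1, 102)))
print(min(clipped), max(clipped))  # 2.0 100.0
```

Running the same association or model with and without clipping is one form of the sensitivity check suggested above.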

For your stated preference order, the right interpretation is not that l3 is weak. It is that l3 is powerful but expensive: it offers the most incremental signal candidates, but it should be treated as a third-stage expansion only after the cleaner l1 and l2 bundles are established.

Recommended reading of the staged bundles

The evidence in this run supports a staged strategy:

1. Start with age + l1 as the clean baseline, with special attention to BPXSY and to BMXBMI / BMXWAIST redundancy.
2. Add l2 next, especially the physical-activity variables, because they are complete in this table and show consistent target separation.
3. Add l3 last, and treat it as a structured module rather than a flat list of independent features because its missingness and correlation patterns are bundle-level, not column-level accidents.

If you want the next pass to stay explainable, the main pre-modeling checks should be:

- recode questionnaire fields such as DIQ010, PAD020, PAD200, and PAD320 as categorical rather than leaving them as numeric codes;
- compare l1 models that include either BMXBMI or BMXWAIST before using both together;
- run l3 sensitivity analyses with and without the heaviest-missing serum chemistry block, since several of those columns disappear together;
- prefer grouped regularization or feature-cluster selection inside l3 instead of interpreting raw coefficients from a fully expanded lab-and-diet model.
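Feature-cluster selection can start from the high-correlation pairs themselves. The sketch below groups features into connected components of the "correlated at or above the threshold" graph, using the six pairs named in the l1 and l3 sections of this narrative. The function `correlation_clusters` is a hypothetical helper, and connected components are only one of several reasonable clustering choices.

```python
def correlation_clusters(pairs):
    """Group features into connected components of the high-correlation graph."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for feature in list(parent):
        clusters.setdefault(find(feature), set()).add(feature)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Pairs named in the narrative, all with absolute correlation >= 0.75.
pairs = [
    ("DRXTMFAT", "DRXTTFAT"),
    ("LBXHCT", "LBXHGB"),
    ("LBXLYPCT", "LBXNEPCT"),
    ("DRXTSFAT", "DRXTTFAT"),
    ("DRXTCARB", "DRXTKCAL"),
    ("BMXBMI", "BMXWAIST"),
]
for cluster in correlation_clusters(pairs):
    print(cluster)
```

On these pairs the three fat-intake variables fall into a single cluster because both DRXTMFAT and DRXTSFAT are tied to DRXTTFAT. Keeping one representative per cluster, or regularizing each cluster as a group, is then a modeling decision rather than something the sketch decides.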

Overview

Start with the structural shape of the dataset before diving into quality or relationships.

Variable role balance

How the dataset breaks down across numeric, categorical, boolean, datetime, text, and ID-like columns.

Why: This gives fast context on what kinds of analysis are appropriate for the rest of the report. Read: Large numeric or categorical shares suggest what analyses will dominate later sections. Caution: Role inference is heuristic, especially for messy object columns.

Data Quality

Focus on sparsity and structural quality risks before trusting downstream relationships.

Columns with the most missingness

Sorted view of the worst missing-rate columns.

Why: Missingness is often the fastest way to spot data collection gaps or unreliable features. Read: Compare both the rate and the ordering; a few bad columns matter differently from broad low-grade sparsity. Caution: Low missingness can still be harmful if it is concentrated in target-critical rows.

Co-missing column pairs

Pairs of columns that tend to go missing together.

Why: Co-missing patterns often reveal shared upstream failures or optional data capture paths. Read: Higher joint missingness and higher Jaccard values indicate tighter alignment in absence patterns. Caution: A pair can look strong simply because both columns are rare overall.

Term note: Jaccard overlap means the share of missing rows the two columns have in common among rows where either one is missing.
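The definition in the term note can be written out directly. This is a minimal sketch that represents missing cells as `None`. The function names are illustrative, and the report's own implementation may differ in how it detects missing values.

```python
def joint_missing_rate(col_a, col_b):
    """Share of all rows where both columns are missing."""
    both = sum(a is None and b is None for a, b in zip(col_a, col_b))
    return both / len(col_a)


def missing_jaccard(col_a, col_b):
    """Rows missing in both columns divided by rows missing in either."""
    both = sum(a is None and b is None for a, b in zip(col_a, col_b))
    either = sum(a is None or b is None for a, b in zip(col_a, col_b))
    return both / either if either else 0.0


# Hypothetical five-row example: two rows missing in both, one in col_a only.
col_a = [1.0, None, None, 4.0, None]
col_b = [1.0, None, None, 4.0, 5.0]
print(joint_missing_rate(col_a, col_b))  # 0.4
print(missing_jaccard(col_a, col_b))
```

A Jaccard value near 1 with a sizeable joint rate means the two columns are almost always missing on the same rows, which is the pattern that points to shared collection availability.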

Relationships

Surface the strongest variable-to-variable structure before narrowing to one target.

Strongest numeric correlations

Top Pearson correlation pairs among numeric columns after auto-excluding index-like fields.

Why: This surfaces linear relationships, redundancy, and potential leakage candidates. Read: Bars near 1 or -1 indicate stronger positive or negative linear co-movement. Caution: Correlation is linear and pairwise; it misses nonlinear effects and can be distorted by outliers.

Term note: Pearson correlation ranges from -1 to 1 and summarizes straight-line association between two numeric variables.
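For reference, a small standard-library sketch of Pearson correlation and of filtering pairs by an absolute threshold. The helper names and sample columns are hypothetical. The sketch also assumes complete columns with no missing values, whereas a real run would need a rule for incomplete rows.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # undefined for a constant column
    return cov / (spread_x * spread_y)


def strong_pairs(columns, threshold=0.75):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold, strongest first."""
    hits = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            hits.append((name_a, name_b, r))
    return sorted(hits, key=lambda hit: abs(hit[2]), reverse=True)


# Hypothetical columns: "a" and "b" move together, "c" alternates.
columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.5], "c": [1, -1, 1, -1]}
for name_a, name_b, r in strong_pairs(columns):
    print(name_a, name_b, round(r, 3))
```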

High-correlation feature map

2D correlation view for the numeric features that repeatedly appear in the strongest linear pairs.

Why: When strong linear relationships are dense rather than isolated, a 2D map makes multicollinearity structure easier to inspect. Read: Look for dark blocks or clusters: they indicate groups of features that move together and may be partially redundant. Caution: This is still pairwise linear correlation, so it does not prove causality or replace domain review.

Term note: This heatmap is intended for multicollinearity screening, which is especially useful before explainable regression or survival modeling.

Target Exploration

Highlight how the declared targets relate to the rest of the dataset.

Top associations for target 'mortstat'

A ranked view of the strongest target-feature relationships in this run.

Why: This makes target-aware exploration concrete instead of leaving it as generic EDA. Read: Higher scores mean a stronger association under the method shown for each feature. Caution: Scores are descriptive, not causal, and methods differ between numeric and categorical features.

Term note: Eta-squared is a 0 to 1 effect-size score showing how much target separation is explained by a feature.
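The report does not spell out its exact formula, so the following sketch assumes the standard one-way ANOVA definition: between-group sum of squares divided by total sum of squares, with the target classes as groups. The function name and the sample data are hypothetical.

```python
def eta_squared(values, groups):
    """SS_between / SS_total for a numeric feature split by a grouping target.

    Rows where the feature is None are skipped.
    """
    rows = [(v, g) for v, g in zip(values, groups) if v is not None]
    grand_mean = sum(v for v, _ in rows) / len(rows)
    ss_total = sum((v - grand_mean) ** 2 for v, _ in rows)

    by_group = {}
    for v, g in rows:
        by_group.setdefault(g, []).append(v)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    return ss_between / ss_total if ss_total else 0.0


# Hypothetical feature values for six rows, three per target class.
feature = [1, 2, 3, 7, 8, 9]
target = [0, 0, 0, 1, 1, 1]
print(round(eta_squared(feature, target), 3))  # 0.931
```

In this toy example the group means are 2 and 8 around a grand mean of 5, so the between-group sum of squares is 54 out of a total of 58. Real covariates in this run score far lower, which is why values such as 0.02 are treated as a screening floor.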

Priority Distributions

Zoom in on the columns most likely to drive interpretation in this dataset.

Distribution of 'INDHHINC'

Detailed view for the numeric column 'INDHHINC'.

Why: Inspect the shape, spread, and outliers in 'INDHHINC'. Read: Look for skew, multi-modality, and whether the box plot shows extreme values relative to the central mass. Caution: Heavy tails or clipping can make averages misleading; use this with quantiles, not in isolation.

Distribution of 'RIAGENDR'

Detailed view for the numeric column 'RIAGENDR'.

Why: Inspect the shape, spread, and outliers in 'RIAGENDR'. Read: Look for skew, multi-modality, and whether the box plot shows extreme values relative to the central mass. Caution: Heavy tails or clipping can make averages misleading; use this with quantiles, not in isolation.

Distribution of 'BMXBMI'

Detailed view for the numeric column 'BMXBMI'.

Why: Inspect the shape, spread, and outliers in 'BMXBMI'. Read: Look for skew, multi-modality, and whether the box plot shows extreme values relative to the central mass. Caution: Heavy tails or clipping can make averages misleading; use this with quantiles, not in isolation.

Distribution of 'BMXWAIST'

Detailed view for the numeric column 'BMXWAIST'.

Why: Inspect the shape, spread, and outliers in 'BMXWAIST'. Read: Look for skew, multi-modality, and whether the box plot shows extreme values relative to the central mass. Caution: Heavy tails or clipping can make averages misleading; use this with quantiles, not in isolation.

Distribution of 'BPXDI'

Detailed view for the numeric column 'BPXDI'.

Why: Inspect the shape, spread, and outliers in 'BPXDI'. Read: Look for skew, multi-modality, and whether the box plot shows extreme values relative to the central mass. Caution: Heavy tails or clipping can make averages misleading; use this with quantiles, not in isolation.

Distribution of 'BPXSY'

Detailed view for the numeric column 'BPXSY'.

Why: Inspect the shape, spread, and outliers in 'BPXSY'. Read: Look for skew, multi-modality, and whether the box plot shows extreme values relative to the central mass. Caution: Heavy tails or clipping can make averages misleading; use this with quantiles, not in isolation.