library(targeter)
#> Loading required package: data.table
library(data.table)
Introduction
This vignette demonstrates two powerful features of the targeter package that enhance standardization and professional reporting:
- Variable Naming Conventions: using standardized prefixes to indicate variable types
- Metadata Integration: using descriptive labels for variables in reports
These features are particularly valuable in enterprise environments where consistency and professional presentation are essential.
Variable Naming Conventions
### The Naming Convention System
The targeter package supports a precise naming convention where the first two characters of a variable name (a letter followed by an underscore) indicate the variable type:
| Prefix | Variable Type | Example |
|---|---|---|
| N_, M_, F_, R_, P_, Y_, J_ | Numeric variables | N_AGE |
| D_, S_ | Date variables | D_BIRTHDATE |
| C_, T_, L_ | Categorical variables | C_EDUCATION |
| Z_ | Target variables | Z_ABOVE50K |
| O_ | Ordinal variables | O_RISK_LEVEL |
| I_ | ID variables | I_CUSTOMER_ID |
This convention provides several benefits:
- Automatic variable type detection
- Standardized documentation
- Clearer communication in team settings
- Optimized statistical processing
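As a rough illustration of how a prefix can be mapped to a variable type, the base R sketch below builds a lookup from the table above. The names `prefix_types` and `type_from_name()` are hypothetical helpers written for this example; they are not part of the targeter API, and the package's own detection logic may differ.

```r
# Hypothetical lookup built from the prefix table above (not targeter's internals)
prefix_types <- c(
  N = "numeric", M = "numeric", F = "numeric", R = "numeric",
  P = "numeric", Y = "numeric", J = "numeric",
  D = "date", S = "date",
  C = "categorical", T = "categorical", L = "categorical",
  Z = "target", O = "ordinal", I = "id"
)

# Return the type implied by each variable name, or NA when the name
# does not start with an uppercase letter followed by an underscore
type_from_name <- function(var_names) {
  has_prefix <- grepl("^[A-Z]_", var_names)
  first_letter <- substr(var_names, 1, 1)
  out <- unname(prefix_types[first_letter])
  out[!has_prefix] <- NA_character_
  out
}

type_from_name(c("N_AGE", "C_EDUCATION", "Z_ABOVE50K", "AGE"))
```

A name without a recognized prefix, such as `AGE`, yields `NA`, which makes unconverted variables easy to spot.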
Exploring the Adult Dataset
Let’s examine the standard adult dataset included with the package:
# Load the adult dataset
data(adult)
# Look at structure
str(adult[, 1:6])
#> 'data.frame': 32561 obs. of 6 variables:
#> $ AGE : int 39 50 38 53 28 37 49 52 31 42 ...
#> $ WORKCLASS : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
#> $ FNLWGT : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
#> $ EDUCATION : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
#> $ EDUCATIONNUM : int 13 13 9 7 13 14 5 9 14 13 ...
#> $ MARITALSTATUS: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
Creating Metadata for the Adult Dataset
For effective reporting, we’ll create a metadata table containing:
- Original variable names
- Appropriate naming convention prefixes
- Human-readable labels for reporting
In practice, such metadata is usually already available, stored in a database or a file, and can be loaded into your R environment. It is also possible to generate similar metadata programmatically as you create variables, for instance during feature engineering in a modeling pipeline.
# Create metadata table
adult_metadata <- data.frame(
  variable = names(adult),
  prefix = c(
    "N", # AGE - numeric
    "C", # WORKCLASS - categorical
    "N", # FNLWGT - numeric
    "C", # EDUCATION - categorical
    "N", # EDUCATIONNUM - numeric
    "C", # MARITALSTATUS - categorical
    "C", # OCCUPATION - categorical
    "C", # RELATIONSHIP - categorical
    "C", # RACE - categorical
    "C", # SEX - categorical
    "N", # CAPITALGAIN - numeric
    "N", # CAPITALLOSS - numeric
    "N", # HOURSPERWEEK - numeric
    "C", # NATIVECOUNTRY - categorical
    "Z"  # ABOVE50K - binary target
  ),
  label = c(
    "Age (years)",
    "Employment sector",
    "Final sampling weight",
    "Educational attainment",
    "Years of education",
    "Marital status",
    "Professional occupation",
    "Family relationship",
    "Race/ethnicity",
    "Gender",
    "Capital gain in $",
    "Capital loss in $",
    "Hours worked per week",
    "Country of origin",
    "Income above $50K"
  )
)
# Display metadata
knitr::kable(adult_metadata)
| variable | prefix | label |
|---|---|---|
| AGE | N | Age (years) |
| WORKCLASS | C | Employment sector |
| FNLWGT | N | Final sampling weight |
| EDUCATION | C | Educational attainment |
| EDUCATIONNUM | N | Years of education |
| MARITALSTATUS | C | Marital status |
| OCCUPATION | C | Professional occupation |
| RELATIONSHIP | C | Family relationship |
| RACE | C | Race/ethnicity |
| SEX | C | Gender |
| CAPITALGAIN | N | Capital gain in $ |
| CAPITALLOSS | N | Capital loss in $ |
| HOURSPERWEEK | N | Hours worked per week |
| NATIVECOUNTRY | C | Country of origin |
| ABOVE50K | Z | Income above $50K |
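When variables are created in code, a metadata row can be recorded at the same time. The sketch below illustrates this with a small toy table that mirrors the columns of `adult_metadata`. The helper `add_metadata()` and the derived variable `CAPITALNET` are hypothetical and exist only for this example.

```r
# Hypothetical helper: append one row to a metadata table with the
# same columns as adult_metadata (variable, prefix, label)
add_metadata <- function(metadata, variable, prefix, label) {
  rbind(
    metadata,
    data.frame(variable = variable, prefix = prefix, label = label,
               stringsAsFactors = FALSE)
  )
}

# Toy data standing in for two columns of the adult dataset
toy <- data.frame(CAPITALGAIN = c(2174, 0, 0), CAPITALLOSS = c(0, 0, 1902))
toy_metadata <- data.frame(
  variable = c("CAPITALGAIN", "CAPITALLOSS"),
  prefix = c("N", "N"),
  label = c("Capital gain in $", "Capital loss in $"),
  stringsAsFactors = FALSE
)

# Feature engineering step: create the variable and document it together
toy$CAPITALNET <- toy$CAPITALGAIN - toy$CAPITALLOSS
toy_metadata <- add_metadata(
  toy_metadata, "CAPITALNET", "N", "Net capital gain in $"
)
toy_metadata
```

Keeping the two steps side by side reduces the chance that a new variable reaches a report without a prefix or a label.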
Applying the Naming Convention
Now we’ll programmatically rename the variables according to our naming conventions:
# Create a renamed copy of the adult dataset
adult_renamed <- as.data.table(adult)
# Function to create new variable names
create_conventional_name <- function(original_name, prefix) {
  paste0(prefix, "_", original_name)
}
# Rename the variables based on metadata
# (seq_len() is safe even if the metadata table has zero rows)
for (i in seq_len(nrow(adult_metadata))) {
  old_name <- adult_metadata$variable[i]
  new_name <- create_conventional_name(old_name, adult_metadata$prefix[i])
  setnames(adult_renamed, old_name, new_name)
}
# Check the first few renamed variables
head(names(adult_renamed), 6)
#> [1] "N_AGE" "C_WORKCLASS" "N_FNLWGT" "C_EDUCATION"
#> [5] "N_EDUCATIONNUM" "C_MARITALSTATUS"
Validating Naming Conventions
The check_naming_conventions() function validates that variable prefixes match their actual data types:
# Check whether our naming conventions are correctly applied
naming_check <- check_naming_conventions(adult_renamed)
head(naming_check)
#> ErrorType ErrorTarget
#> N_AGE FALSE NA
#> C_WORKCLASS FALSE NA
#> N_FNLWGT FALSE NA
#> C_EDUCATION FALSE NA
#> N_EDUCATIONNUM FALSE NA
#> C_MARITALSTATUS FALSE NA
# Count any errors in naming
sum(naming_check$ErrorType, na.rm = TRUE)
#> [1] 0
A value of FALSE for ErrorType means the naming convention is correctly followed for that variable.
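To make the idea behind such a check concrete, here is a minimal base R sketch that compares a prefix with a column's class. It only covers the N_ and C_ prefixes, and `prefix_matches_class()` is a hypothetical function for illustration. It is not how check_naming_conventions() is implemented, and its output format is different.

```r
# Hypothetical check: TRUE when the prefix and the column class agree,
# FALSE when they disagree, NA when the prefix is not covered here
prefix_matches_class <- function(data) {
  vapply(names(data), function(nm) {
    column <- data[[nm]]
    switch(
      substr(nm, 1, 2),
      "N_" = is.numeric(column),
      "C_" = is.factor(column) || is.character(column),
      NA
    )
  }, logical(1))
}

toy <- data.frame(
  N_AGE = c(39L, 50L, 38L),
  C_SEX = factor(c("Male", "Male", "Female")),
  N_EDUCATION = c("Bachelors", "HS-grad", "Masters"),  # deliberately mislabeled
  stringsAsFactors = FALSE
)
prefix_matches_class(toy)
```

The mislabeled `N_EDUCATION` column is the only one flagged, because it carries a numeric prefix but holds character values.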
Using Naming Conventions with targeter
Let’s run a target analysis using our conventionally named variables:
# Run targeter with naming conventions enabled
tar_with_naming <- targeter(
  data = adult_renamed,
  target = "Z_ABOVE50K",
  naming_conventions = TRUE,
  verbose = FALSE
)
#>
#> INFO:target Z_ABOVE50K detected as type: binary
#> INFO:binary target contains number, automatic chosen level: 1; override using `target_reference_level`
# Examine variable classifications
head(tar_with_naming$variables[, c("variable", "var_type", "respect_naming_convention")])
#> variable var_type respect_naming_convention
#> <char> <char> <lgcl>
#> 1: C_EDUCATION categorical TRUE
#> 2: C_MARITALSTATUS categorical TRUE
#> 3: C_NATIVECOUNTRY categorical TRUE
#> 4: C_OCCUPATION categorical TRUE
#> 5: C_RACE categorical TRUE
#> 6: C_RELATIONSHIP categorical TRUE
By setting naming_conventions = TRUE, the targeter() function:
- Automatically detects variable types based on prefixes
- Validates that prefixes match actual data types
- Optimizes processing for each variable type
- Provides a more consistent analysis output
Using Metadata in Reports
The metadata we created can be used to generate professional reports with descriptive labels.
This is as simple as passing the metadata we just created to the report() function through its metadata argument.
# Generate a report with metadata
report(
  tar_with_naming,
  metadata = adult_metadata,
  title = "Income Analysis using Naming Conventions",
  author = "Data Science Team"
)
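Conceptually, metadata-driven labelling amounts to a lookup from variable names to labels. The sketch below shows that idea in base R on a small stand-in metadata table. `label_for()` is a hypothetical helper, and how report() resolves labels internally is not described in this vignette.

```r
# Small stand-in for adult_metadata
toy_metadata <- data.frame(
  variable = c("AGE", "EDUCATION", "ABOVE50K"),
  prefix = c("N", "C", "Z"),
  label = c("Age (years)", "Educational attainment", "Income above $50K"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up labels by original or prefixed name,
# falling back to the variable name itself when no label is found
label_for <- function(var_names, metadata) {
  prefixed <- paste0(metadata$prefix, "_", metadata$variable)
  idx <- match(var_names, c(metadata$variable, prefixed))
  labels <- rep(metadata$label, 2)[idx]
  ifelse(is.na(labels), var_names, labels)
}

label_for(c("N_AGE", "EDUCATION", "C_UNKNOWN"), toy_metadata)
```

Falling back to the raw name keeps output readable when a variable is missing from the metadata table.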