Naming Conventions and Metadata • targeter

Introduction

This vignette demonstrates two powerful features of the targeter package that enhance standardization and professional reporting:

Variable Naming Conventions - Using standardized prefixes to indicate variable types
Metadata Integration - Using descriptive labels for variables in reports

These features are particularly valuable in enterprise environments where consistency and professional presentation are essential.

Variable Naming Conventions

The Naming Convention System

The targeter package supports a naming convention in which the first two characters of a variable name (a letter followed by an underscore) indicate the variable type:

Prefix	Variable Type	Example
N_, M_, F_, R_, P_, Y_, J_	Numeric variables	N_AGE
D_, S_	Date variables	D_BIRTHDATE
C_, T_, L_	Categorical variables	C_EDUCATION
Z_	Target variables	Z_ABOVE50K
O_	Ordinal variables	O_RISK_LEVEL
I_	ID variables	I_CUSTOMER_ID

This convention provides several benefits:

Automatic variable type detection
Standardized documentation
Clearer communication in team settings
Optimized statistical processing
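
To make the idea of prefix-based type detection concrete, here is a minimal base R sketch that maps the prefixes from the table above to their declared types. The names `prefix_types` and `declared_type` are hypothetical helpers written for this illustration; they are not part of targeter and do not reflect its internal implementation.

```r
# Hypothetical lookup table built from the prefix table above;
# not targeter's internal implementation.
prefix_types <- c(
  N = "numeric", M = "numeric", F = "numeric", R = "numeric",
  P = "numeric", Y = "numeric", J = "numeric",
  D = "date", S = "date",
  C = "categorical", T = "categorical", L = "categorical",
  Z = "target", O = "ordinal", I = "id"
)

# Return the declared type for each name, or NA when the name does not
# start with an uppercase letter followed by an underscore.
declared_type <- function(var_names) {
  has_prefix <- grepl("^[A-Z]_", var_names)
  out <- rep(NA_character_, length(var_names))
  out[has_prefix] <- prefix_types[substr(var_names[has_prefix], 1, 1)]
  unname(out)
}

declared_type(c("N_AGE", "D_BIRTHDATE", "Z_ABOVE50K", "AGE"))
#> [1] "numeric" "date"    "target"  NA
```

A name without a recognized prefix, such as AGE, yields NA, which is one simple way to flag variables that have not yet been brought under the convention.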

library(targeter)
#> Loading required package: data.table
library(data.table)

Exploring the Adult Dataset

Let’s examine the standard adult dataset included with the package:

# Load the adult dataset
data(adult)

# Look at structure
str(adult[, 1:6])
#> 'data.frame':    32561 obs. of  6 variables:
#>  $ AGE          : int  39 50 38 53 28 37 49 52 31 42 ...
#>  $ WORKCLASS    : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
#>  $ FNLWGT       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
#>  $ EDUCATION    : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
#>  $ EDUCATIONNUM : int  13 13 9 7 13 14 5 9 14 13 ...
#>  $ MARITALSTATUS: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...

Creating Metadata for the Adult Dataset

For effective reporting, we’ll create a metadata table containing:

Original variable names
Appropriate naming convention prefixes
Human-readable labels for reporting

In practice, users typically already have such metadata stored in a database or a file and can load it into their R environment. Similar metadata can also be generated programmatically as variables are created, for instance during feature engineering in a modeling pipeline.

# Create metadata table
adult_metadata <- data.frame(
  variable = names(adult),
  prefix = c(
    "N", # AGE - numeric
    "C", # WORKCLASS - categorical
    "N", # FNLWGT - numeric
    "C", # EDUCATION - categorical
    "N", # EDUCATIONNUM - numeric
    "C", # MARITALSTATUS - categorical
    "C", # OCCUPATION - categorical
    "C", # RELATIONSHIP - categorical
    "C", # RACE - categorical
    "C", # SEX - categorical
    "N", # CAPITALGAIN - numeric
    "N", # CAPITALLOSS - numeric
    "N", # HOURSPERWEEK - numeric
    "C", # NATIVECOUNTRY - categorical
    "Z"  # ABOVE50K - binary target
  ),
  label = c(
    "Age (years)",
    "Employment sector",
    "Final sampling weight",
    "Educational attainment",
    "Years of education",
    "Marital status",
    "Professional occupation",
    "Family relationship",
    "Race/ethnicity",
    "Gender",
    "Capital gain in $",
    "Capital loss in $",
    "Hours worked per week",
    "Country of origin",
    "Income above $50K"
  )
)

# Display metadata
knitr::kable(adult_metadata)

variable	prefix	label
AGE	N	Age (years)
WORKCLASS	C	Employment sector
FNLWGT	N	Final sampling weight
EDUCATION	C	Educational attainment
EDUCATIONNUM	N	Years of education
MARITALSTATUS	C	Marital status
OCCUPATION	C	Professional occupation
RELATIONSHIP	C	Family relationship
RACE	C	Race/ethnicity
SEX	C	Gender
CAPITALGAIN	N	Capital gain in $
CAPITALLOSS	N	Capital loss in $
HOURSPERWEEK	N	Hours worked per week
NATIVECOUNTRY	C	Country of origin
ABOVE50K	Z	Income above $50K
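
Metadata of this shape can also be derived from the data itself, for example right after a feature engineering step. The sketch below uses a toy data frame with made-up columns and a deliberately simplified rule (dates get D, numbers get N, everything else gets C); both the `features` table and the `guess_prefix` helper are assumptions for illustration, and the generated labels are placeholders meant to be reviewed by hand.

```r
# Toy data standing in for freshly engineered features (hypothetical columns).
features <- data.frame(
  INCOME = c(1200.5, 3400, 2100),
  SEGMENT = factor(c("a", "b", "a")),
  SIGNUP = as.Date(c("2020-01-01", "2021-06-15", "2022-03-30")),
  stringsAsFactors = FALSE
)

# Simplified rule (an assumption): dates -> D, numbers -> N, everything else -> C.
guess_prefix <- function(x) {
  if (inherits(x, "Date")) "D" else if (is.numeric(x)) "N" else "C"
}

# Placeholder label: the column name with only its first letter capitalized.
simple_label <- function(n) {
  paste0(toupper(substring(n, 1, 1)), tolower(substring(n, 2)))
}

feature_metadata <- data.frame(
  variable = names(features),
  prefix = unname(vapply(features, guess_prefix, character(1))),
  label = simple_label(names(features)),
  stringsAsFactors = FALSE
)
feature_metadata
```

A rule this coarse cannot tell a target or an ID column from an ordinary numeric one, so prefixes such as Z or I still have to be assigned explicitly.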

Applying the Naming Convention

Now we’ll programmatically rename the variables according to our naming conventions:

# Create a renamed copy of the adult dataset
adult_renamed <- as.data.table(adult)

# Function to create new variable names
create_conventional_name <- function(original_name, prefix) {
  paste0(prefix, "_", original_name)
}

# Rename the variables based on metadata
for (i in seq_len(nrow(adult_metadata))) {
  old_name <- adult_metadata$variable[i]
  new_name <- create_conventional_name(old_name, adult_metadata$prefix[i])
  setnames(adult_renamed, old_name, new_name)
}

# Check the first few renamed variables
head(names(adult_renamed), 6)
#> [1] "N_AGE"           "C_WORKCLASS"     "N_FNLWGT"        "C_EDUCATION"    
#> [5] "N_EDUCATIONNUM"  "C_MARITALSTATUS"
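
The same renaming can be done without a loop. The following base R sketch works on a small stand-in data frame (the `toy` and `toy_metadata` objects are hypothetical) and matches columns by name, so the order of the metadata rows does not need to follow the column order. With a data.table, `setnames()` similarly accepts whole vectors of old and new names.

```r
# Small stand-ins for the dataset and its metadata (hypothetical values).
toy <- data.frame(AGE = c(39, 50), SEX = c("Male", "Female"))
toy_metadata <- data.frame(
  variable = c("SEX", "AGE"),
  prefix = c("C", "N"),
  stringsAsFactors = FALSE
)

# Match on names rather than position, so metadata row order does not matter.
idx <- match(toy_metadata$variable, names(toy))
names(toy)[idx] <- paste0(toy_metadata$prefix, "_", toy_metadata$variable)
names(toy)
#> [1] "N_AGE" "C_SEX"
```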

Validating Naming Conventions

The check_naming_conventions() function validates that variable prefixes match their actual data types:

# Check that our naming conventions are correctly applied
naming_check <- check_naming_conventions(adult_renamed)
head(naming_check)
#>                 ErrorType ErrorTarget
#> N_AGE               FALSE          NA
#> C_WORKCLASS         FALSE          NA
#> N_FNLWGT            FALSE          NA
#> C_EDUCATION         FALSE          NA
#> N_EDUCATIONNUM      FALSE          NA
#> C_MARITALSTATUS     FALSE          NA

# Count any errors in naming
sum(naming_check$ErrorType, na.rm = TRUE)
#> [1] 0

A value of FALSE for ErrorType means the variable follows the naming convention, so a sum of 0 confirms that no variable violates it.
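
To illustrate what such a validation involves, here is a simplified base R sketch that compares the type implied by a prefix with the actual class of the column. It is not how check_naming_conventions() is implemented: the helper `prefix_matches_class` is hypothetical, it only handles the N and C prefixes, and it reports an `ok` flag rather than the ErrorType and ErrorTarget columns shown above.

```r
# Hypothetical checker: compares the type implied by the prefix with the
# column's actual class. Only N_ and C_ are handled; other prefixes give NA.
prefix_matches_class <- function(df) {
  prefix <- substr(names(df), 1, 1)
  is_num <- vapply(df, is.numeric, logical(1))
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  ok <- ifelse(prefix == "N", is_num,
               ifelse(prefix == "C", is_cat, NA))
  data.frame(variable = names(df), ok = unname(ok), stringsAsFactors = FALSE)
}

toy <- data.frame(
  N_AGE = c(39L, 50L),
  C_SEX = factor(c("Male", "Female")),
  N_ZIP = c("75001", "1000"),  # deliberately mislabeled: character, not numeric
  stringsAsFactors = FALSE
)
prefix_matches_class(toy)
```

Here N_ZIP is flagged because its prefix declares a numeric variable while the column holds character data.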

Using Naming Conventions with targeter

Let’s run a target analysis using our conventionally named variables:

# Run targeter with naming conventions enabled
tar_with_naming <- targeter(
  data = adult_renamed,
  target = "Z_ABOVE50K",
  naming_conventions = TRUE,
  verbose = FALSE
)
#> 
#> INFO:target Z_ABOVE50K detected as type: binary
#> INFO:binary target contains number, automatic chosen level: 1; override using `target_reference_level`

# Examine variable classifications
head(tar_with_naming$variables[, c("variable", "var_type", "respect_naming_convention")])
#>           variable    var_type respect_naming_convention
#>             <char>      <char>                    <lgcl>
#> 1:     C_EDUCATION categorical                      TRUE
#> 2: C_MARITALSTATUS categorical                      TRUE
#> 3: C_NATIVECOUNTRY categorical                      TRUE
#> 4:    C_OCCUPATION categorical                      TRUE
#> 5:          C_RACE categorical                      TRUE
#> 6:  C_RELATIONSHIP categorical                      TRUE

By setting naming_conventions = TRUE, the targeter() function:

Automatically detects variable types based on prefixes
Validates that prefixes match actual data types
Optimizes processing for each variable type
Provides a more consistent analysis output

Using Metadata in Reports

The metadata we created can be used to generate professional reports with descriptive labels.

This is as simple as passing the metadata we just created to the report() function.

# Generate a report with metadata
report(
  tar_with_naming,
  metadata = adult_metadata,
  title = "Income Analysis using Naming Conventions",
  author = "Data Science Team"
)
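
Note that the metadata table lists the original variable names (for example AGE), while the analyzed dataset uses the prefixed names (for example N_AGE). How report() matches the two is not shown here, so consult the package documentation. As a generic illustration of label lookup only, the base R sketch below builds a named vector of labels keyed by the prefixed names; the `toy_metadata` table and the `renamed` column are assumptions made for this example.

```r
# Hypothetical two-row metadata table in the same shape as adult_metadata.
toy_metadata <- data.frame(
  variable = c("AGE", "SEX"),
  prefix = c("N", "C"),
  label = c("Age (years)", "Gender"),
  stringsAsFactors = FALSE
)

# Add the prefixed name so labels can be looked up by the renamed columns.
toy_metadata$renamed <- paste0(toy_metadata$prefix, "_", toy_metadata$variable)
labels <- setNames(toy_metadata$label, toy_metadata$renamed)

labels[["N_AGE"]]
#> [1] "Age (years)"
```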