| Title: | Privacy-Preserving Synthetic Data for 'LLM' Workflows | 
| Version: | 0.2.2 | 
| Description: | Generate privacy-preserving synthetic datasets that mirror structure, types, factor levels, and missingness; export bundles for 'LLM' workflows (data plus 'JSON' schema and guidance); and build fake data directly from 'SQL' database tables without reading real rows. Methods are related to approaches in Nowok, Raab and Dibben (2016) <doi:10.32614/RJ-2016-019> and the foundation-model overview by Bommasani et al. (2021) <doi:10.48550/arXiv.2108.07258>. | 
| License: | MIT + file LICENSE | 
| URL: | https://zobaer09.github.io/FakeDataR/, https://github.com/zobaer09/FakeDataR | 
| BugReports: | https://github.com/zobaer09/FakeDataR/issues | 
| Encoding: | UTF-8 | 
| RoxygenNote: | 7.3.2 | 
| Imports: | dplyr, jsonlite, zip | 
| Suggests: | readr, testthat (>= 3.0.0), knitr, rmarkdown, DBI, RSQLite, tibble, nycflights13, palmerpenguins, gapminder, arrow, withr | 
| VignetteBuilder: | knitr, rmarkdown | 
| Config/testthat/edition: | 3 | 
| Language: | en-US | 
| NeedsCompilation: | no | 
| Packaged: | 2025-09-30 03:48:13 UTC; Zobaer Ahmed | 
| Author: | Zobaer Ahmed [aut, cre] | 
| Maintainer: | Zobaer Ahmed <zunnun09@gmail.com> | 
| Repository: | CRAN | 
| Date/Publication: | 2025-10-06 08:10:19 UTC | 
Detect sensitive columns by name
Description
Uses a broad, configurable regex library to match likely PII columns.
You can extend it with extra_patterns (they get ORed in) or replace
everything with a single override_regex.
Usage
detect_sensitive_columns(x_names, extra_patterns = NULL, override_regex = NULL)
Arguments
| x_names | Character vector of column names to check. | 
| extra_patterns | Character vector of additional regexes to OR in. Examples: c("MRN", "NHS", "Aadhaar", "passport") | 
| override_regex | Optional single regex string that fully replaces the
defaults (case-insensitive). When supplied, it is used instead of the default patterns. | 
Value
Character vector of names from x_names that matched.
Examples
detect_sensitive_columns(c("id","email","home_phone","zip","notes"))
detect_sensitive_columns(names(mtcars), extra_patterns = c("^vin$", "passport"))
Save a fake dataset to disk
Description
Save a data.frame to CSV, RDS, or Parquet based on the file extension.
Usage
export_fake(x, path)
Arguments
| x | A data.frame (e.g., output of generate_fake_data()). | 
| path | File path. Supported extensions: .csv, .rds, .parquet. | 
Value
(Invisibly) the path written.
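Examples
A minimal sketch of writing and re-reading a file with export_fake(); the temporary path and data are illustrative, and the call is guarded in case the package is not installed:
```r
if (requireNamespace("FakeDataR", quietly = TRUE)) {
  # Write a small data frame to CSV; the extension selects the writer
  df <- data.frame(id = 1:3, score = c(0.2, 0.5, 0.9))
  p <- file.path(tempdir(), "fake_demo.csv")
  FakeDataR::export_fake(df, p)
  # Round-trip to confirm the file landed on disk
  round_trip <- read.csv(p)
  stopifnot(nrow(round_trip) == 3)
}
```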
Generate Fake Data from Real Dataset Structure
Description
Generate Fake Data from Real Dataset Structure
Usage
generate_fake_data(
  data,
  n = 30,
  category_mode = c("preserve", "generic", "custom"),
  numeric_mode = c("range", "distribution"),
  column_mode = c("keep", "generic", "custom"),
  custom_levels = NULL,
  custom_names = NULL,
  seed = NULL,
  verbose = FALSE,
  sensitive = NULL,
  sensitive_detect = TRUE,
  sensitive_strategy = c("fake", "drop"),
  normalize = TRUE
)
Arguments
| data | A tabular object; will be coerced via prepare_input_data(). | 
| n | Rows to generate (default 30). | 
| category_mode | One of "preserve", "generic", "custom". | 
| numeric_mode | One of "range", "distribution". | 
| column_mode | One of "keep", "generic", "custom". | 
| custom_levels | Optional named list of allowed levels per column (for category_mode = "custom"). | 
| custom_names | Optional named character vector old->new (for column_mode = "custom"). | 
| seed | Optional RNG seed. | 
| verbose | Logical; print progress. | 
| sensitive | Optional character vector of original column names to treat as sensitive. | 
| sensitive_detect | Logical; auto-detect common sensitive columns by name. | 
| sensitive_strategy | One of "fake","drop". Only applied if any sensitive columns exist. | 
| normalize | Logical; lightly normalize inputs (trim, %→numeric, short date-times→POSIXct). | 
Value
A data.frame of n rows with attributes:
- name_map (named chr: original -> output)
- column_mode (chr)
- sensitive_columns (chr; original names)
- dropped_columns (chr; original names that were dropped)
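Examples
A minimal sketch mirroring the structure of mtcars with a handful of synthetic rows; the call is guarded and the generated values are, by design, not reproducible from the real data:
```r
if (requireNamespace("FakeDataR", quietly = TRUE)) {
  # Ten fake rows that follow mtcars' columns, types, and ranges
  fake <- FakeDataR::generate_fake_data(mtcars, n = 10, seed = 42)
  stopifnot(nrow(fake) == 10)
  # Attributes document how output columns map back to originals
  attr(fake, "name_map")
}
```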
Generate fake data from a DB schema data.frame
Description
Generate fake data from a DB schema data.frame
Usage
generate_fake_from_schema(sch_df, n = 30, seed = NULL)
Arguments
| sch_df | A data.frame returned by schema_from_db(). | 
| n | Number of rows to generate. | 
| seed | Optional integer seed for reproducibility. | 
Value
A base data.frame with n rows and one column per schema
entry. Column classes follow the schema type values
(integer, numeric, character, logical, Date, POSIXct);
missingness is injected when nullable is TRUE.
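Examples
A minimal sketch of the schema-to-fake-data pipeline, following the same guarded in-memory SQLite pattern as the other examples; no real rows are read beyond the schema:
```r
if (requireNamespace("FakeDataR", quietly = TRUE) &&
    requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbWriteTable(con, "iris", iris[1:5, ])
  # Introspect the table's columns, then synthesize rows from the schema alone
  sch  <- FakeDataR::schema_from_db(con, "iris")
  fake <- FakeDataR::generate_fake_from_schema(sch, n = 25, seed = 1)
  stopifnot(nrow(fake) == 25)
}
```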
Generate a Fake POSIXct Column
Description
Create synthetic timestamps either by mimicking an existing POSIXct vector
(using its range and NA rate) or by sampling uniformly between start and end.
Usage
generate_fake_posixct_column(
  like = NULL,
  n = NULL,
  start = NULL,
  end = NULL,
  tz = "UTC",
  na_prop = NULL
)
Arguments
| like | Optional POSIXct vector to mimic. If supplied, its range and NA rate are used. | 
| n | Number of rows to generate. Required when like is NULL. | 
| start,end | Optional POSIXct bounds to sample between when like is NULL. | 
| tz | Timezone to use if like is NULL (default "UTC"). | 
| na_prop | Optional NA proportion to enforce in the output (0–1). If NULL, the NA rate of like is used when mimicking. | 
Value
A POSIXct vector of length n.
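Examples
A minimal sketch of both modes described above (mimic an existing vector, or sample between explicit bounds); dates are illustrative and the call is guarded:
```r
if (requireNamespace("FakeDataR", quietly = TRUE)) {
  # Mode 1: mimic an existing POSIXct vector (its range and NA rate)
  ts <- as.POSIXct(c("2024-01-01 09:00:00", NA, "2024-06-30 17:30:00"),
                   tz = "UTC")
  FakeDataR::generate_fake_posixct_column(like = ts)
  # Mode 2: sample uniformly between start and end
  out <- FakeDataR::generate_fake_posixct_column(
    n = 5,
    start = as.POSIXct("2024-01-01", tz = "UTC"),
    end   = as.POSIXct("2024-12-31", tz = "UTC")
  )
  stopifnot(length(out) == 5, inherits(out, "POSIXct"))
}
```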
Generate fake data with privacy controls
Description
Generates a synthetic copy of data, then optionally detects/handles
sensitive columns by name. Detection uses the ORIGINAL column names and
maps to output via attr(fake, "name_map") if present.
Usage
generate_fake_with_privacy(
  data,
  n = 30,
  level = c("low", "medium", "high"),
  seed = NULL,
  sensitive = NULL,
  sensitive_detect = TRUE,
  sensitive_strategy = c("fake", "drop"),
  normalize = TRUE,
  sensitive_patterns = NULL,
  sensitive_regex = NULL
)
Arguments
| data | A data.frame (or coercible) to mirror. | 
| n | Rows to generate (default same as input if NULL). | 
| level | One of "low","medium","high". | 
| seed | Optional RNG seed. | 
| sensitive | Character vector of original column names to treat as sensitive. | 
| sensitive_detect | Logical; auto-detect common sensitive columns by name. | 
| sensitive_strategy | One of "fake" or "drop". | 
| normalize | Logical; lightly normalize inputs. | 
| sensitive_patterns | Optional named list of patterns to treat as sensitive (e.g., list(id = "...", email = "...", phone = "...")). Overrides defaults. | 
| sensitive_regex | Optional fully-combined regex (single string) to detect sensitive columns by name. If supplied, it is used instead of defaults. | 
Value
data.frame with attributes: sensitive_columns, dropped_columns, name_map
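Examples
A minimal sketch showing name-based detection on a toy data frame; the column names and data are illustrative, and the call is guarded:
```r
if (requireNamespace("FakeDataR", quietly = TRUE)) {
  df <- data.frame(email = c("a@x.com", "b@y.org"),
                   age   = c(31, 45))
  fake <- FakeDataR::generate_fake_with_privacy(df, n = 10,
                                                level = "high", seed = 1)
  # "email" is typically flagged by the name-based detector
  attr(fake, "sensitive_columns")
  stopifnot(nrow(fake) == 10)
}
```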
Create a copy-paste prompt for LLMs
Description
Create a copy-paste prompt for LLMs
Usage
generate_llm_prompt(
  fake_path,
  schema_path = NULL,
  notes = NULL,
  write_file = TRUE,
  path = dirname(fake_path),
  filename = "README_FOR_LLM.txt"
)
Arguments
| fake_path | Path to the fake data file (CSV/RDS/Parquet). | 
| schema_path | Optional path to the JSON schema. | 
| notes | Optional extra notes to append for the analyst/LLM. | 
| write_file | Write a README txt next to the files? Default TRUE. | 
| path | Output directory for the README if write_file = TRUE. | 
| filename | README file name. Default "README_FOR_LLM.txt". | 
Value
The prompt string; if a README file was written, its path is returned invisibly.
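Examples
A minimal sketch that writes a throwaway fake CSV and builds a prompt from it without writing a README; the path and data are illustrative, and the call is guarded:
```r
if (requireNamespace("FakeDataR", quietly = TRUE)) {
  # A stand-in for a real fake-data file (illustrative path)
  p <- file.path(tempdir(), "fake_prompt_demo.csv")
  write.csv(data.frame(x = 1:3), p, row.names = FALSE)
  # Return the prompt text only; skip writing README_FOR_LLM.txt
  prompt <- FakeDataR::generate_llm_prompt(p, write_file = FALSE)
}
```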
Create a fake-data bundle for LLM workflows
Description
Generates fake data, writes files (CSV/RDS/Parquet), writes a scrubbed JSON schema, and optionally writes a README prompt and a single ZIP file containing everything.
Usage
llm_bundle(
  data,
  n = 30,
  level = c("medium", "low", "high"),
  formats = c("csv", "rds"),
  path = tempdir(),
  filename = "fake_bundle",
  seed = NULL,
  write_prompt = TRUE,
  zip = FALSE,
  prompt_filename = "README_FOR_LLM.txt",
  zip_filename = NULL,
  sensitive = NULL,
  sensitive_detect = TRUE,
  sensitive_strategy = c("fake", "drop"),
  normalize = FALSE
)
Arguments
| data | A data.frame (or coercible) to mirror. | 
| n | Number of rows in the fake dataset (default 30). | 
| level | Privacy level: "low", "medium", or "high". Controls stricter defaults. | 
| formats | Which data files to write: any of "csv","rds","parquet". | 
| path | Folder to write outputs. Default: tempdir(). | 
| filename | Base file name (no extension). Example: "demo_bundle". This becomes files like "demo_bundle.csv", "demo_bundle.rds", etc. | 
| seed | Optional RNG seed for reproducibility. | 
| write_prompt | Write a README_FOR_LLM.txt next to the data? Default TRUE. | 
| zip | Create a single zip archive containing data + schema + README? Default FALSE. | 
| prompt_filename | Name for the README file. Default "README_FOR_LLM.txt". | 
| zip_filename | Optional custom name for the ZIP file (no path).
If NULL, a name based on filename is used. | 
| sensitive | Character vector of column names to treat as sensitive (optional). | 
| sensitive_detect | Logical, auto-detect common sensitive columns (id/email/phone). Default TRUE. | 
| sensitive_strategy | "fake" (replace with realistic fakes) or "drop". Default "fake". | 
| normalize | Logical; if TRUE, attempt light auto-normalization before faking. | 
Details
Tips
Avoid using angle brackets in examples; prefer plain tokens like NAME
or FILE_NAME. If you truly want bracket glyphs, use Unicode angle brackets such as ⟨name⟩.
Value
List with paths: $data_paths (named), $schema_path, $readme_path (optional), $zip_path (optional), and $fake (data.frame).
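Examples
A minimal sketch of a complete bundle written to a temporary folder; the filename is illustrative, and the call is guarded:
```r
if (requireNamespace("FakeDataR", quietly = TRUE)) {
  out <- FakeDataR::llm_bundle(
    iris, n = 50, level = "medium",
    formats = "csv", path = tempdir(),
    filename = "iris_bundle", seed = 1, zip = FALSE
  )
  out$data_paths   # named paths of the written data files
  head(out$fake)   # the synthetic data itself
}
```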
Build an LLM bundle directly from a database table
Description
Reads just the schema from table on conn, synthesizes n fake rows,
writes a schema JSON, fake dataset(s), and a README prompt, and optionally
zips them into a single archive.
Usage
llm_bundle_from_db(
  conn,
  table,
  n = 30,
  level = c("medium", "low", "high"),
  formats = c("csv", "rds"),
  path = tempdir(),
  filename = "fake_from_db",
  seed = NULL,
  write_prompt = TRUE,
  zip = FALSE,
  zip_filename = NULL,
  sensitive_strategy = c("fake", "drop")
)
Arguments
| conn | A DBI connection. | 
| table | Character scalar: table name to read. | 
| n | Number of rows in the fake dataset (default 30). | 
| level | Privacy level: "low", "medium", or "high". Controls stricter defaults. | 
| formats | Which data files to write: any of "csv","rds","parquet". | 
| path | Folder to write outputs. Default: tempdir(). | 
| filename | Base file name (no extension). Example: "demo_bundle". This becomes files like "demo_bundle.csv", "demo_bundle.rds", etc. | 
| seed | Optional RNG seed for reproducibility. | 
| write_prompt | Write a README_FOR_LLM.txt next to the data? Default TRUE. | 
| zip | Create a single zip archive containing data + schema + README? Default FALSE. | 
| zip_filename | Optional custom name for the ZIP file (no path).
If NULL, a name based on filename is used. | 
| sensitive_strategy | "fake" (replace with realistic fakes) or "drop". Default "fake". | 
Value
Invisibly, a list with useful paths:
- schema_path – schema JSON
- files – vector of written fake-data files
- zip_path – zip archive path (if zip = TRUE)
Examples
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbWriteTable(con, "cars", head(cars, 20), overwrite = TRUE)
  out <- llm_bundle_from_db(
    con, "cars",
    n = 100, level = "medium",
    formats = c("csv","rds"),
    path = tempdir(), filename = "db_bundle",
    seed = 1, write_prompt = TRUE, zip = TRUE
  )
}
Prepare Input Data: Coerce to data.frame and (optionally) normalize values
Description
Converts common tabular objects to a base data.frame, and if normalize = TRUE
it applies light, conservative value normalization:
- Converts common date/time strings to POSIXct (best-effort across several formats)
- Converts percent-like character columns (e.g., "85%") to numeric (85)
- Maps a configurable set of "NA-like" strings to NA, while keeping common survey responses like "not applicable" or "prefer not to answer" as real levels
- Normalizes yes/no character columns to an ordered factor c("no", "yes")
Usage
prepare_input_data(
  data,
  normalize = TRUE,
  na_strings = c("", "NA", "N/A", "na", "No data", "no data"),
  keep_as_levels = c("not applicable", "prefer not to answer", "unsure"),
  percent_detect_threshold = 0.6,
  datetime_formats = c("%m/%d/%Y %H:%M:%S", "%m/%d/%Y %H:%M",
    "%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M", "%m/%d/%Y", "%Y-%m-%d")
)
Arguments
| data | An object coercible to data.frame. | 
| normalize | Logical; run the value normalization step (default TRUE). | 
| na_strings | Character vector of values that should become NA. | 
| keep_as_levels | Character vector of values that should be kept as real levels (not converted to NA). | 
| percent_detect_threshold | Proportion of non-missing values that must contain "%" before a character column is treated as percentages and converted to numeric. | 
| datetime_formats | Candidate formats tried (in order) when parsing date-time strings.
The best-fitting format (most successful parses) is used. Defaults cover
common US and ISO 8601 styles. | 
Value
A base data.frame.
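Examples
A minimal sketch of the normalizations listed above on a messy toy data frame (percent strings, NA-like strings, yes/no answers, date-time strings); the call is guarded:
```r
if (requireNamespace("FakeDataR", quietly = TRUE)) {
  raw <- data.frame(
    pct  = c("85%", "40%", "N/A"),
    ans  = c("yes", "no", "yes"),
    when = c("01/15/2024 09:30", "02/20/2024 14:00", "03/05/2024 08:15"),
    stringsAsFactors = FALSE
  )
  clean <- FakeDataR::prepare_input_data(raw)
  # Expected (best-effort): pct -> numeric, ans -> ordered factor,
  # when -> POSIXct; "N/A" becomes NA
  str(clean)
}
```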
Extract a table schema from a DB connection
Description
Returns a data frame describing the columns of a database table.
Usage
schema_from_db(conn, table, level = c("medium", "low", "high"))
Arguments
| conn | A DBI connection. | 
| table | Character scalar: table name to introspect. | 
| level | Privacy preset to annotate in schema metadata: one of "low", "medium", "high". Default "medium". | 
Value
A data.frame with column metadata (e.g., name, type).
Examples
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbWriteTable(con, "mtcars", mtcars[1:3, ])
  sc <- schema_from_db(con, "mtcars")
  head(sc)
}
Validate a fake dataset against the original
Description
Compares classes, NA/blank proportions, and simple numeric ranges.
Usage
validate_fake(original, fake, tol = 0.15)
Arguments
| original | A data.frame. | 
| fake | A data.frame with the same columns. | 
| tol | Numeric tolerance for proportion differences (default 0.15). | 
Value
data.frame summary by column
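Examples
A minimal sketch comparing a real dataset against its synthetic copy (column names are kept by default, so the two align); the call is guarded:
```r
if (requireNamespace("FakeDataR", quietly = TRUE)) {
  fake <- FakeDataR::generate_fake_data(mtcars, n = nrow(mtcars), seed = 1)
  # One summary row per column: classes, NA/blank rates, numeric ranges
  chk <- FakeDataR::validate_fake(mtcars, fake, tol = 0.15)
  head(chk)
}
```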
Zip a set of files for easy sharing
Description
Zip a set of files for easy sharing
Usage
zip_llm_bundle(files, zipfile)
Arguments
| files | Character vector of file paths. | 
| zipfile | Path to the zip file to create. | 
Value
The path to the created zip file.
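Examples
A minimal sketch zipping two throwaway files from a temporary folder; paths are illustrative, and the call is guarded:
```r
if (requireNamespace("FakeDataR", quietly = TRUE)) {
  f1 <- file.path(tempdir(), "a.csv")
  f2 <- file.path(tempdir(), "b.csv")
  write.csv(head(cars), f1, row.names = FALSE)
  write.csv(head(iris), f2, row.names = FALSE)
  z <- FakeDataR::zip_llm_bundle(c(f1, f2),
                                 file.path(tempdir(), "bundle.zip"))
  stopifnot(file.exists(z))
}
```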