Fake Data for Testing — What's Safe, What's Not

A developer pulls a subset of production users into a staging database to test a new feature. Names, emails, addresses, phone numbers: all real. The staging server has weaker access controls than production. Six months later, a copy of the staging database is exposed through a misconfigured S3 bucket, and 50,000 people's personal data is public. This scenario isn't far-fetched. It's the kind of breach that shows up regularly in GDPR enforcement actions.

Using fake data in test environments isn't just good practice. Depending on your jurisdiction, it may be a legal requirement.

Why Real Data in Test Environments Is a Problem

GDPR (EU/EEA)

The General Data Protection Regulation restricts processing personal data to specific, documented purposes. If you collected someone's data to provide a service, using it to test an unrelated feature is a different purpose. You'd need a separate legal basis for it — and "we needed test data" isn't one of the six lawful bases.

GDPR also requires data minimization: don't process more personal data than necessary. A staging database holding full production records is very hard to square with this principle, because testing rarely needs real identities. Fines for the most serious GDPR violations go up to €20 million or 4% of worldwide annual turnover, whichever is higher. Meta was fined €1.2 billion in 2023 over transfers of EU user data to the US. Smaller companies have been fined hundreds of thousands of euros for less.

CCPA (California)

The California Consumer Privacy Act gives residents the right to know how their data is used. If your privacy policy says data is collected "to provide services" and you're also using it for internal testing, that's a discrepancy. CCPA doesn't have GDPR's "lawful basis" framework, but it does require transparency about data use, and it does have enforcement teeth — the California Attorney General has brought actions against companies for unauthorized data use.

The Practical Risk

Legal compliance aside, test environments are almost always less secure than production. Staging servers have fewer access controls, logging is often disabled, and data is routinely copied to developer laptops. Every copy of real data that exists outside production is another attack surface. If your production database has good security but your test data is a copy of production, your effective security is whatever your weakest test environment has.

Tools for Generating Fake Data

Faker.js (JavaScript/TypeScript)

Faker.js (@faker-js/faker) is the most widely used fake data library in the JavaScript ecosystem. It generates realistic-looking names, addresses, phone numbers, emails, company names, lorem ipsum text, dates, and dozens of other data types. It supports locales: in current versions you import a localized instance, for example import { fakerDE } from '@faker-js/faker', to get German-format addresses and names (older releases set faker.locale = 'de' instead).

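```javascript
import { faker } from '@faker-js/faker';

// One fake user; every call returns new random values.
const user = {
  name: faker.person.fullName(),
  email: faker.internet.email(),
  address: faker.location.streetAddress(true), // true: include a secondary line (apartment, suite)
  phone: faker.phone.number(),
};
```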

Faker.js is ideal for seeding databases with thousands of records. Set the seed (faker.seed(12345)) to get the same test data on every run and in every environment, provided each one uses the same Faker version.
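
To see why a seed matters, here is a dependency-free sketch of the idea. It is not how Faker.js works internally; the mulberry32 generator, the makeUsers helper, and the name pools are illustrative assumptions. The point is only that the same seed yields the same records.

```javascript
// Small deterministic PRNG (mulberry32). Illustrative only; Faker.js ships its own generator.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Hypothetical name pools; any fictional values work.
const FIRST_NAMES = ['Alex', 'Sam', 'Jordan', 'Taylor', 'Morgan'];
const LAST_NAMES = ['Rivera', 'Chen', 'Okafor', 'Novak', 'Haddad'];

function pick(rand, items) {
  return items[Math.floor(rand() * items.length)];
}

// Build `count` fake users from a seed. Same seed in, same users out.
function makeUsers(seed, count) {
  const rand = mulberry32(seed);
  return Array.from({ length: count }, (_, i) => {
    const first = pick(rand, FIRST_NAMES);
    const last = pick(rand, LAST_NAMES);
    return {
      id: i + 1,
      name: `${first} ${last}`,
      email: `${first}.${last}.${i + 1}@example.com`.toLowerCase(),
    };
  });
}

const runA = makeUsers(12345, 3);
const runB = makeUsers(12345, 3);
console.log(JSON.stringify(runA) === JSON.stringify(runB)); // true: identical seed, identical data
```

A fixed seed is what lets a failing test be reproduced on another machine with exactly the same records.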

Factory Bot (Ruby), Hypothesis (Python), AutoFixture (C#)

Every major language has fake data libraries. Ruby's Factory Bot builds test objects and integrates with Rails test suites. Python's Hypothesis goes beyond fake data into property-based testing: it generates inputs designed to find edge cases. C#'s AutoFixture creates test objects populated with random but valid data.

The common thread: use the library your language already has rather than copying production data. It takes maybe 30 minutes to set up a fake data generator for your schema, and it eliminates an entire class of compliance risk.

Our Tool

Our fake address generator produces plausible addresses, names, and phone numbers for 8 countries. It's designed for quick one-off test data: copy a fake address into a form to test validation, or grab a fake phone number for a mock API response. For bulk data generation (thousands of records), use a programmatic tool like Faker.js.

What Makes Fake Data Plausible Enough

Fake data needs to be realistic enough that your application processes it the same way it would process real data. If your fake addresses don't match the format your address parser expects, you're testing the generator, not your application.

Addresses should follow the format conventions of the target country. US addresses need a street number, street name, city, two-letter state abbreviation, and 5-digit ZIP code (optionally extended to ZIP+4). UK addresses use a different structure: flat or house number, street, town or city, optionally a county, and a postcode such as SW1A 1AA (postcode formats vary in length). Japanese addresses reverse the order entirely (postal code, prefecture, city, district, block, building).
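
One way to catch format drift early is to run generated fixtures through a shape check before seeding. The sketch below is a rough illustration: the function names and the record fields (street, city, state, zip) are assumptions, the patterns test shape only, and the UK postcode pattern is deliberately simplified.

```javascript
// Format-only checks: they confirm shape, not that the address exists.
const US_ZIP = /^\d{5}(-\d{4})?$/; // 5-digit ZIP, optional +4
const US_STATE = /^[A-Z]{2}$/; // two uppercase letters (shape only, not a state lookup)
const US_STREET = /^\d+\s+\S.*$/; // street number, then street name
// Simplified UK postcode shape: outward code, space, inward code.
const UK_POSTCODE = /^[A-Z]{1,2}\d[A-Z\d]?\s\d[A-Z]{2}$/;

function isPlausibleUsAddress(addr) {
  return (
    US_STREET.test(addr.street) &&
    addr.city.trim().length > 0 &&
    US_STATE.test(addr.state) &&
    US_ZIP.test(addr.zip)
  );
}

function isPlausibleUkPostcode(postcode) {
  return UK_POSTCODE.test(postcode.toUpperCase());
}

const fixture = { street: '123 Example St', city: 'Testville', state: 'NY', zip: '12345' };
console.log(isPlausibleUsAddress(fixture)); // true
console.log(isPlausibleUsAddress({ ...fixture, zip: '1234' })); // false: ZIP too short
console.log(isPlausibleUkPostcode('SW1A 1AA')); // true
```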

Phone numbers should use valid-looking formats but avoid real numbers. In the US, 555 is an exchange prefix (the three digits after the area code), not an area code, and only part of it is set aside for fiction: numbers 555-0100 through 555-0199 are reserved for fictional use and are safe for testing with any area code. Outside that range, a 555 number might belong to a real service (555-1212 is directory assistance in most areas). For international numbers, use the country's format and, where the national regulator publishes a range reserved for fiction or testing, take numbers from it; otherwise substitute the local part with obviously fake digits.
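
A minimal sketch of generating and checking US numbers in the 555-0100 through 555-0199 range. The function names and the default area code are assumptions made for illustration.

```javascript
// Build a US-style fictional number: an area code plus 555-01XX.
// `index` selects one of the 100 reserved line numbers (0100 through 0199).
function fictionalUsPhone(index, areaCode = '202') {
  if (!Number.isInteger(index) || index < 0 || index > 99) {
    throw new RangeError('index must be an integer from 0 to 99');
  }
  if (!/^[2-9]\d{2}$/.test(areaCode)) {
    throw new RangeError('areaCode must be three digits and start with 2-9');
  }
  const line = String(100 + index).padStart(4, '0'); // 0100 .. 0199
  return `(${areaCode}) 555-${line}`;
}

// True only for numbers whose exchange is 555 and whose line is 0100-0199.
function isFictionalUsPhone(phone) {
  const digits = phone.replace(/\D/g, '');
  // Drop a leading country code "1" if present.
  const national = digits.length === 11 && digits.startsWith('1') ? digits.slice(1) : digits;
  return /^\d{3}55501\d{2}$/.test(national);
}

console.log(fictionalUsPhone(0)); // (202) 555-0100
console.log(fictionalUsPhone(42)); // (202) 555-0142
console.log(isFictionalUsPhone('+1 202-555-0142')); // true
console.log(isFictionalUsPhone('202-555-1212')); // false: outside the reserved range
```

The checker is handy in a fixture linter: any US-format number that fails it is a candidate for being someone's real phone.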

Email addresses should use domains that won't route to real mailboxes. Use clearly reserved testing formats:

example.com, example.org, example.net: second-level domains reserved for documentation and examples
.test: reserved TLD for testing, never delegated in the public DNS
.invalid: reserved TLD for names that are guaranteed not to resolve

Never use a real email provider with a random local part. You might hit a real person's inbox. Use an address on a reserved domain that is clearly marked invalid, such as test.user@example.invalid.
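
A guard along these lines can keep real-provider addresses out of fixtures. It is a sketch: the function names are hypothetical, and the lists cover the reserved names from RFC 2606 and RFC 6761 that are most useful for email.

```javascript
// Names reserved by RFC 2606 / RFC 6761 for documentation and testing.
const RESERVED_DOMAINS = ['example.com', 'example.org', 'example.net'];
const RESERVED_TLDS = ['test', 'invalid', 'example'];

// True when the address sits on a reserved domain or under a reserved TLD.
function isReservedTestEmail(email) {
  const at = email.lastIndexOf('@');
  if (at < 1) return false; // no local part or no "@"
  const domain = email.slice(at + 1).toLowerCase();
  if (domain.length === 0) return false;
  const tld = domain.split('.').pop();
  return (
    RESERVED_TLDS.includes(tld) ||
    RESERVED_DOMAINS.some((d) => domain === d || domain.endsWith(`.${d}`))
  );
}

// Hypothetical helper: one unique address per test user on a reserved domain.
function testEmail(userId) {
  return `user${userId}@example.com`;
}

console.log(isReservedTestEmail(testEmail(7))); // true
console.log(isReservedTestEmail('qa@staging.test')); // true
console.log(isReservedTestEmail('someone@gmail.com')); // false: real provider
```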

Credit card numbers for testing should use the designated test numbers provided by payment processors:

| Provider | Test Number | Notes |
|----------|-------------|-------|
| Stripe | 4242 4242 4242 4242 | Visa, always succeeds |
| Stripe | 4000 0000 0000 0002 | Always declines |
| Braintree | 4111 1111 1111 1111 | Visa sandbox |
| PayPal sandbox | 4032 0385 8800 2118 | Visa |

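Stripe's and Braintree's test numbers pass Luhn check validation and are recognized by each processor's sandbox as test cards. Sandbox numbers change over time, and PayPal's sandbox can generate its own test cards, so confirm any number against the processor's current documentation before relying on it. Never generate random numbers that pass Luhn, because you might accidentally produce a real card number. Always use the processor's official test numbers.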
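
Passing a Luhn check only means a number is well-formed, which is exactly why random Luhn-valid numbers are risky. A minimal sketch, with hypothetical function names, that pairs a Luhn check with an allowlist holding the Stripe and Braintree numbers from the table above:

```javascript
// Luhn checksum: from the right, double every second digit, subtract 9 from
// results above 9, and require the total to be divisible by 10.
function passesLuhn(cardNumber) {
  const digits = cardNumber.replace(/[\s-]/g, '');
  if (!/^\d+$/.test(digits)) return false;
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// Allowlist of official test cards; extend it from your processor's documentation.
const OFFICIAL_TEST_CARDS = new Set([
  '4242424242424242', // Stripe, succeeds
  '4000000000000002', // Stripe, declines
  '4111111111111111', // Braintree sandbox
]);

// Fixtures should pass this check, not merely the Luhn check.
function isAllowedTestCard(cardNumber) {
  return OFFICIAL_TEST_CARDS.has(cardNumber.replace(/[\s-]/g, ''));
}

console.log(passesLuhn('4242 4242 4242 4242')); // true
console.log(passesLuhn('4242 4242 4242 4241')); // false: checksum broken
console.log(isAllowedTestCard('4111 1111 1111 1111')); // true
```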

Social Security Numbers should never be generated randomly, even for fake data. The SSA has assigned enough numbers that random generation risks hitting a real one. Use obviously invalid values instead. The SSA does not assign numbers whose first three digits are 000, 666, or 900-999, and that stayed true after it randomized assignment in 2011; note, though, that numbers beginning with 9 are used by the IRS for ITINs, so a 9xx value can still collide with a real taxpayer identifier. The safest approach is to use a fixed placeholder like 000-00-0000 (which the SSA has never assigned) or skip the field entirely in test data.
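
A fixture check for this could look like the sketch below. It assumes the SSA's published rule that area numbers 000, 666, and 900-999, group 00, and serial 0000 are never issued; the constant and function names are made up for the example.

```javascript
// Fixed placeholder recommended for test data.
const TEST_SSN = '000-00-0000';

// True when an SSN-shaped string falls in a range the SSA never assigns:
// area 000, 666, or 900-999; group 00; or serial 0000.
function isNeverAssignedSsn(ssn) {
  const m = /^(\d{3})-(\d{2})-(\d{4})$/.exec(ssn);
  if (!m) return false; // not SSN-shaped at all
  const [, area, group, serial] = m;
  const areaNum = Number(area);
  return (
    areaNum === 0 ||
    areaNum === 666 ||
    areaNum >= 900 ||
    group === '00' ||
    serial === '0000'
  );
}

// Strictest policy: fixtures may only contain the fixed placeholder.
function isAllowedFixtureSsn(ssn) {
  return ssn === TEST_SSN;
}

console.log(isNeverAssignedSsn(TEST_SSN)); // true
console.log(isNeverAssignedSsn('123-45-6789')); // false: could be a real number
console.log(isAllowedFixtureSsn('900-12-3456')); // false: may collide with an ITIN
```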

Anonymization vs. Synthetic Data

If you absolutely must use production data structure for testing, there are two approaches:

Anonymization replaces identifying fields in real records with fake values. Replace names with random names, hash email addresses, randomize phone numbers. The advantage: the data relationships are real (user A really has 3 orders, user B really has 17), so your tests exercise realistic data distributions. The risk: anonymization is surprisingly hard to do well. Research has repeatedly shown that "anonymized" datasets can be re-identified by cross-referencing with other data sources. Netflix's "anonymized" movie ratings were de-anonymized by matching against public IMDb ratings in 2007.
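
As a rough sketch of the idea, the function below swaps identifying fields for placeholders while keeping the link between users and their orders. The record shapes (id, name, email, phone, userId) are assumptions, and this covers direct identifiers only: fields left untouched, such as timestamps or order totals, can still allow the kind of re-identification described above.

```javascript
// Replace identifying fields with placeholders, keeping user-to-order links intact.
function anonymize(users, orders) {
  const idMap = new Map(); // original user id -> surrogate id
  const safeUsers = users.map((u, i) => {
    const surrogate = i + 1;
    idMap.set(u.id, surrogate);
    return {
      id: surrogate,
      name: `Test User ${surrogate}`,
      email: `user${surrogate}@example.com`,
      phone: `(202) 555-01${String(i % 100).padStart(2, '0')}`,
    };
  });
  // Orders keep their own fields; only the foreign key is remapped.
  // An order whose user is missing from `users` ends up with userId undefined.
  const safeOrders = orders.map((o) => ({ ...o, userId: idMap.get(o.userId) }));
  return { users: safeUsers, orders: safeOrders };
}

// Stand-in input; in practice this would be the production extract.
const inputUsers = [
  { id: 501, name: 'Original Name A', email: 'a@example.org', phone: '(202) 555-0150' },
  { id: 502, name: 'Original Name B', email: 'b@example.org', phone: '(202) 555-0151' },
];
const inputOrders = [
  { orderId: 1, userId: 501, total: 20 },
  { orderId: 2, userId: 501, total: 35 },
  { orderId: 3, userId: 502, total: 12 },
];

const safe = anonymize(inputUsers, inputOrders);
console.log(safe.users[0].name); // Test User 1
console.log(safe.orders.filter((o) => o.userId === 1).length); // 2: relationship preserved
```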

Synthetic data generates entirely new records with no one-to-one connection to real people. Faker.js builds records from random values, so no production data is involved at all. Model-based tools such as Gretel or Mostly AI learn from your production data and produce synthetic datasets that match its statistical properties (same distribution of ages, geographic spread, order frequency) without copying real records. This is the safer approach because there are no real records to re-identify, although output from a model trained on production data should still be checked for memorized values.
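
If you do want a realistic distribution without touching real records, one lightweight option is to sample from aggregate statistics only. The sketch below assumes a made-up distribution of orders per user; the weights, field names, and function names are all illustrative.

```javascript
// Pick a value according to weights, e.g. [{ value: 0, weight: 0.4 }, ...].
// `rand` is injectable so tests can pass a deterministic source.
function weightedPick(options, rand = Math.random) {
  const total = options.reduce((sum, o) => sum + o.weight, 0);
  let r = rand() * total;
  for (const o of options) {
    r -= o.weight;
    if (r < 0) return o.value;
  }
  return options[options.length - 1].value; // guard against floating-point drift
}

// Hypothetical target distribution of orders per user, taken from aggregates only.
const ORDERS_PER_USER = [
  { value: 0, weight: 0.4 },
  { value: 1, weight: 0.3 },
  { value: 3, weight: 0.2 },
  { value: 17, weight: 0.1 },
];

// Build synthetic users whose order counts follow the target distribution.
function syntheticUsers(count, rand = Math.random) {
  return Array.from({ length: count }, (_, i) => ({
    id: i + 1,
    email: `user${i + 1}@example.com`,
    orderCount: weightedPick(ORDERS_PER_USER, rand),
  }));
}

const sample = syntheticUsers(1000);
const withoutOrders = sample.filter((u) => u.orderCount === 0).length;
console.log(withoutOrders / sample.length); // roughly 0.4, varies per run
```

Only the weights come from production, as aggregate numbers; no individual record is read or copied.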

For most applications, fully synthetic data generated by Faker.js or similar tools is sufficient. You don't need production-realistic distributions to test that your address form validates correctly or that your API returns the right HTTP status codes.

Checklist Before You Ship Test Data

Before committing any test data to a repository or deploying it to a test environment:

No real names, emails, or phone numbers. Even "just for testing."
Email domains are reserved (.invalid or a clearly documented test-only domain).
Phone numbers use 555-01xx (US) or the country's fictional range.
Credit card numbers are from the processor's official test list, not randomly generated.
No real SSNs, passport numbers, or government IDs. Use fixed placeholders.
Addresses are plausible but fictional. Use our fake address generator or Faker.js.
Test data is marked as fake in the database, for example with a dedicated flag field, so it can't leak into production analytics.
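
The last item can be as simple as the sketch below. The isTestData field name and both helper functions are assumptions; use whatever flag your schema already has.

```javascript
// Every seeded record carries an explicit marker (field name is an assumption).
function markAsTestData(record) {
  return { ...record, isTestData: true };
}

// Analytics code drops anything marked as test data before aggregating.
function excludeTestData(records) {
  return records.filter((r) => r.isTestData !== true);
}

const seeded = [
  markAsTestData({ id: 1, email: 'user1@example.com' }),
  markAsTestData({ id: 2, email: 'user2@example.com' }),
];
const mixed = [...seeded, { id: 3, email: 'customer@example.org' }]; // unflagged stand-in record

console.log(excludeTestData(mixed).length); // 1: only the unflagged record remains
```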

Fake data takes minutes to generate. Cleaning up a breach takes months. The math is straightforward.