Generating test data using zefaker


zefaker is a command-line program that lets you generate SQL, CSV, Excel and JSON files from a simple Groovy script, without needing to compile custom code. It is especially useful for generating test files when you’re developing or testing data pipelines.

zefaker is a Java application and requires Java 8+ to run. It is a command-line program, meaning it does not yet have a Graphical User Interface (GUI). That should not be a problem, since this tutorial goes through how to use it step by step.

DISCLAIMER: I am the author of zefaker and I created it at Credit Data CRB for generating test data for different data pipelines and development environments. The views in this article are mine alone and zefaker is not an official product of Credit Data CRB.

What you will need

  • About 10 minutes
  • A favorite text editor (one that supports Groovy syntax is a plus)
  • zefaker (version 0.5rc1 at the time of writing)
  • JDK 8+
  • Maven 3.3+

Step 0: Get zefaker

After making sure you have Java installed, proceed to download zefaker. Find it here. Extract the archive to a directory of your choosing. The current build (v0.5rc1) extracts to a directory named zefaker-shadow. Checking the contents of that directory, you should find bin and lib directories.

$ ls zefaker-shadow/
bin/  lib/

$ ls zefaker-shadow/bin
zefaker  zefaker.bat

We are going to use the scripts in the bin directory, but it’s also possible to run zefaker directly via java using java -jar zefaker-shadow/lib/zefaker-all.jar

Step 1: Create a data.groovy file

To generate a data file using zefaker, we need to describe the fields we want in the generated file, including their names and positions (for formats where order matters). We do this by creating a configuration file that uses Groovy syntax. Let’s say we want to generate a file of random people containing their first name, last name, age and favorite superhero.

First we will create a file named data.groovy:

firstname = column(index=0, name="FirstName") // (1)
surname = column(index=1, name="Surname") // (2)
age = column(index=2, name="Age") // (3)
favSuperhero = column(index=3, name="FavSuperHero") // (4)

columns = [
    (firstname): { faker -> faker.name().firstName() }, // (5)
    (surname): { faker -> faker.name().lastName() }, // (6)
    (age): { faker -> faker.number().numberBetween(16, 45) }, // (7)
    (favSuperhero): { faker -> faker.superhero().name() } // (8)
]

generateFrom columns // (9)

The explanation for each line is given below:

  • 1 - 4 - these lines define the columns we want to create in the generated file.
  • 5 - this line defines the function that will be used to generate first names.
  • 6 - this line defines the function that will be used to generate surnames.
  • 7 - this line defines the function that will be used to generate a random age.
  • 8 - this line defines the function that will be used to generate a random superhero name.
  • 9 - this line tells zefaker which columns to use to generate the data. It MUST always be present.

NOTE: As you can observe, the functions are Groovy closures (which you can basically think of as functions). In this case each closure receives a Faker object from the java-faker project as its argument, so any method you can call on that object is valid for generating random data.
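
As a rough illustration of that note, the sketch below builds a different set of columns from other java-faker providers. The column names (Email, City, Phone, Employer) are made up for this example. The faker calls are standard java-faker methods, but it is worth checking them against the java-faker version bundled with your zefaker build.

```groovy
// Hypothetical columns for illustration only; the names are not part of zefaker.
email = column(index=0, name="Email")
city = column(index=1, name="City")
phone = column(index=2, name="Phone")
employer = column(index=3, name="Employer")

columns = [
    (email): { faker -> faker.internet().emailAddress() },  // random email address
    (city): { faker -> faker.address().city() },            // random city name
    (phone): { faker -> faker.phoneNumber().cellPhone() },  // random mobile number
    (employer): { faker -> faker.company().name() }         // random company name
]

generateFrom columns
```

The structure is the same as in data.groovy: only the provider method called inside each closure changes.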

Step 2: Generate the data!

Using the data.groovy file we created in the previous step we can generate 1000 rows of random data in an Excel Workbook file using the following command.

NOTE: I’ve assumed you placed the data.groovy file in the same directory where you extracted zefaker in Step 0.

$ ./zefaker-shadow/bin/zefaker -f=data.groovy -output="1k-random-people.xlsx" -sheet="TestPeople" -rows=1000 

This command will create 1000 rows of random data in a file named 1k-random-people.xlsx in the directory you run it in.

Export to SQL

$ ./zefaker-shadow/bin/zefaker -f=data.groovy -output="1k-random-people.sql" -table="people" -rows=1000 -sql 

Export to CSV

$ ./zefaker-shadow/bin/zefaker -f=data.groovy -output="1k-random-people.csv" -rows=1000 -csv 

Export to JSON

$ ./zefaker-shadow/bin/zefaker -f=data.groovy -output="1k-random-people.json" -rows=1000 -json

Bonus: Define custom functions for generating data

Since the configuration file we use is a Groovy script, we can take advantage of the power of Groovy to create custom functions for generating data.

Let’s change our script to add a column named “LastSeen” to keep track of when we last saw these people:

import java.time.LocalDate
import java.time.format.DateTimeFormatter

firstname = column(index=0, name="FirstName")
surname = column(index=1, name="Surname")
age = column(index=2, name="Age")
favSuperhero = column(index=3, name="FavSuperHero")
lastSeen = column(index=4, name="LastSeen") // (1)

def lastSeenFunc = { faker -> // (2)
    def year = LocalDate.now().getYear()
    // Pick a month from the first half of the current year
    def month = faker.number().numberBetween(1, 6)
    // Keep the day below 29 so it is valid in every month, including February
    def day = faker.number().numberBetween(1, 28)
    return LocalDate.of(year, month, day).format(DateTimeFormatter.ISO_LOCAL_DATE) // (3)
}

columns = [
    (firstname): { faker -> faker.name().firstName() },
    (surname): { faker -> faker.name().lastName() },
    (age): { faker -> faker.number().numberBetween(16, 45) },
    (favSuperhero): { faker -> faker.superhero().name() },
    (lastSeen): lastSeenFunc // (4)
]

generateFrom columns

Observe the changes we have made to the file:

  • 1 - this line defines a new column named “LastSeen”.
  • 2 - this line defines the function for generating the last seen date as a closure.
  • 3 - we use the java.time.LocalDate class to build a date from the current year and a random month and day, which is then formatted as an ISO 8601 date.
  • 4 - the last seen column and its function are added to the columns definition.

We can now use this file to generate random individuals with their last seen date using the same command from Step 2. As you can see, a Groovy configuration file gives us the full power of the language when generating data for Excel, CSV, JSON or SQL output.
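
The closure above picks a month near the start of the current year. If the dates should instead fall within the last six months relative to today, one possible variation is to subtract a random number of days from the current date. This is only a sketch: it treats six months as roughly 180 days, and the recentlySeenFunc name is invented for this example.

```groovy
import java.time.LocalDate

lastSeen = column(index=0, name="LastSeen")

// Hypothetical closure: picks a date between today and roughly 180 days ago.
def recentlySeenFunc = { faker ->
    def daysAgo = faker.number().numberBetween(0, 180)
    // LocalDate.toString() produces an ISO 8601 date (yyyy-MM-dd)
    return LocalDate.now().minusDays(daysAgo).toString()
}

columns = [
    (lastSeen): recentlySeenFunc
]

generateFrom columns
```

Unlike a fixed month range, this approach never produces a date in the future, whatever time of year the script runs.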

Conclusion

In this article we have seen how we can generate random data using zefaker - a Java command-line tool. We have seen how to write the configuration file in Groovy to specify how to generate our random data. I hope this was useful.

Reach out on Twitter @zikani03 if you have any comments, suggestions or corrections.
