How to Generate Test Data for MongoDB With Python

For testing purposes, especially when a project stores its information in a database, you may need data to try the project out. In that case, you have two options:

Find a good dataset (Kaggle) or
Use a library like Faker

In this blog post, you will learn how to generate test data for MongoDB using Faker.

Requirements

Dependencies

Make sure all the dependencies are installed before creating the Python script that will generate the data for your project.

You can create a requirements.txt file with the following content:

Shell

pandas
pymongo
tqdm
faker

Once you have created this file, run the following command:

Shell

pip install -r requirements.txt

Or if you’re using Anaconda, create an environment.yml file:

name: percona
dependencies:
  - python=3.10
  - pandas
  - pymongo
  - tqdm
  - faker

You can change the Python version; this script has been tested with Python 3.7, 3.8, 3.9, 3.10, and 3.11.

Run the following statement to configure the project environment:

Shell

conda env create -f environment.yml

Fake data with Faker

Faker is a Python library that can be used to generate fake data through properties defined in the package.

Python

from faker import Faker

fake = Faker()

for _ in range(10):
    print(fake.name())

The code above prints ten names; each call to name() returns a new random value. name() is a property of the generator. Every such property is called a fake, and there are many of them, packaged in providers.

Some providers and properties available in the Faker library include:

faker.providers.person
- name → John Doe
- first_name → Katherine
- last_name → Chang
faker.providers.address
- address → 791 Crist Parks, Sashabury, IL 86039-9874
- city → Sashabury
- country → Hungary
faker.providers.job
- job → Musician
faker.providers.company
- company → Acme Ltd
faker.providers.internet
- email → [email protected]
- company_email → [email protected]

You can find more information on bundled and community providers in the documentation.

Creating a Pandas DataFrame

Now that you know Faker and its properties, create a modules directory and, inside it, a module named dataframe.py. This module will be imported later into the main script, and it is where we define the function that generates the data.

Python

from multiprocessing import cpu_count

import pandas as pd
from tqdm import tqdm
from faker import Faker

These are the libraries the module needs. Multiprocessing is used to reduce the execution time of the script and is explained later. The imports are:

pandas. Data generated with Faker will be stored in a Pandas DataFrame before being imported into the database.
tqdm(). Required for adding a progress bar to show the progress of the DataFrame creation.
Faker(). It’s the generator from the faker library.
cpu_count(). This is a method from the multiprocessing module that will return the number of cores available.

Python

fake = Faker()

# Leave one core free, but never drop below one worker.
num_cores = max(1, cpu_count() - 1)

Faker() creates and initializes a faker generator, which can generate data by accessing the properties.

num_cores stores the number of cores reported by cpu_count() minus one, so that one core stays free for the rest of the system.

Python

def create_dataframe(arg):
    # arg is the task number passed in by pool.map(); it is not used here.
    # Number of rows this worker generates.
    x = int(60000/num_cores)
    data = pd.DataFrame()
    for i in tqdm(range(x), desc='Creating DataFrame'):
        # Each assignment fills one column of row i.
        data.loc[i, 'first_name'] = fake.first_name()
        data.loc[i, 'last_name'] = fake.last_name()
        data.loc[i, 'job'] = fake.job()
        data.loc[i, 'company'] = fake.company()
        data.loc[i, 'address'] = fake.address()
        data.loc[i, 'city'] = fake.city()
        data.loc[i, 'country'] = fake.country()
        data.loc[i, 'email'] = fake.email()
    return data

Then we define the create_dataframe() function, where:

x is the variable that will determine the number of iterations of the for loop where the DataFrame is created.
data is an empty DataFrame that will later be filled with data generated with Faker.
The pandas DataFrame.loc attribute provides access to a group of rows and columns by their label(s). In each iteration, a row of data is added to the DataFrame, and this attribute is what assigns a value to each column.

The DataFrame that is created after calling this function will have the following columns:

Shell

 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   first_name  60000 non-null  object
 1   last_name   60000 non-null  object
 2   job         60000 non-null  object
 3   company     60000 non-null  object
 4   address     60000 non-null  object
 5   city        60000 non-null  object
 6   country     60000 non-null  object
 7   email       60000 non-null  object
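The create_dataframe() function assigns through .loc one cell at a time, which is easy to follow but tends to get slower as the DataFrame grows. One common alternative, sketched here on the assumption that all rows fit in memory, is to collect the rows as dictionaries and build the DataFrame once. build_dataframe and make_row are hypothetical names; make_row stands in for the Faker calls so the snippet runs with only pandas installed.

```python
import pandas as pd


def make_row(i):
    """Stand-in for the Faker calls; returns one row as a dict."""
    return {"first_name": f"first_{i}", "last_name": f"last_{i}"}


def build_dataframe(n_rows, row_factory=make_row):
    """Collect all rows first, then create the DataFrame in one step."""
    rows = [row_factory(i) for i in range(n_rows)]
    return pd.DataFrame(rows)


df = build_dataframe(5)
print(df.shape)  # (5, 2)
```

Whether this is worth the change depends on how many rows you generate; for a few hundred rows the difference is unlikely to matter.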

Connection to the database

Before inserting the data generated with Faker, we need to establish a connection to the database. The PyMongo library is used for this.

Python

from pymongo import MongoClient

uri = "mongodb://user:password@localhost:27017/"
client = MongoClient(uri)

From PyMongo, we import the MongoClient class.

Don’t forget to replace user, password, localhost, and the port (27017) with your own connection details. Save this code in the modules directory as base.py.
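If the user name or password contains characters such as @, : or /, they need to be percent-encoded before being placed in the URI. Here is a minimal sketch using only the standard library; build_uri and the sample credentials are made up for illustration.

```python
from urllib.parse import quote_plus


def build_uri(user, password, host="localhost", port=27017):
    """Hypothetical helper: percent-encode credentials for a MongoDB URI."""
    return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/"


uri = build_uri("user", "p@ss:word/1")
print(uri)  # mongodb://user:p%40ss%3Aword%2F1@localhost:27017/
```

The resulting string can be passed to MongoClient() in place of the hard-coded URI.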

What is multiprocessing?

Multiprocessing is a Python module that can be used to take advantage of the CPU cores available on the computer where the script is running. In Python, single-CPU use is caused by the global interpreter lock (GIL), which allows only one thread to hold control of the Python interpreter at any given time. For more information, see this blog post.

Imagine that you’re generating 60,000 records. Running the script on a single core will take longer than you might expect, since the records are generated one by one within the loop. With multiprocessing, the work is divided among the cores in use. If your CPU has 16 cores, only 15 are used, because one is left available to keep the computer from freezing, so each of those 15 cores generates 4,000 records.
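The arithmetic above can be sketched as follows. Both helpers are hypothetical and not part of the original script. Note that int(60000/num_cores) drops the remainder when the division is not exact (with 7 workers it would produce 59,997 rows in total), so this version spreads the leftover rows across the first workers.

```python
from multiprocessing import cpu_count


def worker_count():
    """Leave one core free, but never go below one worker."""
    return max(1, cpu_count() - 1)


def split_records(total, workers):
    """Split `total` records across `workers`, spreading any remainder."""
    base, extra = divmod(total, workers)
    return [base + 1 if i < extra else base for i in range(workers)]


print(set(split_records(60000, 15)))  # {4000}
print(sum(split_records(60000, 7)))   # 60000
```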

To understand better how to implement multiprocessing in Python, I recommend the following tutorials:

Generating your data

All the required modules are now ready to be imported into the main script so it’s time to create the mongodb.py script. First, import the required libraries:

Python

from multiprocessing import Pool
from multiprocessing import cpu_count

import pandas as pd

From multiprocessing, Pool() and cpu_count() are required. The Python Multiprocessing Pool class allows you to create and manage process pools in Python.

Then, import the modules previously created:

Python

from modules.dataframe import create_dataframe
from modules.base import client

Now we create the multiprocessing pool, configured to use all available CPU cores minus one. Each worker process calls the create_dataframe() function and creates a DataFrame with its share of the records (4,000 with 15 workers). Once every call has finished, the DataFrames are concatenated into a single one.

Python

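if __name__ == "__main__":
    # Leave one core free, but always keep at least one worker.
    num_cores = max(1, cpu_count() - 1)
    # Each worker builds one DataFrame; concatenate the results.
    with Pool(num_cores) as pool:
        data = pd.concat(pool.map(create_dataframe, range(num_cores)))
    # Convert the rows to a list of dicts, one per document.
    data_dict = data.to_dict('records')
    db = client["company"]
    collection = db["employees"]
    collection.insert_many(data_dict)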

After logging into the MongoDB server, we get the database and the collection where the data will be stored.

And finally, we will insert the DataFrame into MongoDB by calling the insert_many() method. All the data will be stored in a collection named employees.
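To make the concatenation and conversion steps concrete, here is a small sketch with two hand-made DataFrames standing in for the per-worker results (the names and values are invented). Passing ignore_index=True is an optional addition not used in the original script; it renumbers the rows so the combined DataFrame does not repeat index labels.

```python
import pandas as pd

# Two small frames standing in for the per-worker results.
part_a = pd.DataFrame({"first_name": ["Ana", "Bo"], "city": ["Lima", "Oslo"]})
part_b = pd.DataFrame({"first_name": ["Cy"], "city": ["Rome"]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 0.
data = pd.concat([part_a, part_b], ignore_index=True)

# One dict per row, which is the list-of-documents shape insert_many() takes.
records = data.to_dict("records")
print(records[2])  # {'first_name': 'Cy', 'city': 'Rome'}
```

The index is not part of the dictionaries produced by to_dict("records"), so the inserted documents look the same either way.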

Run the following statement to populate the database:

Shell

python mongodb.py

DataFrame creation with multiprocessing

It will take just a few seconds to generate the DataFrame with the 60,000 records, and that’s why multiprocessing was implemented.

CPU utilization on Percona Monitoring and Management

Once the script finishes, you can check the data in the database.

Shell

use company;
db.employees.count()

The count() function returns the number of documents in the employees collection. In recent versions of the MongoDB shell, count() is deprecated in favor of countDocuments().

Shell

60000

Or you can display the documents in the employees collection:

Shell

db.employees.find().pretty()

Shell

{
    "_id" : ObjectId("6363ceeeda5c972cabf558b4"),
    "first_name" : "Sherri",
    "last_name" : "Phelps",
    "job" : "Science writer",
    "company" : "House Inc",
    "address" : "06298 Mejia Streets Suite 742\nRobertland, WY 98585",
    "city" : "Thomasview",
    "country" : "Cote d'Ivoire",
    "email" : "[email protected]"
}

The code shown in this blog post can be found on my GitHub account in the data-generator repository.
