
Culture of data collection during User Research to create an AI Dataset

Source: https://blog.prototypr.io/culture-of-data-collection-during-user-research-to-create-an-ai-dataset-b3fbc30a4d92



There are many debates and solutions on how to store the results of UX research. However you solve this problem, the modern world presents us with new challenges. Namely, research results in modern companies are part of a company-wide knowledge base. This knowledge base should be able to answer a manager's questions quickly, ideally without a human intermediary. That dictates new requirements for processing and storing responses from questionnaires, customer complaints, insights from interviews, usability metrics, and so on. This article does not pretend to be exhaustive or even instructional. Rather, it works through examples of how some typical problems can be solved. In short, I wanted to create a language-model-based chatbot (or even an AGI) with easy access to all user experience data.

Preparation of data

You may find yourself in one of two situations. Either you have a huge amount of disparate research artifacts, such as interview results, quotes from technical support complaints, and reports from external agencies. Or you have very little data, and you're working not with “the data you need” but with “the data you somehow managed to collect.”

We’ll leave the GDPR-type rules for processing and storing personal data out of this article. But be aware of the requirements for handling personal data in the countries where you work.

Important idea: if you don’t have a specialist in your team preparing all the data, you don’t have machine learning.

There is no point in training an ML model on poor-quality data; it is better to have UX experts review it first. So if you are not sure about the quality of the insights, it is better not to add them to the dataset. And if you are sure about their quality, break them down into jobs/pains/gains.

Let's say you've conducted SUS, CSI, and SUPR-Q surveys for your product and its competitors, and you have the following results:

2022
+--------+-------------+--------------+--------------+
| Metric | Our product | Competitor 1 | Competitor 2 |
+--------+-------------+--------------+--------------+
| SUS | 67 | 35 | 84 |
| CSI | 4 | 3 | 4 |
| SUPR-Q | 75 | 47 | 73 |
+--------+-------------+--------------+--------------+

2023
+--------+-------------+--------------+--------------+
| Metric | Our product | Competitor 1 | Competitor 2 |
+--------+-------------+--------------+--------------+
| SUS | 71 | - | 64 |
| CSI | - | - | 4 |
| SUPR-Q | 64 | 53 | 85 |
+--------+-------------+--------------+--------------+

Missing values can significantly reduce prediction accuracy, so this problem gets priority. From a machine learning perspective, an estimated or approximated value is more useful to the algorithm than no value at all. Even if you don't have a value, there are ways to "guess" the missing one or work around the gap (a small sketch follows the list below):

  • Replace missing values with dummy values, e.g. n/a for categorical values or 0 for numeric values.
  • Replace missing numeric values with the mean (or median).
  • Replace missing categorical values with the most frequent category.
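
Here is a minimal sketch of those options in pandas, with the 2023 table above typed in by hand (None marks a missing score); scikit-learn's SimpleImputer covers the same strategies if you prefer a pipeline step:

import pandas as pd

# 2023 survey results; None marks a missing score
scores = pd.DataFrame(
    {"our_product": [71, None, 64],
     "competitor_1": [None, None, 53],
     "competitor_2": [64, 4, 85]},
    index=["SUS", "CSI", "SUPR-Q"])

# Option 1: dummy value for every gap
dummy_filled = scores.fillna(0)

# Option 2: fill each metric with its mean across the surveyed products
mean_filled = scores.apply(lambda row: row.fillna(row.mean()), axis=1)

print(mean_filled.round(1))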

For a more practical example, let's take a typical task in an Oracle database: if the value Y exists in column X, that row needs to be updated, otherwise a new row needs to be inserted. The column is not a key, so an ON DUPLICATE KEY-style upsert will not work here. One solution is a small PL/SQL block:

BEGIN
  UPDATE my_table
     SET my_column = 'X'
   WHERE my_column = 'Y';

  IF SQL%ROWCOUNT = 0 THEN
    INSERT INTO my_table (my_column) VALUES ('X');
  END IF;
END;

Usually, the answers to these types of tasks are easily found in the documentation.

Once the data is prepared on your side, it's time to use it. Sooner or later you will accumulate a lot of CSV/JSON files, since most products can export JSON or CSV, and you will need to push them into, say, MSSQL. Engineers, ChatGPT, and courses may tell you to use Pandas + Python, and they will be wrong. With Python you only need to copy the file to the database server and run a SQL query via pyodbc that uses MSSQL's built-in tools for working with files and JSON (BULK INSERT, OPENJSON).
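
A rough sketch of that approach, assuming the file has already been copied to a path visible to the SQL Server instance; the connection string, table names, and file paths below are hypothetical:

import pyodbc

# hypothetical connection string; adjust driver, host, and credentials
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=db-host;DATABASE=ux;UID=loader;PWD=secret")
cur = conn.cursor()

# CSV: let SQL Server parse the file itself with BULK INSERT
cur.execute("""
    BULK INSERT dbo.survey_results
    FROM 'C:\\data\\survey_results.csv'
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n');
""")

# JSON: read the file server-side and shred it with OPENJSON
cur.execute("""
    INSERT INTO dbo.interview_insights (respondent_id, insight)
    SELECT respondent_id, insight
    FROM OPENROWSET(BULK 'C:\\data\\insights.json', SINGLE_CLOB) AS f
    CROSS APPLY OPENJSON(f.BulkColumn)
         WITH (respondent_id INT, insight NVARCHAR(MAX));
""")

conn.commit()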

A conceptual, and possibly intractable, problem in the above pipeline is the presence of CSV. Try to avoid CSV whenever possible: it is bad because of the delimiter problem. Since I call CSV bad, I should propose a solution. I switched to Frictionless Data, which packs the CSV file into a ZIP container together with a manifest file holding the metadata. That immediately tells you which delimiters and field types are present. The resulting format is called a data package and is used by some scientific data repositories.

An alternative is the Apache Parquet format, an open format for columnar data storage. Parquet files store and compress each column separately, and their compression level is much higher than that of CSV files, because a column usually contains only a few distinct values that can be encoded as dictionaries instead of repeated strings. This also solves the problem of CSV file size.
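
As a small illustration of the switch, assuming pandas with the pyarrow engine installed and a hypothetical export file:

import pandas as pd

# hypothetical CSV export from a survey tool
df = pd.read_csv("survey_export.csv", sep=";")

# columnar storage, dictionary encoding, and compression in one call
df.to_parquet("survey_export.parquet", engine="pyarrow", compression="zstd")

# reading it back preserves column types, with no delimiter guessing involved
df2 = pd.read_parquet("survey_export.parquet")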

You don’t have much data

If you don't have a lot of data, don't despair. Not all ML requires large datasets. First of all, it's far from certain that you need ML at all: you can often get the information you need with simple regular expressions. For example, in a «lifetime-NDA-project» I managed to build only a small dataset (135 samples), but I was still able to get a semantic segmentation of actions that predicted errors made by bank operators.
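
As a toy illustration of the "regular expressions before ML" point, with made-up support tickets:

import re

# made-up support tickets; in reality these would come from your ticketing system
tickets = [
    "Order #10482: refund not received after 14 days",
    "Cannot change delivery address for order #10233",
    "Order #10555 arrived damaged, want a refund",
]

# no model is needed to count refund complaints and pull out the order ids
refund_pattern = re.compile(r"#(\d+).*?\brefund\b", re.IGNORECASE)

refund_orders = []
for ticket in tickets:
    match = refund_pattern.search(ticket)
    if match:
        refund_orders.append(match.group(1))

print(refund_orders)  # ['10482', '10555']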

When you give a dataset to ML, the data should be balanced. If you have 200 kittens and only 2 puppies in your dataset, it is not balanced. In the language of CX research: if your dataset has 1,000 insights about the delivery process and 3 insights about the item-return process, that is a clear imbalance. Classification models try to put data into different buckets, and in an unbalanced dataset one bucket makes up a large portion of the training data (the majority class) while the other is underrepresented (the minority class).
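
The article does not prescribe a fix here, but two common mitigations are class weighting and resampling; a scikit-learn sketch with hypothetical labeled insights:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# hypothetical labeled insights: 'delivery' is the majority class, 'returns' the minority
insights = pd.DataFrame({
    "text_length": [120, 80, 95, 60, 150, 40, 200, 75],
    "topic": ["delivery"] * 6 + ["returns"] * 2,
})

# option 1: weight classes inversely to their frequency
model = LogisticRegression(class_weight="balanced")
model.fit(insights[["text_length"]], insights["topic"])

# option 2: upsample the minority class until the buckets are comparable
minority = insights[insights["topic"] == "returns"]
majority = insights[insights["topic"] == "delivery"]
balanced = pd.concat([
    majority,
    resample(minority, replace=True, n_samples=len(majority), random_state=42),
])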

You can also try to train gradient boosting or a random forest on a small amount of data. A conditional linear dependence can be fitted on just two points! In other words, you can take UMUX Lite results plus sales results for several years and fit a model on them. Dependencies can be found even in a small dataset, if it is objective enough. Just remember that a small number of objects increases the risk of overfitting, so keep the number of parameters low.
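
A sketch of that idea with invented numbers (the UMUX Lite scores and revenue figures below are purely illustrative), keeping the forest deliberately small and shallow:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# invented quarterly data: UMUX Lite score vs. revenue, only a handful of points
umux_lite = np.array([[58], [61], [65], [70], [72], [74], [78], [80]])
revenue_musd = np.array([1.1, 1.3, 1.2, 1.6, 1.8, 1.7, 2.1, 2.3])

# few shallow trees: a small model to limit overfitting on a small dataset
model = RandomForestRegressor(n_estimators=20, max_depth=2, random_state=0)
model.fit(umux_lite, revenue_musd)

print(model.predict([[75]]))  # rough revenue estimate for a UMUX Lite score of 75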

As for storage, I stick to the simplest possible approach: COPY TO from Apache Cassandra and COPY FROM into PostgreSQL, and everything works at near real-time speed.

If you want to experiment, don't forget about synthetic data, i.e. data generated by a simulator, but only if the real data cannot be collected. Synthetic data can differ from real data, so it is common to mix real data into the synthetic set. A convenient service for synthetics is sdv.dev.

Alternatively, you can create an artificial dataset yourself. In the code below we create a dataset with 9 informative features and 6 features that are random linear combinations of those 9.

from sklearn.datasets import make_classification

# 10,000 samples, 15 features: 9 informative, 6 redundant (linear
# combinations of the informative ones), 3 imbalanced classes
X, y = make_classification(n_samples=10_000,
                           n_features=15,
                           n_informative=9,
                           n_redundant=6,
                           n_classes=3,
                           weights=[0.34, 0.42, 0.24])

Got lots of data

Your ability to experiment with ML depends on whether you have been collecting data for years. Some organizations have stored data so successfully for decades that they now need a truck to move it to the cloud, because there is not enough Internet bandwidth.

When you have a lot of data, you need infrastructure to store and process it. The more data you have, the better it is for making informed decisions, and the more expensive it is to maintain and store. At this point the main problem becomes maintenance rather than immediate problem solving.

The data needs to be put into an appropriate form. There is an unpleasant nuance here: affordable outsourced annotators for manual data labeling cost the company 10–15 times less than a single skilled developer. So writing algorithms to label data automatically, with the help of a developer/data scientist/engineer, is often more expensive than manually labeling an additional batch of data when needed. This leaves a company with a lot of unlabeled data and only a little labeled data, and it is the labeled part that the AI has to be trained on. It may sound like I am suggesting that you label all the terabytes of data and line them up in neat arrays: pull all the reports from 20 years of the product and break them down into machine-digestible records. After all, we need «big data»! And let's also borrow pseudo-labeling from Kaggle: train on the small labeled set and use the model to label the rest. Or let's use pre-labeling. Ehhh… Overengineering. The very concept of this approach is wrong.

Yes, of course you need to collect all the data you can. But if you are preparing a dataset with a specific task in mind, it is better to reduce its size; this simply saves resources. For example, banks use SCD2 (type 2 slowly changing dimensions) in their warehouses, where storing uninformative rows is very expensive. Once the number of rows passes a billion, it becomes critical.

For instance, you may want to predict which customers tend to make big purchases from your online store. A shopper’s age, location, and gender can help you to predict purchases better than their credit card number or which country’s VPN they used two years ago. But it works the other way around, too. Think about what other values you’ll need to collect to uncover other dependencies. For example, adding bounce rate can improve the accuracy of conversion prediction.

All this big data will live in different departments, and even at different collection points within the same department. The marketing department may have access to the CRM, but customers in the CRM are not linked to web analytics, and all of the data can be changed retroactively. In addition, CRUD rights are differentiated at the level of databases and tables. Reports from the UX research department are uploaded to Confluence as presentations, designers keep their democratized questionnaires in Airtable, and each study has its own specifics. The B2B designers keep everything in EnjoyHQ, the B2C designers in Dovetail. If you have many channels of acquisition, maintenance, and retention, it is not always possible to unify all data flows in a central repository, but in most cases it can be implemented to a desirable extent.


If you have money but no engineers, and the data is stored in a bunch of different services, you can look at out-of-the-box solutions from OWOX BI, Stitch, and many others; it is a typical end-to-end analytics task. If you do have engineers, and let's say you manage to get the right data from the different departments, it is very likely that the engineers will instinctively put all the historical data into one dataset. If we're talking about Pandas, that would be a MultiIndex, which is not a bad thing.
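
A small sketch of what that usually looks like, with hypothetical per-year exports of the same survey:

import pandas as pd

# hypothetical yearly exports of the same survey
df_2022 = pd.DataFrame({"metric": ["SUS", "SUPR-Q"], "our_product": [67, 75]})
df_2023 = pd.DataFrame({"metric": ["SUS", "SUPR-Q"], "our_product": [71, 64]})

# one dataset, with the year kept as the outer level of a MultiIndex
history = pd.concat({"2022": df_2022, "2023": df_2023}, names=["year", "row"])

print(history.loc["2023"])  # slice a single year back out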

After merging the desired data blocks, data cleansing is required. If you are using an ML-as-a-service platform, data cleansing can be automated: Azure Machine Learning lets you choose from the available techniques, and Amazon ML does it without your intervention at all. As you can see, a rather complex processing pipeline is piling up.

It can also get more complicated: if you have a really large amount of data, you might consider putting it into a data warehouse. Warehouses are typically created for structured (SQL) records stored in standard table formats, and they are a great solution for large volumes; even 50 billion records per day is not a problem for a DWH.

The second option is data lakes and ELT. Data lakes are storages that can hold both structured and unstructured data, including images, video, audio, PDFs, and so on. Perfect for UX research results. Even structured data is not transformed before being stored: it is loaded in its original form, and decisions about how to use and process it are made as needed. This approach is called extract, load, and transform = ELT.

A lot of this can sound very complicated, and it is true that sometimes it is better to simplify. Avoid overengineering. In one project I was uploading CSVs to the server (more than 500,000 files per day). The output was 15+ tables with up to 8,000,000 rows, which Excel could not digest. Since I like things simple and clear, and I am a designer, not an engineer, I stuffed everything into one wide flat table (ClickHouse).
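
A sketch of what such a wide, denormalized table might look like, assuming the clickhouse-driver Python package and hypothetical column names:

from clickhouse_driver import Client

client = Client(host="localhost")  # hypothetical ClickHouse instance

# one wide flat table instead of a constellation of joins
client.execute("""
    CREATE TABLE IF NOT EXISTS ux_events_flat (
        event_date   Date,
        user_id      String,
        source_file  String,
        metric       String,
        value        Float64,
        raw_comment  String
    )
    ENGINE = MergeTree()
    ORDER BY (event_date, user_id)
""")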

It has been worse. During another project the data changed extremely frequently and there was no time for proper planning, so I simply created a table with a JSON field. Every time the source table changed, I converted the row to JSON and stored it in the database together with the date.
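
A minimal sketch of that pattern, here with PostgreSQL and psycopg2 (the table, columns, and connection details are hypothetical):

import json
from datetime import datetime, timezone

import psycopg2

conn = psycopg2.connect("dbname=ux user=loader")  # hypothetical connection
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS raw_snapshots (
        captured_at timestamptz NOT NULL,
        payload     jsonb       NOT NULL
    )
""")

# whatever shape the source row has today, it still fits into the JSON column
row = {"respondent": 42, "sus": 71, "notes": "checkout flow is confusing"}
cur.execute(
    "INSERT INTO raw_snapshots (captured_at, payload) VALUES (%s, %s)",
    (datetime.now(timezone.utc), json.dumps(row)),
)
conn.commit()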

Storage structure for AI Chat

I can’t give you an ideal structure for storing data. No one can do that but you. It depends on many factors, such as how often data is changed and added, the amount of data, the types of data. The storage structure should be designed according to your needs. Personally, I try to stick to a predetermined number of fields for the final table.

Also, data warehousing is not something you can just delegate. If you simply put raw data into a database and then give analysts or ChatGPT the task of “do what you want, but tell me how NPS is related to the current problems”, you will get a circus instead of normal work with data. Trust me, after six months no one will know where anything is or how it is being used. There will be insane data duplication, inefficient storage (and therefore wasted resources), and mistakes in reports. Accordingly, the underlying data structure should be controlled by developers, and the process of collecting that data should be controlled as well. Creating a data governance culture in an organization is probably the hardest part of the whole process described in this article.

Here's my template for purchase events (a minimal sketch of it as a data structure follows the list):
- userId
- itemId
- eventType
- timestamp
And additional attributes that are optional:
- price
- discount
- quantity
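
As a sketch, the same template expressed as a Python dataclass; the field names follow the template above (in snake_case) and the types are my assumption:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class PurchaseEvent:
    # required attributes
    user_id: str
    item_id: str
    event_type: str              # e.g. "view", "add_to_cart", "purchase"
    timestamp: datetime
    # optional attributes
    price: Optional[float] = None
    discount: Optional[float] = None
    quantity: Optional[int] = None

event = PurchaseEvent("u-42", "sku-1001", "purchase", datetime.now(), price=19.9, quantity=2)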

And yes, you will say that it is impossible to predict what data we will need a year from now. There may be many new types of events, appearing unpredictably and with different attributes, and it looks like your dataset will turn to mush sooner or later. In today's unpredictable world this is quite possible, but there has to be a standard. For reference, take a look at the standard at cocodataset.org.

And let’s get to work

With the data sorted out, it is time to examine it. Since we decided at the beginning of this article not to consider any legal restrictions, we are (theoretically) free to play with many things. After all, as designers or managers, we don't want to write complex database queries. Tell me honestly: when you read about all these databases, CSVs, JSONs, DWHs and so on, did you get bored? I would not be surprised, because it is boring. But it does get rewarded.

In all likelihood, the expected UX is a question/answer format, as mentioned at the beginning of this article. The first thing that comes to mind is to use the ChatGPT API to generate a list of question/answer records from the PDF/CSV/other files I provide. But lawyers will beat us up for that, so we can limit ourselves to using Haystack or GPT-Neo 2.7B to build a Q&A system.
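
A very small sketch of the GPT-Neo route via the Hugging Face transformers pipeline; the prompt format and the insight text are my own illustration, the 2.7B model needs a beefy machine, and Haystack would add proper retrieval over a document store on top of this:

from transformers import pipeline

# GPT-Neo 2.7B runs locally, so no research data leaves your infrastructure
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-2.7B")

context = ("Insight: 7 of 12 interviewed users abandoned checkout because "
           "the delivery cost was only shown at the last step.")
question = "What is the main reason users abandon checkout?"

prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
result = generator(prompt, max_new_tokens=40, do_sample=False)
print(result[0]["generated_text"])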

The above also affects how we plan our user research from now on. Are the current findings suitable for enriching the dataset? This will shape how we frame our interview questions, and may even determine the purpose of a study: to enrich the dataset with qualitative insights about a particular scenario and get a better balance in the data. The most important thing to remember is that as designers we still need to think in terms of people, not datasets. We are still not engineers, even if some of us are better than some engineers at engineering.

