21

10 Ways To Mess Up Your Data

 4 years ago
source link: https://www.tuicool.com/articles/Ufemam6
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Don’t: Combine data types in one column

Most commonly seen as: “51 (22%)”. Stop. Make a second column for percentages, or put them in another table, or just calculate them on the fly when you need them. Don’t mix these things together, since you’ll never be able to perform operations on them.

Sigh: Randomly vary column naming conventions

If one table says “Unit Price” another should not say “Price Per Unit”. Another common offender: randomly using abbreviations, or not, or using different abbreviations, i.e. “Student Admissions” or “Stu Adm” or “StdAdmiss”. Please stop.

Help: Randomly change capitalization and case rules

As above, don’t switch between “Price per Unit”, “Price Per Unit”, “price_per_unit” and “price.per.unit”. Pick something and stick with it.

Please no: Place spaces and weird characters in column names

No parentheses, no slashes, no hashes. For the love of god no commas or tabs. Ideally, avoid any whitespace, so I don’t have to quote every column name.

Why? Label a column with “new” or “old”

Someone, at some time, thought this was a good idea. They also maybe thought they would come back in a year and get rid of the old column. Well now it’s too late, because fifty queries have already been written using “boss_custom_roi_formula_new” for critical reports. What happens next year? “boss_custom_roi_formula_newnew2updated”.

Use a specific date, if that matters. Use some other versioning or control code. Or update the old calculation and adjust the metadata appropriately. Just don’t do this.

Groan: Remove NAs because they’re ugly

In count data, it can be common and legitimate to impute a zero from absent data. But in most other cases, the data really are not known, and must be handled specially. In some calculations they would be included, in some not; don’t decide now, at the data stage — let later users, including yourself, handle the NAs appropriately. You’re not doing them a favor by stripping them out.

Ugh: Put percentage symbols or unit symbols in your data column

Then they don’t get treated as numbers, and you’ll have to parse it out into two values later. Encode 50% as 0.5, and so on.

Newbie tip: Forget to specify units

The above doesn’t mean you should ignore units entirely. For instance, you might have a “length” column. Length in what? Meters? Inches? Context can’t always tell you, and you won’t remember in three years. Store it in another column or in the metadata.

Stop: Create overlapping and vague bins

Say you export data for a histogram and bin the data, like this:

Bin 1: 1–10

Bin 2: 10–20

Bin 3: 20–30

Now what the heck bin does 20 go into? Bin #2 or bin #3? I have no idea! Specify the exact included values, or the left/right handedness, or use bracket notation.

Worst for last: Dump the data and run

What I mean is, don’t save a random flatfile with some values somewhere, with no indication of what it is, where it came from, who made it, or how it could be updated in the future.

You need documentation, probably in the form of metadata. Somewhere there should be some indication of:

  • The data source . Not just the entity responsible for the data, but how you accessed it and when. That access can take various forms. Did you make a special request of someone? Was it published in a standard report? Or did you run a SQL query? If you ran that query, where is it?
  • Describe what assumptions and limitations and definitions apply to these data. At least identify anything ambiguous that could trip someone else up. E.g., the data say “sales” but the table is only about sales in the Eastern division — make that crystal clear.
  • When the data were recorded (separate from when you accessed them). Is the number of users recorded at the end of the calendar year, or the beginning? When you did a survey? The fiscal year end? This may vary between types of data, and could matter later!

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK