10 Ways To Mess Up Your Data
source link: https://www.tuicool.com/articles/Ufemam6
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Don’t: Combine data types in one column
Most commonly seen as: “51 (22%)”. Stop. Make a second column for percentages, or put them in another table, or just calculate them on the fly when you need them. Don’t mix these things together, since you’ll never be able to perform operations on them.
Sigh: Randomly vary column naming conventions
If one table says “Unit Price” another should not say “Price Per Unit”. Another common offender: randomly using abbreviations, or not, or using different abbreviations, i.e. “Student Admissions” or “Stu Adm” or “StdAdmiss”. Please stop.
Help: Randomly change capitalization and case rules
As above, don’t switch between “Price per Unit”, “Price Per Unit”, “price_per_unit” and “price.per.unit”. Pick something and stick with it.
Please no: Place spaces and weird characters in column names
No parentheses, no slashes, no hashes. For the love of god no commas or tabs. Ideally, avoid any whitespace, so I don’t have to quote every column name.
Why? Label a column with “new” or “old”
Someone, at some time, thought this was a good idea. They also maybe thought they would come back in a year and get rid of the old column. Well now it’s too late, because fifty queries have already been written using “boss_custom_roi_formula_new” for critical reports. What happens next year? “boss_custom_roi_formula_newnew2updated”.
Use a specific date, if that matters. Use some other versioning or control code. Or update the old calculation and adjust the metadata appropriately. Just don’t do this.
Groan: Remove NAs because they’re ugly
In count data, it can be common and legitimate to impute a zero from absent data. But in most other cases, the data really are not known, and must be handled specially. In some calculations they would be included, in some not; don’t decide now, at the data stage — let later users, including yourself, handle the NAs appropriately. You’re not doing them a favor by stripping them out.
Ugh: Put percentage symbols or unit symbols in your data column
Then they don’t get treated as numbers, and you’ll have to parse it out into two values later. Encode 50% as 0.5, and so on.
Newbie tip: Forget to specify units
The above doesn’t mean you should ignore units entirely. For instance, you might have a “length” column. Length in what? Meters? Inches? Context can’t always tell you, and you won’t remember in three years. Store it in another column or in the metadata.
Stop: Create overlapping and vague bins
Say you export data for a histogram and bin the data, like this:
Bin 1: 1–10
Bin 2: 10–20
Bin 3: 20–30
Now what the heck bin does 20 go into? Bin #2 or bin #3? I have no idea! Specify the exact included values, or the left/right handedness, or use bracket notation.
Worst for last: Dump the data and run
What I mean is, don’t save a random flatfile with some values somewhere, with no indication of what it is, where it came from, who made it, or how it could be updated in the future.
You need documentation, probably in the form of metadata. Somewhere there should be some indication of:
- The data source . Not just the entity responsible for the data, but how you accessed it and when. That access can take various forms. Did you make a special request of someone? Was it published in a standard report? Or did you run a SQL query? If you ran that query, where is it?
- Describe what assumptions and limitations and definitions apply to these data. At least identify anything ambiguous that could trip someone else up. E.g., the data say “sales” but the table is only about sales in the Eastern division — make that crystal clear.
- When the data were recorded (separate from when you accessed them). Is the number of users recorded at the end of the calendar year, or the beginning? When you did a survey? The fiscal year end? This may vary between types of data, and could matter later!
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK