Parsing and Validating Dates in Awk

source link: https://blog.jpalardy.com/posts/parsing-and-validate-dates-in-awk/

April 13, 2022

I recently stumbled on something that I thought would be easy: parsing and validating dates in Awk.

Some guidelines:

  • ISO 8601 format, e.g. 2022-04-12
  • months between 01 and 12
  • days between 01 and 28-31 (depending on the month, leap year)

A quick lookup of GNU Awk’s time functions points to mktime("YYYY MM DD HH MM SS").

Let’s try it:

> echo "2022 04 12" | awk '{ print mktime($0 " 0 0 0") }'
1649746800

# removed hyphens for now, will fix in solution below
# padded HH MM SS with 0 0 0, to keep mktime happy

Looking good! How about something wrong?

> echo "not a date" | awk '{ print mktime($0 " 0 0 0") }'
-1

Oh yeah! What about a “bad” date?

> echo "2022 44 78" | awk '{ print mktime($0 " 0 0 0") }'
1760684400

Wait, what?! Oh no…

Going Full Circle

If invalid dates returned -1, we would be done by now.

1760684400 is 2025-10-17.

mktime doesn’t reject out-of-range values; it normalizes them. Month 44 rolls over into August 2025 (44 = 3 × 12 + 8), and day 78 counted from the start of August lands on October 17.
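The rollover is easier to see on smaller numbers. Here is a minimal sketch, assuming an awk that provides mktime and strftime (GNU Awk does; they are not part of POSIX awk). The rollover helper is a hypothetical name of mine, and TZ=UTC is pinned only so the output doesn't depend on the local time zone:

```shell
# Hypothetical helper: format whatever date mktime settles on.
# Out-of-range fields are normalized, not rejected.
rollover() {
  TZ=UTC awk -v spec="$1" 'BEGIN { print strftime("%F", mktime(spec " 0 0 0")) }'
}

rollover "2022 13 01"   # month 13 becomes January of the next year: 2023-01-01
rollover "2022 02 30"   # February 30 spills into March: 2022-03-02
rollover "2022 44 78"   # the example from above: 2025-10-17
```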

When I looked at the other time functions, there didn’t seem to be anything that helped either.

The eureka moment: take whatever timestamp mktime returns and format it back to ISO 8601 with strftime. If the input and output dates are different, the date is wrong!

# good example
> echo "2022 04 12" | awk '{ d = mktime($0 " 0 0 0"); print strftime("%F", d) }'
2022-04-12

Parsing the date and formatting it back yields the same date.

# bad example
> echo "2022 44 78" | awk '{ d = mktime($0 " 0 0 0"); print strftime("%F", d) }'
2025-10-17

Sidenote: I’m using %F to format dates. man 3 strftime says:

%F     Equivalent to %Y-%m-%d (the ISO 8601 date format)

Test Script

Here are my test cases:

> cat test.txt
2022-04-12 -- regular day
bad_date   -- not even a date
1981-11-20 -- 1980s
2022-44-78 -- nonsense month/day
2022-09-30 -- september has 30 days
2022-09-31 -- but not 31 ...
2016-02-28 -- leap year: february has 28 days
2016-02-29 -- leap year: even 29 days
2016-02-30 -- leap year: but not 30 days
2000-02-28 -- special leap year: february has 28 days
2000-02-29 -- special leap year: even 29 days
2000-02-30 -- special leap year: but not 30 days
2001-02-28 -- regular year: february has 28 days
2001-02-29 -- regular year: but not 29 days
2001-02-30 -- regular year: but not 30 days
1965-04-12 -- past, before 1970
1935-04-12 -- past, before 1970

The Awk script:

# hyphens now removed
> cat test.awk
{
  date = mktime(gensub("-", " ", "g", $1) " 0 0 0")
  if (strftime("%F", date) != $1) {
    print "bad: ", $0
    next
  }
  print "good:", $0
}

Results:

> awk -f test.awk test.txt
good: 2022-04-12 -- regular day
bad:  bad_date   -- not even a date
good: 1981-11-20 -- 1980s
bad:  2022-44-78 -- nonsense month/day
good: 2022-09-30 -- september has 30 days
bad:  2022-09-31 -- but not 31 ...
good: 2016-02-28 -- leap year: february has 28 days
good: 2016-02-29 -- leap year: even 29 days
bad:  2016-02-30 -- leap year: but not 30 days
good: 2000-02-28 -- special leap year: february has 28 days
good: 2000-02-29 -- special leap year: even 29 days
bad:  2000-02-30 -- special leap year: but not 30 days
good: 2001-02-28 -- regular year: february has 28 days
bad:  2001-02-29 -- regular year: but not 29 days
bad:  2001-02-30 -- regular year: but not 30 days
good: 1965-04-12 -- past, before 1970
good: 1935-04-12 -- past, before 1970
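The same round-trip check can be wrapped in a function for reuse. This is only a sketch under a few assumptions of mine: the names validate_dates and is_valid_date are hypothetical, gsub on a copy stands in for gensub so the code isn't tied to GNU Awk, and a shape check runs first so that only zero-padded YYYY-MM-DD strings reach mktime. TZ=UTC is pinned as a precaution, since in a zone whose daylight-saving switch happens at midnight, local midnight may not exist on that day:

```shell
# Hypothetical wrapper around the round-trip check.
validate_dates() {
  TZ=UTC awk '
    function is_valid_date(s,    spec) {
      # Shape check: exactly YYYY-MM-DD, zero-padded.
      if (s !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/) return 0
      spec = s
      gsub(/-/, " ", spec)
      # Round trip: a date that mktime had to normalize
      # no longer matches the input string.
      return strftime("%F", mktime(spec " 0 0 0")) == s
    }
    {
      label = is_valid_date($1) ? "good:" : "bad: "
      print label, $0
    }
  '
}

printf '%s\n' 2016-02-29 2001-02-29 2022-44-78 bad_date | validate_dates
```

With the four sample inputs, only 2016-02-29 should be reported as good.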

I was surprised that dates before 1970 (the Unix epoch) also worked! Their mktime values are negative:

> echo "1935 04 12" | awk '{ print $0, "=>", mktime($0 " 0 0 0")}'
1935 04 12 => -1095782400

> echo "1969 12 31" | awk '{ print $0, "=>", mktime($0 " 0 0 0")}'
1969 12 31 => -57600

> echo "1970 01 01" | awk '{ print $0, "=>", mktime($0 " 0 0 0")}'
1970 01 01 => 28800
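Note that mktime interprets its input as local time, so these numbers depend on the machine's time zone. 28800 seconds is 8 hours, which is consistent with the outputs above coming from a zone 8 hours behind UTC. A small sketch to make that visible, where epoch_at_midnight is a hypothetical helper and PST8 is a POSIX-style TZ string for a fixed UTC-8 offset:

```shell
# Hypothetical helper: midnight of the given date, as seen from a time zone.
epoch_at_midnight() {
  TZ="$1" awk -v spec="$2" 'BEGIN { print mktime(spec " 0 0 0") }'
}

epoch_at_midnight UTC  "1970 01 01"   # 0: the epoch itself
epoch_at_midnight PST8 "1970 01 01"   # 28800: midnight arrives 8 hours later
epoch_at_midnight UTC  "1969 12 31"   # -86400: one day before the epoch
```

The round-trip validation is unaffected, as long as mktime and strftime both run under the same time zone.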
