
Big Data-4: Webserver log analysis with RDDs, Pyspark, SparkR and SparklyR


(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

“There’s something so paradoxical about pi. On the one hand, it represents order, as embodied by the shape of a circle, long held to be a symbol of perfection and eternity. On the other hand, pi is unruly, disheveled in appearance, its digits obeying no obvious rule, or at least none that we can perceive. Pi is elusive and mysterious, forever beyond reach. Its mix of order and disorder is what makes it so bewitching. ” 

From  Infinite Powers by Steven Strogatz

Anybody who wants to be “anybody” in Big Data must necessarily be able to work with both large structured and unstructured data. Log analysis is critical in any enterprise, and logs are usually unstructured. As I mentioned in my previous post Big Data: On RDDs, Dataframes, Hive QL with Pyspark and SparkR-Part 3, RDDs are typically used to handle unstructured data. Spark provides the Dataframe abstraction over RDDs, which performs better because it is optimized by the Catalyst optimization engine. Nevertheless, it is important to be able to process data with raw RDDs. This post is a continuation of my 3 earlier posts on Big Data, namely

1. Big Data-1: Move into the big league:Graduate from Python to Pyspark

2. Big Data-2: Move into the big league:Graduate from R to SparkR

3. Big Data: On RDDs, Dataframes,Hive QL with Pyspark and SparkR-Part 3

This post uses publicly available Webserver logs from NASA. The logs are for the months Jul 95 and Aug 95 and are a good place to start unstructured text analysis/log analysis. I highly recommend parsing these publicly available logs with regular expressions. It is only when you do this that the truth of Jamie Zawinski’s pearl of wisdom

“Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.” – Jamie Zawinski

hits home. I spent many hours struggling with regex!!

For the RDD part of this post, I referred to Dr. Fisseha Berhane’s blog post Webserver Log Analysis, and for the Pyspark part, to the Univ. of California Specialization Big Data Analysis with Apache Spark, which I had done 3 years back. Once I had played around with the regex for RDDs and PySpark, I managed to get the SparkR and SparklyR versions to work.

The notebooks used in this post have been published and are available at

  1. logsAnalysiswithRDDs
  2. logsAnalysiswithPyspark
  3. logsAnalysiswithSparkRandSparklyR

You can also download all the notebooks from Github at WebServerLogsAnalysis

An essential and unavoidable aspect of Big Data processing is the need to process unstructured text. Web server logs are one such area, requiring Big Data techniques to process massive amounts of data. The Common Log Format, also known as the NCSA Common log format, is a standardized text file format used by web servers when generating server log files. Because the format is standardized, the files can be readily analyzed.

A publicly available set of webserver logs is the NASA-HTTP Web server logs. This is a good dataset to play around with to get familiar with handling web server logs. The logs can be accessed at NASA-HTTP

Description: These two traces contain two months’ worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida.

Format: The logs are an ASCII file with one line per request, with the following columns:

- host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.

- timestamp in the format “DAY MON DD HH:MM:SS YYYY”, where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.

- request given in quotes.

- HTTP reply code.

- bytes in the reply.
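
For instance, a single log line can be broken into these fields with a simple named-group regular expression. This is only a minimal sketch for illustration; the group names below are my own labels and are not part of the NCSA format.

import re

# One sample line from the NASA logs, parsed into its Common Log Format fields
sample = 'in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839'
pattern = (r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
           r'"(?P<method>\w+) (?P<path>\S+) (?P<protocol>[^"]+)" '
           r'(?P<status>\d{3}) (?P<bytes>\S+)$')
m = re.search(pattern, sample)
print(m.groupdict())
# {'host': 'in24.inetnebr.com', 'timestamp': '01/Aug/1995:00:00:01 -0400', 'method': 'GET',
#  'path': '/shuttle/missions/sts-68/news/sts-68-mcc-05.txt', 'protocol': 'HTTP/1.0',
#  'status': '200', 'bytes': '1839'}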

1 Parse Web server logs with RDDs

1.1 Read NASA Web server logs

Read the log files from NASA for the months Jul 95 and Aug 95.

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("Spark-Logs-Handling").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)

sqlcontext = SQLContext(sc)
rdd = sc.textFile("/FileStore/tables/NASA_access_log_*.gz")
rdd.count()
Out[1]: 3461613

1.2 Check content

Check the logs to identify the parsing rules required for the logs

i = 0

for line in rdd.sample(withReplacement = False, fraction = 0.00001, seed = 100).collect():
    i=i+1
    print(line)
    if i >5:
      break

ix-stp-fl2-19.ix.netcom.com – – [03/Aug/1995:23:03:09 -0400] “GET /images/faq.gif HTTP/1.0” 200 263

slip183-1.kw.jp.ibm.net – – [04/Aug/1995:18:42:17 -0400] “GET /shuttle/missions/sts-70/images/DSC-95EC-0001.gif HTTP/1.0” 200 107133

piweba4y.prodigy.com – – [05/Aug/1995:19:17:41 -0400] “GET /icons/menu.xbm HTTP/1.0” 200 527

ruperts.bt-sys.bt.co.uk – – [07/Aug/1995:04:44:10 -0400] “GET /shuttle/countdown/video/livevideo2.gif HTTP/1.0” 200 69067

dal06-04.ppp.iadfw.net – – [07/Aug/1995:21:10:19 -0400] “GET /images/NASA-logosmall.gif HTTP/1.0” 200 786

p15.ppp-1.directnet.com – – [10/Aug/1995:01:22:54 -0400] “GET /images/KSC-logosmall.gif HTTP/1.0” 200 1204

1.3 Write the parsing rule for each of the fields

  • host
  • timestamp
  • path
  • status
  • content_bytes

1.21 Get IP address/host name

This regex matches at the start of the log line and includes any non-whitespace characters.

import re

rslt=(rdd.map(lambda line: re.search('\S+',line)
   .group(0))
   .take(3)) # Get the IP address/host name
rslt

Out[ 3 ]: [‘in24.inetnebr.com’, ‘uplherc.upl.com’, ‘uplherc.upl.com’]

1.22 Get timestamp

Get the time stamp

rslt = (rdd.map(lambda line: re.search('(\S+ -\d{4})', line)
    .groups())
    .take(3))  # Get the date
rslt

[(‘[01/Aug/1995:00:00:01 -0400’,),

(‘[01/Aug/1995:00:00:07 -0400’,),

(‘[01/Aug/1995:00:00:08 -0400’,)]

1.23 HTTP request

Get the HTTP request sent to the Web server; the \w+ in the regex matches the HTTP verb (e.g. GET).

# Get the REST call within the quotes " "
rslt=(rdd.map(lambda line: re.search('"\w+\s+([^\s]+)\s+HTTP.*"',line)
    .groups())
    .take(3)) # Get the REST call
rslt

[(‘/shuttle/missions/sts-68/news/sts-68-mcc-05.txt’,),

(‘/’,),

(‘/images/ksclogo-medium.gif’,)]

1.24 Get HTTP response status

Get the HTTP response to the request

rslt=(rdd.map(lambda line: re.search('"\s(\d{3})',line)
    .groups())
    .take(3)) #Get the status
rslt

Out[ 6 ]: [(‘200’,), (‘304’,), (‘304’,)]

1.25 Get content size

Get the HTTP response in bytes

rslt = (rdd.map(lambda line: re.search('^.*\s(\d*)$', line)
    .groups())
    .take(3)) # Get the content size
rslt

Out[ 7 ]: [(‘1839’,), (‘0’,), (‘0’,)]

1.26 Putting it all together

Now put all the individual pieces together into one big regular expression and assign the matches to groups:

  1. Host
  2. Timestamp
  3. Path
  4. Status
  5. Content_size
rslt=(rdd.map(lambda line: re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)',line)
    .groups())
    .take(3))
rslt

[('in24.inetnebr.com', ' -', ' ', '-', '[01/Aug/1995:00:00:01 -0400]',
  '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0"',
  '/shuttle/missions/sts-68/news/sts-68-mcc-05.txt', '200 1839', '1839'),
 ('uplherc.upl.com', ' -', ' ', '-', '[01/Aug/1995:00:00:07 -0400]',
  '"GET / HTTP/1.0"', '/', '304 0', '0'),
 ('uplherc.upl.com', ' -', ' ', '-', '[01/Aug/1995:00:00:08 -0400]',
  '"GET /images/ksclogo-medium.gif HTTP/1.0"', '/images/ksclogo-medium.gif', '304 0', '0')]

1.27 Add a log parsing function

import re

def parse_log1(line):
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)',line)
    if match is None:    
        return(line,0)
    else:
        return(line,1)

1.28 Check for parsing failure

Check how many lines were successfully parsed with the parsing function.

n_logs = rdd.count()
failed = rdd.map(lambda line: parse_log1(line)).filter(lambda line: line[1] == 0).count()
print('Out of a total of {} logs, {} failed to parse'.format(n_logs,failed))
# Get the failed records line[1] == 0
failed1=rdd.map(lambda line: parse_log1(line)).filter(lambda line: line[1]==0)
failed1.take(3)

Out of a total of 3461613 logs, 38768 failed to parse

Out[10]:

[('gw1.att.com - - [01/Aug/1995:00:03:53 -0400] "GET /shuttle/missions/sts-73/news HTTP/1.0" 302 -', 0),
 ('js002.cc.utsunomiya-u.ac.jp - - [01/Aug/1995:00:07:33 -0400] "GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0" 404 -', 0),
 ('pipe1.nyc.pipeline.com - - [01/Aug/1995:00:12:37 -0400] "GET /history/apollo/apollo-13/apollo-13-patch-small.gif" 200 12859', 0)]

1.29 The above rule is not enough to parse the logs

It can be seen that the single rule parses only part of the logs, so we cannot simply call .group() on every line: the error “AttributeError: ‘NoneType’ object has no attribute ‘group’” shows up for lines that do not match.

#rdd.map(lambda line: re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)',line[0]).group()).take(4)

File "/databricks/spark/python/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
File " ", line 1, in 
AttributeError: 'NoneType' object has no attribute 'group'
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:490)
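
The underlying reason is that re.search returns None for a line that does not match, and calling .group() on None raises the AttributeError. A minimal illustration, using one of the failed records from above:

import re

# One of the failed records: the content size is "-" rather than digits
line = 'gw1.att.com - - [01/Aug/1995:00:03:53 -0400] "GET /shuttle/missions/sts-73/news HTTP/1.0" 302 -'
pattern = r'^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)'
match = re.search(pattern, line)
print(match)        # None - the trailing "-" does not satisfy (\d*)$
#match.group(0)     # would raise: AttributeError: 'NoneType' object has no attribute 'group'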

1.30 Add a rule for parsing failed records

One of the issues with the earlier rule is that the content_size field is “-” for some logs.

import re

def parse_failed(line):
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s-$)',line)
    if match is None:        
        return(line,0)
    else:
        return(line,1)

1.31 Parse records which fail

Parse the records that fail with the new rule.

failed2 = rdd.map(lambda line: parse_failed(line)).filter(lambda line: line[1] == 1)
failed2.take(5)

Out[13]:

[('gw1.att.com - - [01/Aug/1995:00:03:53 -0400] "GET /shuttle/missions/sts-73/news HTTP/1.0" 302 -', 1),
 ('js002.cc.utsunomiya-u.ac.jp - - [01/Aug/1995:00:07:33 -0400] "GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0" 404 -', 1),
 ('tia1.eskimo.com - - [01/Aug/1995:00:28:41 -0400] "GET /pub/winvn/release.txt HTTP/1.0" 404 -', 1),
 ('itws.info.eng.niigata-u.ac.jp - - [01/Aug/1995:00:38:01 -0400] "GET /ksc.html/facts/about_ksc.html HTTP/1.0" 403 -', 1),
 ('grimnet23.idirect.com - - [01/Aug/1995:00:50:12 -0400] "GET /www/software/winvn/winvn.html HTTP/1.0" 404 -', 1)]

1.32 Add both rules

Add both rules for parsing the log.

Note that even with both rules, not all of the logs are parsed; further rules would need to be added (one possible extra rule is sketched after the function below).

import re

def parse_log2(line):
    # Parse logs with the rule below
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(\d*)$',line)
    # If match failed then use the rule below
    if match is None:
        match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s-$)',line)
    if match is None:
        return (line, 0) # Return 0 for failure
    else:
        return (line, 1) # Return 1 for success
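
As noted, even these two rules do not cover every record. For example, one failed record shown earlier has a request with no HTTP version (“GET /history/apollo/apollo-13/apollo-13-patch-small.gif”). A possible third rule for such lines is sketched below; this is my own illustrative addition (including the name parse_log3), not part of the original notebook.

import re

def parse_log3(line):
    # Rule 1: normal records with a numeric content size
    match = re.search(r'^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(\d*)$', line)
    # Rule 2: records whose content size is "-"
    if match is None:
        match = re.search(r'^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s-$)', line)
    # Rule 3 (illustrative): requests without an HTTP version in the quoted string
    if match is None:
        match = re.search(r'^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)")\s(\d{3})\s(\d*|-)$', line)
    if match is None:
        return (line, 0) # Return 0 for failure
    else:
        return (line, 1) # Return 1 for success

# Count the records that still fail with the extra rule
# rdd.map(parse_log3).filter(lambda x: x[1] == 0).count()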

1.33 Map the regex matches to groups for handling

def map2groups(line):
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(\d*)$',line)
    if match is None:
        match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(-)$',line)
    return(match.groups())

1.34 Parse the logs and map the groups

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line: line[0])
parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))
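
As a quick sanity check (an illustrative addition, not from the original notebook), we can count the parsed records and peek at one grouped tuple:

# Number of successfully parsed records and a sample of the grouped fields
print(parsed_rdd.count())
parsed_rdd2.take(1)
# Roughly: [('in24.inetnebr.com', ' -', ' ', '-', '[01/Aug/1995:00:00:01 -0400]',
#            '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0"',
#            '/shuttle/missions/sts-68/news/sts-68-mcc-05.txt', '200', '1839')]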

2. Parse Web server logs with Pyspark

2.1 Read data into a Pyspark dataframe

import os
from pyspark.sql.functions import split, regexp_extract

logs_file_path = "/FileStore/tables/" + os.path.join('NASA_access_log_*.gz')
base_df = sqlContext.read.text(logs_file_path)
#base_df.show(truncate=False)
split_df = base_df.select(regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'),
                          regexp_extract('value', r'^.*\[(\d\d\/\w{3}\/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),
                          regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
                          regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
                          regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_df.show(5,truncate=False)

+---------------------+--------------------------+-----------------------------------------------+------+------------+
|host                 |timestamp                 |path                                           |status|content_size|
+---------------------+--------------------------+-----------------------------------------------+------+------------+
|199.72.81.55         |01/Jul/1995:00:00:01 -0400|/history/apollo/                               |200   |6245        |
|unicomp6.unicomp.net |01/Jul/1995:00:00:06 -0400|/shuttle/countdown/                            |200   |3985        |
|199.120.110.21       |01/Jul/1995:00:00:09 -0400|/shuttle/missions/sts-73/mission-sts-73.html   |200   |4085        |
|burger.letters.com   |01/Jul/1995:00:00:11 -0400|/shuttle/countdown/liftoff.html                |304   |0           |
|199.120.110.21       |01/Jul/1995:00:00:11 -0400|/shuttle/missions/sts-73/sts-73-patch-small.gif|200   |4179        |
+---------------------+--------------------------+-----------------------------------------------+------+------------+
only showing top 5 rows

2.2 Check data

bad_rows_df = split_df.filter(split_df['host'].isNull() |
                              split_df['timestamp'].isNull() |
                              split_df['path'].isNull() |
                              split_df['status'].isNull() |
                              split_df['content_size'].isNull())
bad_rows_df.count()

Out[ 20 ]: 33905

2.3 Check the number of rows which do not have digits

We have already seen with RDDs that the content_size field has ‘-‘ instead of digits.

bad_content_size_df = base_df.filter(~ base_df['value'].rlike(r'\d+$'))
bad_content_size_df.count()

Out[ 21 ]: 33905

2.4 Add ‘*’ to identify bad rows

To identify the rows that are bad, concatenate ‘*’ to the raw log line (the value field) for the rows where content_size does not have digits. It can be seen that content_size has ‘-‘ instead of a valid number.

from pyspark.sql.functions import lit, concat

bad_content_size_df.select(concat(bad_content_size_df['value'], lit('*'))).show(4,truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|concat(value, *)                                                                                                                                     |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|dd15-062.compuserve.com - - [01/Jul/1995:00:01:12 -0400] "GET /news/sci.space.shuttle/archive/sci-space-shuttle-22-apr-1995-40.txt HTTP/1.0" 404 -*  |
|dynip42.efn.org - - [01/Jul/1995:00:02:14 -0400] "GET /software HTTP/1.0" 302 -*                                                                     |
|ix-or10-06.ix.netcom.com - - [01/Jul/1995:00:02:40 -0400] "GET /software/winvn HTTP/1.0" 302 -*                                                      |
|ix-or10-06.ix.netcom.com - - [01/Jul/1995:00:03:24 -0400] "GET /software HTTP/1.0" 302 -*                                                            |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+

2.5 Fill NAs with 0s

# Replace all null content_size values with 0
cleaned_df = split_df.na.fill({'content_size': 0})
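
To confirm the fill worked (a small check added here for illustration), the remaining nulls can be counted; the result should be 0:

# Verify that no null content_size values remain after the fill
cleaned_df.filter(cleaned_df['content_size'].isNull()).count()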

3. Webserver logs parsing with SparkR

# Load the SparkR library
library(SparkR)
library(stringr)

file_location = "/FileStore/tables/NASA_access_log_Jul95.gz"
file_location = "/FileStore/tables/NASA_access_log_Aug95.gz"

# Initiate a SparkR session
sparkR.session()
sc <- sparkR.session()
sqlContext <- sparkRSQL.init(sc)
df <- read.text(sqlContext,"/FileStore/tables/NASA_access_log_Jul95.gz")

#df=SparkR::select(df, "value")
#head(SparkR::collect(df))
#m=regexp_extract(df$value,'\\\\S+',1)

a=df %>%
  withColumn('host', regexp_extract(df$value, '^(\\S+)', 1)) %>%
  withColumn('timestamp', regexp_extract(df$value, "((\\S+ -\\d{4}))", 2)) %>%
  withColumn('path', regexp_extract(df$value, '(\\"\\w+\\s+([^\\s]+)\\s+HTTP.*")', 2))  %>%
  withColumn('status', regexp_extract(df$value, '(^.*"\\s+([^\\s]+))', 2)) %>%
  withColumn('content_size', regexp_extract(df$value, '(^.*\\s+(\\d+)$)', 2))
#b=a%>% select(host,timestamp,path,status,content_type)
head(SparkR::collect(a),10)

1 199.72.81.55 – – [01/Jul/1995:00:00:01 -0400] “GET /history/apollo/ HTTP/1.0” 200 6245

2 unicomp6.unicomp.net – – [01/Jul/1995:00:00:06 -0400] “GET /shuttle/countdown/ HTTP/1.0” 200 3985

3 199.120.110.21 – – [01/Jul/1995:00:00:09 -0400] “GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0” 200 4085

4 burger.letters.com – – [01/Jul/1995:00:00:11 -0400] “GET /shuttle/countdown/liftoff.html HTTP/1.0” 304 0

5 199.120.110.21 – – [01/Jul/1995:00:00:11 -0400] “GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0” 200 4179

6 burger.letters.com – – [01/Jul/1995:00:00:12 -0400] “GET /images/NASA-logosmall.gif HTTP/1.0” 304 0

7 burger.letters.com – – [01/Jul/1995:00:00:12 -0400] “GET /shuttle/countdown/video/livevideo.gif HTTP/1.0” 200 0

8 205.212.115.106 – – [01/Jul/1995:00:00:12 -0400] “GET /shuttle/countdown/countdown.html HTTP/1.0” 200 3985

9 d104.aa.net – – [01/Jul/1995:00:00:13 -0400] “GET /shuttle/countdown/ HTTP/1.0” 200 3985

10 129.94.144.152 – – [01/Jul/1995:00:00:13 -0400] “GET / HTTP/1.0” 200 7074

host timestamp

1 199.72.81.55 [01/Jul/1995:00:00:01 -0400

2 unicomp6.unicomp.net [01/Jul/1995:00:00:06 -0400

3 199.120.110.21 [01/Jul/1995:00:00:09 -0400

4 burger.letters.com [01/Jul/1995:00:00:11 -0400

5 199.120.110.21 [01/Jul/1995:00:00:11 -0400

6 burger.letters.com [01/Jul/1995:00:00:12 -0400

7 burger.letters.com [01/Jul/1995:00:00:12 -0400

8 205.212.115.106 [01/Jul/1995:00:00:12 -0400

9 d104.aa.net [01/Jul/1995:00:00:13 -0400

10 129.94.144.152 [01/Jul/1995:00:00:13 -0400

path status content_size

1 /history/apollo/ 200 6245

2 /shuttle/countdown/ 200 3985

3 /shuttle/missions/sts-73/mission-sts-73.html 200 4085

4 /shuttle/countdown/liftoff.html 304 0

5 /shuttle/missions/sts-73/sts-73-patch-small.gif 200 4179

6 /images/NASA-logosmall.gif 304 0

7 /shuttle/countdown/video/livevideo.gif 200 0

8 /shuttle/countdown/countdown.html 200 3985

9 /shuttle/countdown/ 200 3985

10 / 200 7074

4 Webserver logs parsing with SparklyR

#install.packages("sparklyr")

Installing package into '/databricks/spark/R/lib'

5 Hosts

5.1 RDD

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line: line[0])
parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))  # as in section 1.34

# Count requests per host and take the top 10
rslt = (parsed_rdd2.map(lambda x: (x[0], 1))
        .reduceByKey(lambda a, b: a + b)
        .takeOrdered(10, lambda x: -x[1]))
rslt

Out[18]:

[(‘piweba3y.prodigy.com’, 21988),

(‘piweba4y.prodigy.com’, 16437),

(‘piweba1y.prodigy.com’, 12825),

(‘edams.ksc.nasa.gov’, 11962),

(‘163.206.89.4’, 9697),

(‘news.ti.com’, 8161),

(‘www-d1.proxy.aol.com’, 8047),

(‘alyssa.prodigy.com’, 8037),

(‘siltb10.orl.mmc.com’, 7573),

(‘www-a2.proxy.aol.com’, 7516)]

5.12 Plot counts of hosts

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(rslt, columns=['host', 'count'])
sns.barplot(x='host', y='count', data=df)
plt.subplots_adjust(bottom=0.6, right=0.8, top=0.9)
plt.xticks(rotation="vertical", fontsize=8)
display()

5.2 PySpark
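
With the dataframe approach, the host counts can be computed as in the sketch below (a minimal sketch, assuming the cleaned_df from section 2.5; not a reproduction of the original notebook cell):

from pyspark.sql.functions import desc

# Count requests per host and take the 10 most frequent
host_counts_df = (cleaned_df
                  .groupBy('host')
                  .count()
                  .orderBy(desc('count'))
                  .limit(10))
host_counts_df.show(truncate=False)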

5.22 Plot count of hosts

5.3 SparkR

5.32 Plot count of hosts

