
thzll2001/WeblogChallenge

 
 

Processing & Analytical goals:

  1. Sessionize the web log by IP. Sessionize = aggregate all page hits by visitor/IP during a session. https://en.wikipedia.org/wiki/Session_(web_analytics)

  2. Determine the average session time

  3. Determine unique URL visits per session. To clarify, count a hit to a unique URL only once per session.

  4. Find the most engaged users, i.e., the IPs with the longest session times
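
The four goals above can be sketched in plain Python as follows. This is a minimal illustration, not the repository's implementation: the 15-minute inactivity timeout, the `(ip, timestamp, url)` input shape, and all function names are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: a session ends after 15 minutes without a hit from the same IP.
SESSION_TIMEOUT = timedelta(minutes=15)


def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group (ip, timestamp, url) hits into sessions per IP (goal 1)."""
    by_ip = defaultdict(list)
    for ip, ts, url in hits:
        by_ip[ip].append((ts, url))

    sessions = []
    for ip, ip_hits in by_ip.items():
        ip_hits.sort(key=lambda h: h[0])  # order each visitor's hits by time
        current = None
        for ts, url in ip_hits:
            # Start a new session on the first hit or after a long gap.
            if current is None or ts - current["end"] > timeout:
                current = {"ip": ip, "start": ts, "end": ts, "urls": set()}
                sessions.append(current)
            current["end"] = ts
            current["urls"].add(url)  # a set counts each URL once (goal 3)
    return sessions


def duration_seconds(session):
    return (session["end"] - session["start"]).total_seconds()


def average_session_seconds(sessions):
    """Mean session duration in seconds (goal 2)."""
    if not sessions:
        return 0.0
    return sum(duration_seconds(s) for s in sessions) / len(sessions)


def unique_urls_per_session(sessions):
    """(ip, session start, number of distinct URLs) per session (goal 3)."""
    return [(s["ip"], s["start"], len(s["urls"])) for s in sessions]


def most_engaged(sessions, n=10):
    """The n sessions with the longest duration (goal 4)."""
    return sorted(sessions, key=duration_seconds, reverse=True)[:n]
```

Note that under this definition a session with a single hit has a duration of zero, which pulls the average down; whether to keep or exclude such sessions is a design choice the goals leave open.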

Additional questions for Machine Learning Engineer (MLE) candidates:

  1. Predict the expected load (requests/second) in the next minute

  2. Predict the session length for a given IP

  3. Predict the number of unique URL visits by a given IP
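
For the first MLE question, a naive baseline is a useful reference point before fitting any model. The sketch below predicts the next minute's load as the mean requests/second over the most recent minutes. The function names and the 5-minute window are assumptions for illustration, and a moving average is a stand-in baseline, not a method prescribed by the challenge.

```python
from collections import Counter
from datetime import datetime, timedelta


def requests_per_second_by_minute(timestamps):
    """Return [(minute_start, requests/second)] with empty minutes as 0."""
    counts = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    if not counts:
        return []
    minute, last = min(counts), max(counts)
    series = []
    while minute <= last:
        # Average rate over the 60 seconds of this minute.
        series.append((minute, counts.get(minute, 0) / 60.0))
        minute += timedelta(minutes=1)
    return series


def predict_next_minute_load(series, window=5):
    """Moving-average baseline: mean rate over the last `window` minutes."""
    recent = [rate for _, rate in series[-window:]]
    return sum(recent) / len(recent) if recent else 0.0
```

Any learned model for this task should at least beat this baseline on held-out minutes; the same idea (a per-IP historical mean) gives a comparable baseline for the session length and unique URL questions.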

HDP Sandbox: http://hortonworks.com/hdp/downloads/ or CDH QuickStart VM: http://www.cloudera.com/content/cloudera/en/downloads.html

Access log entry format: http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/access-log-collection.html#access-log-entry-format
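
As a rough sketch of how a single entry in the Elastic Load Balancing access log format linked above might be parsed, the example below extracts the three fields the goals need: timestamp, client IP, and requested URL. It assumes the classic space-delimited layout (timestamp, load balancer name, client:port, backend:port, three timing fields, two status codes, two byte counts, then the quoted request and user agent). The `Hit` type and `parse_line` name are hypothetical, and the sample line uses placeholder values.

```python
import shlex
from datetime import datetime
from typing import NamedTuple


class Hit(NamedTuple):
    timestamp: datetime
    client_ip: str
    url: str


def parse_line(line: str) -> Hit:
    # shlex.split keeps the quoted request and user-agent fields intact.
    fields = shlex.split(line)
    timestamp = datetime.strptime(fields[0], "%Y-%m-%dT%H:%M:%S.%fZ")
    client_ip = fields[2].rsplit(":", 1)[0]  # drop the client port
    # Assumed position: the request is the 12th field, "METHOD URL PROTOCOL".
    _method, url, _protocol = fields[11].split(" ", 2)
    return Hit(timestamp, client_ip, url)


sample = (
    '2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 '
    '10.0.0.1:80 0.000073 0.001048 0.000057 200 200 0 29 '
    '"GET http://www.example.com:80/ HTTP/1.1" "curl/7.38.0" - -'
)
hit = parse_line(sample)
```

Real logs may contain user-agent strings with unbalanced quotes, which `shlex.split` rejects; a production parser would need to handle or skip such lines.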

About

Weblog analysis

Languages

  • Scala 94.5%
  • Python 5.5%