INVESTIGATING THE DISTRIBUTIONAL PROPERTY OF THE SESSION WORKLOAD
Keywords:
Web session length, Session workload property, Web log analysis
Abstract
Companies now rely on the World Wide Web to communicate with their customers. As reliance on web servers grows, so does the need for companies to understand the workload placed upon these servers. The session is a popular unit of workload measurement for analyzing information recorded in server logs. Many web applications, from shopping carts to online banking systems, require session information to function correctly, and web data mining also depends on session workload information. However, the distributional properties of the session workload are not understood. Whether the session workload follows a short-tailed or a heavy-tailed distribution is a fundamental question for any investigation of the session workload unit. This paper empirically explores claims that the session workload can be described using a heavy-tailed distribution. The paper concludes that, for the samples used in this paper, no method exists that can accurately determine whether the session workload is drawn from a heavy-tailed distribution. Hence, it cannot be concluded that these samples are drawn from such a distribution.