Skip to main content
Log in

Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

MapReduce has demonstrated itself to be as a highly efficient programming model for processing massive dataset on the distributed system. One of the most important obstacles hindering the performance of MapReduce is data skewness. The presence of data skewness leads to considerable load imbalance on the reducers and performance degradation. In this paper, the problem of how to efficiently accommodate intermediate data to even up the load of all reducers is studied when encountering skewed data. A scalable sampling algorithm is used which it can observe a more precise approximate distribution of the keys by sampling only a small fraction of the intermediate data. Afterwards, it is applied to evaluate the overall distribution of the keys. In addition, we propose a sorted-balance algorithm based on sampling results: sorted-balance algorithm using scalable simple random sampling (SBaSC). This work not only puts forward a load-balanced partitioning strategy, but also proves a significant approximation ratio of SBaSC. The experiments confirm that our solution attains a better execution time and load balancing results.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12

Similar content being viewed by others

References

  1. Akoka J, Comyn-Wattiau I, Laoufi N (2017) Research on big data—a systematic mapping study. Comput Stand Interfaces 54:105–115. https://doi.org/10.1016/j.csi.2017.01.004

    Article  Google Scholar 

  2. Alharthi A, Krotov V, Bowman M (2017) Addressing barriers to big data. Bus Horizons 60(3):285–292. https://doi.org/10.1016/j.bushor.2017.01.002

    Article  Google Scholar 

  3. Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya AY, Foufou S, Bouras A (2014) A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Top Comput 2(3):267–279. https://doi.org/10.1109/TETC.2014.2330519

    Article  Google Scholar 

  4. Lee I (2017) Big data: dimensions, evolution, impacts, and challenges. Bus Horizons 60(3):293–303. https://doi.org/10.1016/j.bushor.2017.01.004

    Article  Google Scholar 

  5. Big Data (2018) https://en.wikipedia.org/wiki/Big_data

  6. Wu H (2017) Big data management the mass weather logs. In: Smart Computing and Communication, pp 122–132

  7. Vaidya M (2012) Parallel processing of cluster by MapReduce. Int J Distrib Parallel Syst 3:167–179. https://doi.org/10.5121/ijdps.2012.3113

    Article  Google Scholar 

  8. Xu Y, Qu W, Li Z, Liu Z, Ji C, Li Y, Li H (2014) Balancing reducer workload for skewed data using sampling-based partitioning. Comput Electr Eng 40(2):675–687. https://doi.org/10.1016/j.compeleceng.2013.07.001

    Article  Google Scholar 

  9. Gufler B, Augsten N, Reiser A, Kemper A (2012) Load balancing in MapReduce based on scalable cardinality estimates. In: IEEE 28th International Conference on Data Engineering, pp 522–533. https://doi.org/10.1109/icde.2012.58

  10. Meng X (2013) Scalable simple random sampling and stratified sampling. In: Proceedings of the 30th International Conference on International Conference on Machine Learning, Vol. 28, pp III-531–III-539

  11. DeWitt DJ, Naughton JF, Schneider DA, Seshadri S (1992) Practical skew handling in parallel joins. In: Proceedings of the 18th International Conference on Very Large Data Bases, pp 27–40

  12. Stamos JW, Young HC (1993) A symmetric fragment and replicate algorithm for distributed joins. IEEE Trans Parallel Distrib Syst 4(12):1345–1354. https://doi.org/10.1109/71.250116

    Article  Google Scholar 

  13. Le Y, Liu J, Ergün F, Wang D (2014) Online load balancing for MapReduce with skewed data input. In: IEEE Conference on Computer Communications IEEE INFOCOM 2014, pp 2004–2012. https://doi.org/10.1109/infocom.2014.6848141

  14. Karapiperis D, Verykios VS (2015) Load-balancing the distance computations in record linkage. SIGKDD Explor Newsl 17(1):1–7. https://doi.org/10.1145/2830544.2830546

    Article  Google Scholar 

  15. Li J, Liu Y, Pan J, Zhang P, Chen W, Wang L (2017) Map-balance-reduce: an improved parallel programming model for load balancing of MapReduce. Future Gener Comput Syst. https://doi.org/10.1016/j.future.2017.03.013

    Google Scholar 

  16. Vu L, Alaghband G (2015) A load balancing parallel method for frequent pattern mining on multi-core cluster. In: Proceedings of the Symposium on High Performance Computing, pp 49–58

  17. Kwon Y, Balazinska M, Howe B, Rolia J (2010) Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In: Proceedings of the 1st ACM Symposium on Cloud Computing, pp 75–86. https://doi.org/10.1145/1807128.1807140

  18. Ramakrishnan SR, Swart G, Urmanov A (2012) Balancing reducer skew in MapReduce work-loads using progressive sampling. In: Proceedings of the Third ACM Symposium on Cloud Computing, pp 1–14. https://doi.org/10.1145/2391229.2391245

  19. Gufler B, Augsten N, Reiser A, Kemper A (2011) Handling data skew in MapReduce. In: Proceedings of the 1st International Conference on Cloud Computing and Services Science, CLOSER 2011, pp 1–6

  20. Ibrahim S, Jin H, Lu L, Wu S, He B, Qi L (2010) LEEN: locality/fairness-aware key partitioning for MapReduce in the Cloud. In: 2010 IEEE Second International Conference on Cloud Computing Technology and Science, pp 17–24. https://doi.org/10.1109/cloudcom.2010.25

  21. Kwon Y, Balazinska M, Howe B, Rolia J (2012) SkewTune: mitigating skew in mapreduce applications. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp 25–36. https://doi.org/10.1145/2213836.2213840

  22. Martha VS, Zhao W, Xu X (2013) h-MapReduce: a framework for workload balancing in MapReduce. In: 2013 IEEE 27th International Conference on Advanced Information Networking and Applications (AINA), pp 637–644. https://doi.org/10.1109/aina.2013.48

  23. Chen Q, Yao J, Xiao Z (2015) LIBRA: lightweight data skew mitigation in MapReduce. IEEE Trans Parallel Distrib Syst 26(9):2520–2533. https://doi.org/10.1109/TPDS.20-14.2350972

    Article  Google Scholar 

  24. Xu Y, Zou P, Qu W, Li Z, Li K, Cui X (2012) Sampling-based partitioning in MapReduce for skewed data. In: 2012 Seventh China Grid Annual Conference, pp 1–8. https://doi.org/10.1109/chinagrid.2012.18

  25. Tang Z, Zhang X, Li K, Li K (2018) An intermediate data placement algorithm for load balancing in Spark computing environment. Future Gener Comput Syst 78:287–301. https://doi.org/10.1016/j.future.2016.06.027

    Article  Google Scholar 

  26. Devore JL (2011) Probability and statistics for engineering and the sciences. Nelson Education, Scarborough

    Google Scholar 

  27. Estimating a Proportion for a small, finite population (2018) https://onlinecourses.science.psu.edu/stat414/node/264

  28. Walpole REMRH, Myers SL, Ye K (2011) Probability statistics for engineers and scientists. Pearson Prentice Hall, Upper Saddle River

    MATH  Google Scholar 

  29. Vitter JS (1985) Random sampling with a reservoir. ACM Trans Math Softw 11(1):37–57. https://doi.org/10.1145/3147.3165

    Article  MathSciNet  MATH  Google Scholar 

  30. Sunter AB (1977) List sequential sampling with equal or unequal probabilities without placement. J R Stat Soc Ser C (Appl Stat) 26(3):261–268. https://doi.org/10.2307/2346966

    MathSciNet  Google Scholar 

  31. Blum M, Floyd RW, Pratt V, Rivest RL, Tarjan RE (1973) Time bounds for selection. J Comput Syst Sci 7(4):448–461. https://doi.org/10.1016/S0022-0000(73)80033-9

    Article  MathSciNet  MATH  Google Scholar 

  32. Graham RL, Lawler EL, Lenstra JK, Kan AHGR (1979) Optimization and approximation in deterministic sequencing and scheduling: a survey. In: Hammer PL, Johnson EL, Korte BH (eds) Annals of discrete mathematics, vol 5. Elsevier, Amsterdam, pp 287–326. https://doi.org/10.1016/S0167-5060(08)70356-X

    Google Scholar 

  33. Graham R (1969) Bounds on multiprocessing timing anomalies. SIAM J Appl Math 17(2):416–429. https://doi.org/10.1137/0117039

    Article  MathSciNet  MATH  Google Scholar 

  34. Kleinberg J, Tardos É (2006) Algorithm design. Pearson/Addison-Wesley, Boston

    Google Scholar 

  35. Williamson DP, Shmoys DB (2011) The design of approximation algorithms. Cambridge University Press, Cambridge

    Book  MATH  Google Scholar 

  36. Jimmy L (2009) The curse of Zipf and limits to parallelization: a look at the stragglers problem in MapReduce. In: Proceedings of LSDS-IR Workshop

  37. Zipf GK (1949) Human behavior and the principle of least effort: an introduction to human ecology. Addison-Wesley Press, Boston

    Google Scholar 

  38. Apache Spark Examples (2017) https://spark.apache.org/examples.html

  39. Range Partitioner (2017) https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html

  40. Altman DG, Bland JM (1996) Statistics notes: detecting skewness from summary information. BMJ 313(7066):1200

    Article  Google Scholar 

  41. Khatami Z, Hong S, Lee J, Depner S, Chafi H, Ramanujam J, Kaiser H (2017) A load-balanced parallel and distributed sorting algorithm implemented with PGX.D. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp 1317–1324. https://doi.org/10.1109/IPDPSW.2017.30

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ali Rezaee.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Gavagsaz, E., Rezaee, A. & Haj Seyyed Javadi, H. Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling. J Supercomput 74, 3415–3440 (2018). https://doi.org/10.1007/s11227-018-2391-9

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-018-2391-9

Keywords

Navigation