Spark vs. Hadoop – a settled dilemma?

It goes without saying that over the last two decades the vast majority of institutions, companies, firms and the like have had to deal with the Big Data reality, which created an urgent need for processing platforms capable of storing and analyzing this vast amount of data. This is how Hadoop and, later on, around 2009, Spark came into the picture.


Reading a number of texts that compare the two processing engines leads to an initial conclusion: since they serve different purposes, they are hardly comparable, and they do not even seem to compete with each other, Spark being a natural development of Hadoop rather than its competitor. It brings to mind a comparison between a traditional paper book and an e-book reader such as the Kindle. The purpose of both is reading, but storing a huge library of traditional books requires a lot of space, while thousands of volumes fit in the memory of a small device that additionally provides a handy dictionary, a vocabulary builder, and the possibility of acquiring new titles online in no time, to mention just a few of its assets. This, however, does not mean that every keen reader will immediately purchase the latter: many, more traditionally inclined, are determined to stick to paper books despite the quite apparent advantages of the electronic gadget.

Nevertheless, let us attempt to point out some of the most conspicuous differences between the two platforms.

SPECIFY YOUR NEEDS

     It should be remembered that the core of Hadoop (referred to as a second-generation technology, as opposed to Spark, which is classified as third-generation) lies in its massive storage infrastructure, the Hadoop Distributed File System (HDFS), a component Spark does not have, which is why Spark needs to be integrated with some sort of file management system, for instance Hadoop's. Indeed, Spark was initially designed to work on top of Hadoop, and many still claim it works best in combination with it, though it can also run independently. The real strength of Spark lies in its processing power, speed and flexibility to analyse a wide range of workloads, whether batch, interactive, iterative or streaming, features Hadoop does not possess. Hadoop's processing engine, MapReduce, can handle only one kind of problem (offline/batch processing), yet that can be quite sufficient for those whose reporting requirements are mostly static and who do not mind waiting for the results. Still, the speed gap is considerable: in the Daytona GraySort benchmark, Spark sorted 100 TB of data in 23 minutes, whereas the same task took Hadoop 72 minutes. It therefore comes as no surprise that we are now witnessing a massive migration to Spark, as it operates on various types of workloads, online (live), offline (batch), machine learning or graph data, areas that would pose insurmountable problems for Hadoop. Let the figures speak for themselves. According to a survey on Spark conducted by Typesafe in 2015, 78% of users declared a need for faster processing of larger data sets, 82% opted for Spark to replace MapReduce, and 67% already use it for event stream processing. On the other hand, 62% of users load data into Spark from Hadoop's HDFS, which not only proves the value of its storage capacity but also shows that Hadoop is not going to become a thing of the past in the foreseeable future.
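To make the MapReduce batch model mentioned above concrete, here is a minimal pure-Python sketch of its three phases (map, shuffle, reduce) applied to the classic word-count problem. This is an illustration of the programming model only, not the Hadoop API; note that each phase must run to completion before the next one starts, and on a real cluster the shuffle step is where intermediate results hit the disk.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    # Shuffle: group all values by key; on a real cluster this is the
    # step where intermediate results are written to and read from disk.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Spark and Hadoop", "Spark is fast"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # word frequencies across all input lines
```

The rigid phase boundaries are exactly what makes this model a good fit for static, offline reporting and a poor fit for the interactive and streaming workloads Spark was built for.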


SOME REAL LIFE APPLICATIONS

     The possibilities of applying Spark for diversified purposes are really impressive. Let us mention just a few:

FIND YOUR FAVOURITES IN NO TIME AT ALL

–  At the very basic level, there is the enormously popular music service Spotify: thanks to Spark, playlists of a user's favourite pieces of music, built according to their preferences, have seen the light of day; a feature I personally value and greatly enjoy. Similarly, Amazon recommends particular kinds of literature to Kindle users based on their previous purchases. The above examples, only a drop in the ocean, point to another important feature of Spark, namely that it operates in real time, constantly ingesting and analyzing updated information. This ability to process real-time data at a rate of millions of events per second makes it a valuable tool for processing the enormous environment of social media, such as Twitter data or Facebook's sharing and posting functions.

A BREAKTHROUGH IN THE MARKET OF COMMERCIALS

–  Firms serving advertisements on the Internet can now decide, with a few clicks, which ads users should be exposed to. Previously this required visiting various sites one by one, which took a lot of effort, was very time-consuming and, in turn, produced rather subjective conclusions. No longer, however!

SHARE MARKET REVOLUTION

–   Obviously enough, constant fluctuation is part and parcel of the share market. Suppose we need to analyse the dominant trends of the last decade; this, by the way, may be done with Hadoop, provided we have enough time to wait for the results. Spark, however, turns out to be indispensable when it comes to comparing that historical analysis with today's circumstances, thanks to its ability to operate in real time. Such a task would be difficult, slow, or even impossible with Hadoop's MapReduce. The same applies to the banking industry, which also depends on extremely changeable data. Therefore, for both, the application of Spark rather than Hadoop seems the most reasonable solution.
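The pattern described above, comparing a slowly computed historical baseline with a cheaply updated live window, can be sketched as follows. The prices are invented, and the threshold logic is purely illustrative; the point is that the bulk computation over history and the lightweight computation over the latest ticks are two different kinds of work.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical closing prices: a long historical series (the batch side)
# and the latest live ticks (the streaming side).
historical = [100, 102, 98, 105, 110, 108, 112, 115]   # "the last decade"
live_ticks = [130, 128, 131]                            # "right now"

baseline = mean(historical)       # heavy, bulk computation over history
current = mean(live_ticks[-3:])   # cheap update over the latest window

# Flag when the live price diverges sharply from the historical trend.
deviation = (current - baseline) / baseline
print(round(deviation, 3))
```

In a real system the historical side would be a long-running job over years of data, while the live side must refresh within seconds, which is precisely where an engine limited to batch processing falls short.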

ALL-IMPORTANT SPEED!

        The speed argument certainly marks Spark's victory, to put it a bit humorously. Spark performs in-memory processing of data, which is 10 to 100 times faster depending on the analyzed factor. This is because no time is spent moving data in and out of the disk, as is the case with MapReduce, whose input/output operations between stages take a substantial amount of time. Another time-saving factor is Spark's ease of programming, thanks to high-level operators on Resilient Distributed Datasets (RDDs). Last but not least, initially conceived at the University of California, Spark is now supported and led by the Apache Software Foundation and currently has more contributors than any other Apache project. It seems worth knowing that the team includes engineers and developers from, for example, Yahoo, Groupon, Alibaba and Mint.
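The in-memory idea above can be illustrated with a tiny RDD-like class, a loose sketch rather than Spark's actual implementation: transformations are recorded lazily, and a `cache()` call keeps a computed result in memory so that later actions reuse it instead of recomputing it (or, in MapReduce's case, re-reading it from disk).

```python
class LazySeq:
    """Tiny sketch of an RDD-like lazy collection: map() records a
    transformation without executing it, and cache() keeps a computed
    result in memory so repeated actions do not recompute the pipeline."""

    def __init__(self, data, transforms=None):
        self._data = data
        self._transforms = transforms or []
        self._cached = None

    def map(self, fn):
        # Lazily record the transformation; nothing runs yet.
        return LazySeq(self._data, self._transforms + [fn])

    def cache(self):
        # Materialize once and keep the result in memory.
        self._cached = self.collect()
        return self

    def collect(self):
        # Action: reuse the cached result if present, else compute.
        if self._cached is not None:
            return self._cached
        out = list(self._data)
        for fn in self._transforms:
            out = [fn(x) for x in out]
        return out

nums = LazySeq([1, 2, 3]).map(lambda x: x * 10).cache()
# Both actions below read the in-memory cached result; the pipeline
# is not re-executed, which is where the speed advantage comes from.
print(nums.collect(), sum(nums.collect()))
```

Iterative workloads such as machine learning, which reread the same data many times, are exactly where this caching pays off most.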


ON THE OTHER HAND …

     In spite of Apache Spark's obvious advantages, it should be remembered that the two frameworks complement rather than compete with each other. The appearance of Spark to a large extent resulted from the existence of Hadoop, which still is, and will most probably remain, valued for its huge storage capacity. Additionally, running Spark is more expensive because it requires more RAM, although the falling prices of RAM should not be overlooked here. Apache Spark is also said to be slightly less secure, since its authentication relies on a shared secret password. As far as resilience is concerned, Hadoop is naturally resilient to system faults and failures, as data are written to disk after every operation. The same holds for Spark thanks to the built-in resiliency of RDDs, which provide full recovery from any faults or failures that may occur.

CONCLUSION

     It can hardly be denied that Apache Spark is a more advanced cluster computing engine than Hadoop MapReduce. The former, being much faster, is also capable of handling many kinds of requirements, including batch, interactive, iterative, streaming and graph processing, whereas MapReduce is limited to batch processing only. This is why the community of Spark users is growing steadily and quickly replacing MapReduce. However, Spark should not be approached as a replacement for Hadoop, which is by no means dead. Spark was conceived as an improvement to the once widely popular Hadoop framework, whose role, particularly its storage capacity, should not be underestimated. After all, using a Kindle for reading does not exclude owning a huge library of paper books at the same time!

The article was written by Szymon Kieloch.

