INTRODUCTION

Many companies are struggling to manage the massive amounts of data they collect. Whereas in the past they may have used a data warehouse platform, such conventional architectures can fall short when dealing with data that originates from numerous internal and external sources and often varies in structure and content. New technologies have emerged to help -- most prominently Hadoop, a distributed processing framework designed to address the volume and complexity of big data environments involving a mix of structured, unstructured and semi-structured data. Part of Hadoop's allure is that it consists of a variety of open source software components and associated tools for capturing, processing, managing and analyzing data. In addition, many vendors offer commercial Hadoop distributions that provide performance and functionality enhancements over the base Apache open source technology and bundle the software with maintenance and support services. This article covers some recent news about big data and Hadoop, including new software being promoted for implementing Hadoop, security threats facing such deployments, and how businesses are using big data to transform industries.

APACHE ARROW

Hadoop, Spark and Kafka have already had a defining influence on the world of big data, and now there's yet another Apache project with the potential to shape the landscape even further: Apache Arrow. The Apache Software Foundation launched Arrow as a full-fledged top-level project designed to provide a high-performance data layer for columnar in-memory analytics across disparate systems. The project is backed by the founders of Dremio, who are also the force behind Apache Drill. Initially seeded by code from the Apache Drill project, Apache Arrow builds on a number of open source collaborations and establishes a de facto standard for columnar in-memory processing and interchange.
Code committers to Apache Arrow include developers from the Apache big data projects Calcite, Cassandra, Drill, Hadoop, HBase, Impala, Kudu (incubating), Parquet, Phoenix, Spark, and Storm, as well as established and emerging open source projects such as Pandas and Ibis. Arrow code is available now for implementation in C, C++, Python, and Java, with implementations for R, JavaScript, and Julia due in 1 to 2 months, according to Jacques Nadeau, VP of the Arrow and Drill projects at the Apache Software Foundation and the co-founder and CTO of open source big data startup Dremio. "My role in driving this is getting all the users on the same page," he said.

"The core of Arrow is making processing systems faster," Nadeau said. Arrow does this by enabling different big data components to talk to each other more easily: it defines a common internal representation shared by each big data system component, so data does not have to be copied and converted as it moves from Spark to Cassandra, or from Apache Drill to Kudu, for example. Arrow also supports columnar in-memory complex analytics. This is essentially a fusion of columnar data storage (like that provided by Apache Parquet) with systems that hold data in memory (like SAP HANA and Apache Spark), adding complex hierarchical and nested data structures (like JSON).

Arrow improves CPU performance by lining up data to match CPU instructions and cache locality, streamlining the flow of data into the CPU. The CPU can stick to processing rather than searching for and pulling data from the cache. This data alignment also permits use of superword and SIMD (Single Instruction, Multiple Data) instructions, which further boosts performance. By optimizing cache locality, data pipelines, and SIMD, performance gains of 10x to 100x can be achieved, Nadeau said. Apache Arrow should come into its own as users tap different tools for different missions in the realm of big data, Nadeau pointed out.
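The columnar layout at the heart of this design can be illustrated in plain Python. The sketch below uses only the standard library, not the Arrow API itself; the field names and data are made up for illustration:

```python
from array import array

# Row-oriented storage: each record is a separate Python object, so
# reading one field means hopping across scattered heap objects.
rows = [{"user_id": i, "score": i * 0.1} for i in range(5)]

# Columnar (Arrow-style) storage: each field lives in one contiguous
# typed buffer, so a scan over "score" streams linearly through memory --
# the layout that enables CPU cache locality and SIMD vectorization.
columns = {
    "user_id": array("q", (r["user_id"] for r in rows)),  # int64 buffer
    "score":   array("d", (r["score"] for r in rows)),    # float64 buffer
}

# Aggregating a column is one linear pass over a contiguous buffer.
total = sum(columns["score"])
print(round(total, 1))  # 1.0
```

Because every system agrees on this one contiguous layout, handing a table from one engine to another becomes a pointer handoff rather than a per-row copy-and-convert step, which is where Arrow's interchange savings come from.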
"We are making each workload more efficient…[Arrow] will change a lot of things," he said.

SECURITY THREATS

Questions have recently arisen about security for big data analytics. Here's why: when you can't analyze data in place, you need to copy it -- at which point all the stipulations for data manipulation should be replicated, too. Today, that's nearly impossible to do. One way out is to adopt the policy-based approach that has arisen in the broader security market. A brief look at the history of access control, and how it evolved to produce a policy-based model, helps explain the idea.

Initially, systems relied on a username/password combination to keep unauthorized users out. This had an inherent problem: the number of username/password combinations tended to explode as new applications were written, so we ended up using different combinations for different applications. Worse, some applications asked for different passwords to reach different levels of security. We became smarter and separated "roles" from usernames to control access to administrative functions. However, each application tended to implement this on its own, so we still had a growing list of passwords to remember. We became even smarter and created central systems that eventually became LDAP, Active Directory, and the like -- but this replaced one problem with another. In reality, most applications think of roles differently, and besides, simply because you're an admin for one application doesn't mean you should be an admin for another. Which begs the question: who ends up in charge of adding new roles? It tends to be either some IT-administrative or shared HR function. A fix is presented in the form of the policy-based model, whose security configuration quite often lives in a central repository and relies on central authentication mechanisms (LDAP, Kerberos, and so on).
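The policy-based model described here can be sketched in a few lines of Python. Everything below -- the policy table, the attribute names, the helper functions -- is illustrative only, not any real product's API:

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    attributes: dict = field(default_factory=dict)  # e.g. department, level

# Each policy is a predicate over the user's attributes. Access decisions
# come from attributes, not from a flat role name (the ABAC idea).
POLICIES = {
    "read_claims": lambda a: a.get("department") == "claims",
    "admin_panel": lambda a: a.get("department") == "it"
                             and a.get("level", 0) >= 3,
}

def is_allowed(user: User, action: str) -> bool:
    """Look up the policy for an action and evaluate it for this user."""
    policy = POLICIES.get(action)
    return bool(policy and policy(user.attributes))

alice = User("alice", {"department": "claims"})
print(is_allowed(alice, "read_claims"))  # True
print(is_allowed(alice, "admin_panel"))  # False
```

Note that the user and her attributes could come from a central store (LDAP, Active Directory), while each application supplies the predicates that make sense for its own actions.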
The difference is that instead of maintaining simple roles, each user is associated with a set of policies. The policies are based on a set of attributes about the user, an approach also known as attribute-based access control (ABAC). Because the policies are entirely application-dependent, they cannot be enforced centrally; each application must evaluate them itself.

BIG DATA & INSURANCE

Tony Almeida, who leads the Insights & Data North America Insurance Analytics practice of Capgemini's Global Financial Services Business Unit, discusses how an increasing number of connected devices and new data sources are transforming traditional insurance models like never before. Tony has led the creation of BI/analytics solutions and delivery for 20 years and has been at the forefront of predictive analytics solutions since late 2005.

Big data solutions have become pervasive in the insurance industry. They have enabled insurers to take advantage of advanced analytics to consolidate data, especially customer data, from a broader range of sources and extract more information they can use to improve pricing, claims settlement, risk options and more customer-centric products. However, in addition to offering vast opportunities, this also raises challenges, most notably around managing risk at the customer-centric, product-creation level. As customers "bounce from one product to another" in search of the best policy, insurers can leverage the data insights from customer changes and feedback to inform engagement and retention decisions.

In insurance, efficiency is an important keyword. Insurers must set the price of premiums at a level that ensures them a profit by covering their risk, but also fits the budget of the customer. Many insurers now offer telematics-based packages, in which actual driving information is fed back to their systems so that a personalized, highly accurate profile of an individual customer's behavior can be built up.
US insurer Progressive offers a great example of a business that has committed to working with data to enhance its services. A similar revolution is underway in the world of health and life insurance due to the growing prevalence of wearable technology such as the Apple Watch and Fitbit activity trackers, which can monitor a person's habits and provide ongoing assessment of their lifestyle and activity levels. According to research by Accenture, a third of insurers now offer services based on the use of these devices. One of these is John Hancock, which offers users discounts on their premiums and a free Fitbit wearable monitor.