Sunday, April 29, 2018

Domain where Hadoop can be used: RETAIL


Build a 360° View of the Customer

Retailers interact with customers across multiple channels, yet customer interaction and purchase data is often isolated in data siloes. Few retailers can accurately correlate eventual customer purchases with marketing campaigns and online browsing behavior.

Connected Data Platforms give retailers a single view of customer behavior. They let retailers store data longer and identify phases of the customer lifecycle. Analytics increase sales, reduce inventory expenses and help retain the best customers.

Analyze Brand Sentiment

Enterprises lack a reliable way to track their brand health. It is difficult to analyze how advertising, competitor moves, product launches or news stories affect the brand. Internal brand studies can be slow, expensive and flawed.

Connected Data Platforms enable quick, unbiased snapshots of brand opinions expressed in social media. Retailers can analyze sentiment from Twitter, Facebook, LinkedIn or industry-specific social media streams. With a better understanding of customer perceptions, they can align their communications, products and promotions with those perceptions.

Localize and Personalize Promotion

Retailers that can geo-locate their mobile subscribers can deliver localized and personalized promotions. This requires connections with both historical and real-time streaming data.

Apache Hadoop® and Apache NiFi bring the data together to inexpensively localize and personalize promotions delivered to mobile devices. Retailers can develop mobile apps to notify customers about local events and sales that align with their preferences and geographic location (even down to a particular section in a specific store).

In time for the 2013 holiday shopping season, Macy’s launched a test in two flagship stores with Apple’s iBeacons technology. This article describes how, “down the road, Macy’s might also ping shoppers on a department-by-department basis, possibly telling them about sneaker sales when they’re in the shoe section, or even recommending nearby products.”

Optimize Websites

Online shoppers leave billions of clickstream data trails. Clickstream data can tell retailers the web pages customers visit and what they buy (or what they don’t buy) on their site. But at scale, the huge volume of unstructured weblogs is difficult to ingest, store, refine and analyze for insight. Storing web log data in relational databases is too expensive.

Apache Hadoop can store all web logs, for years, at a low cost. Web retailers use information in that data to understand user paths, do basket analysis, run A/B tests and prioritize site updates. This improves online conversions and increases revenue.
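As a rough illustration of the clickstream analysis described above, here is a minimal PySpark sketch. It assumes web logs have already landed in HDFS as tab-delimited records with user, timestamp, URL and referrer fields; the paths, field layout and checkout URL pattern are all hypothetical.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("clickstream-conversion").getOrCreate()

# Hypothetical layout: one tab-delimited record per page view.
logs = spark.read.csv(
    "hdfs:///data/weblogs/*",
    sep="\t",
    schema="user_id STRING, ts TIMESTAMP, url STRING, referrer STRING")

# Did each visitor ever reach the order-confirmation page?
visits = logs.groupBy("user_id").agg(
    F.max(F.when(F.col("url").like("%/checkout/confirm%"), 1).otherwise(0)).alias("converted"))

# First page seen by each visitor (their landing page).
first_hit = Window.partitionBy("user_id").orderBy("ts")
landing = (logs.withColumn("rn", F.row_number().over(first_hit))
               .filter("rn = 1")
               .select("user_id", F.col("url").alias("landing_page")))

# Conversion rate by landing page -- the kind of signal used to prioritize site updates.
(landing.join(visits, "user_id")
        .groupBy("landing_page")
        .agg(F.count("*").alias("visitors"), F.avg("converted").alias("conversion_rate"))
        .orderBy(F.desc("visitors"))
        .show())
```

The same raw table of page views could feed A/B test comparisons or basket analysis later without re-ingesting the data.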

Optimize Store Layouts

In-store layout and product placement affect sales. Retailers often hire extraneous staff to make up for a sub-optimal layout (e.g. “Are you finding what you need?”). Brick-and-mortar stores lack “pre-cash register” data about what in-store shoppers do before they buy. In-store sensors, RFID tags & QR codes can fill that data gap, but they generate a lot of data.

Apache Hadoop can store that huge volume of unstructured sensor and location data. This intelligence allows retailers to reduce costs and simultaneously improve customer in-store satisfaction. This improves same-store sales and customer loyalty.

Domain where Hadoop can be used: PUBLIC SECTOR


Use Machine and Sensor Data to Proactively Maintain Public Infrastructure

Metro Transit of St. Louis (MTL) operates the public transportation system for the St. Louis metropolitan region. Hortonworks Data Platform helps MTL meet their mission by storing and analyzing IoT data from the city’s Smart Buses, which helped the agency cut average cost per mile driven by its buses from $0.92 to $0.43. It achieved that cost reduction while simultaneously doubling the annual miles driven per bus. Hortonworks delivered the MTL solution in partnership with LHP Telematics, an industry leader in creating custom telematics solutions for connected vehicles in the heavy equipment OEM marketplace, transportation, service, and construction fleets. The combined solution is making MTL bus service more reliable–improving the Mean Time Between Failures (MTBF) for metro buses by a factor of five, from four thousand to twenty-one thousand miles.

Similarly, a Department of Defense customer has turned to Hortonworks Connected Data Platforms to enable analytics and preventative maintenance on their fleet of aircraft. With HDP and HDF, the customer is able to obtain predictive analytics and actionable intelligence on their platforms. In addition to reduced Total Cost of Ownership, realized results have included tangible improvements in lifecycle management, operational readiness, pilot safety, and supply management.

Enterprise Data Warehouse (EDW) Optimization

The Enterprise Data Warehouse has become a standard component of enterprise data architectures. However, the complexity and volume of data has posed some interesting challenges to the efficiency of the existing EDW solution. Realizing the transformative potential of Big Data depends on an organization’s ability to manage complexity, while leveraging various and disparate new data sources such as social, web, IoT and more. The integration of these new data sources into the existing EDW system is often costly and incredibly complex.

Hortonworks Enterprise Data Warehouse Optimization Solution is the industry’s only turnkey Hadoop-powered Business Intelligence (BI) solution. The EDW Optimization Solution is powered by Hortonworks Data Platform (HDP®) and technology from partners Syncsort and AtScale. With the EDW Optimization Solution, Public Sector users can extend the value of existing EDW investments and overcome the challenges, risks and costs of introducing new solutions into legacy infrastructure.

The solution can be implemented rapidly, makes fast BI on Hadoop a reality and reduces cost by moving non-critical workloads off the EDW and leveraging the cost-effective archiving in Hadoop.

University Health Care

Difficult challenges and choices face today’s healthcare industry. Researchers, clinicians and administrators have to make important decisions – often without sufficient data. Hortonworks Connected Data Platforms (powered by Apache Hadoop and Apache NiFi) make healthcare data available and actionable.

By partnering with Hortonworks, researchers can access genomic data for new cancer treatments, physicians can monitor patient vitals and sensor data in real time, hospitals can reduce re-admittance rates, and universities can store medical research data forever.

Prevent Fraud and Waste

Explosive data growth has made it more complex for government agencies to detect fraud, waste, and abuse while also efficiently accomplishing their missions. One federal agency with a large pool of beneficiaries turned to Apache Hadoop and the Hortonworks Data Platform to discover fraudulent claims for benefits. The implementation reduced ETL processing from 9 hours to 1 hour, which allowed them to create new data models around fraud, waste and abuse. After significantly increasing the efficiency of their ETL process, the agency used the surplus processing time and resources to triple the data included in its daily processing. Because Hadoop is a “schema on read” system, rather than the traditional “schema on load” platform, the agency now plans to search additional legacy systems and include more upstream contextual data (such as social media and online content) in its analysis. All of this will make it easier to identify and stop fraud, waste, and abuse.
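To make the “schema on read” point concrete, here is a minimal sketch (not the agency’s actual pipeline) showing how raw claim files can be loaded as-is and have a new fraud-screening rule layered on afterwards. The paths, field names and thresholds are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims-schema-on-read").getOrCreate()

# Schema on read: the raw claim files are loaded as-is and Spark infers their
# structure at read time, instead of a schema being required before load.
claims = spark.read.json("hdfs:///landing/claims/raw/*.json")   # hypothetical path
claims.printSchema()

# A new fraud model can be layered onto the same raw data later, for example
# by flagging beneficiaries with unusually heavy monthly claim activity.
monthly = (claims
           .withColumn("month", F.date_trunc("month", F.to_timestamp("claim_date")))
           .groupBy("beneficiary_id", "month")
           .agg(F.count("*").alias("claim_count"),
                F.sum("claim_amount").alias("total_amount")))

flagged = monthly.filter("claim_count > 20 OR total_amount > 50000")  # illustrative thresholds
flagged.show()
```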

Smart Cities

With the continuing trend of the connected world and its requisite big data needs come big obstacles and even bigger opportunities. City, Local, and State Governments are challenged with establishing and managing an infrastructure built for connected technologies in an ‘Internet of Anything’ environment. These connected devices (sensors, smart meters, medical devices, road telemetry devices, fleet management sensors, emergency response devices, etc.) will generate vast amounts of data that need to be processed in real time to provide valuable insights and actionable intelligence. Additionally, storage and access of this data can provide historical insights and predictive analytics.

With Hortonworks Connected Data Platforms, Public Sector organizations can build a modern Data Analytics platform that is enterprise grade, highly scalable, and multi-tenant. Using Hortonworks Data Flow (HDF), the data from the various sensors and devices can be collected, aggregated, correlated, and processed in real-time and leveraged to perform a desired task. This data is then stored in the Hortonworks Data Platform (HDP) where large volumes of data at petabyte scale can be stored and processed on commodity hardware at much lower cost than traditional systems. Additional nodes can be added with ease to a cluster as the data demand increases.

Single View of a Resource

Whether the resource is a soldier, a student, or a military aircraft, Public Sector customers are overwhelmed with data from various sources and in different formats, often stored in siloed architectures that require unique applications and/or complex translations simply to view the data. Correlation of the data in these environments is both complicated and costly. In many instances these systems have no way of communicating.

With Hortonworks Connected Data Platform, Public Sector customers can build an Analytics Data Platform that enables a Single View capability of both Data in Motion and Data at Rest. Real-time data from sensors and other sources (e.g., social media) is collected, logically correlated, and linked while in flight using Hortonworks Data Flow (HDF). Once collected and correlated, it is stored in Hortonworks Data Platform (HDP) where the unmodified data is retained indefinitely and used for future historical analysis and advanced analytics.

Single view of the resource is implemented and enabled through entity resolution. In this process, disparate pieces of data related to the resource are linked using attributes that are unique to the respective resource, such as a serial number, tail number, student ID, or social security number.
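A minimal sketch of that entity-resolution step is shown below, assuming three hypothetical silo extracts that each carry the aircraft tail number under a different column name. Normalizing the resolving attribute and joining on it is the essence of the technique; the paths and columns are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("single-view").getOrCreate()

# Hypothetical silo extracts, each identifying the aircraft differently.
maintenance = spark.read.parquet("hdfs:///silos/maintenance")   # tail_no, work_order, ...
flights     = spark.read.parquet("hdfs:///silos/flight_logs")   # tail_number, hours_flown, ...
supply      = spark.read.parquet("hdfs:///silos/supply")        # asset_id, parts_on_order, ...

# Normalize the resolving attribute, then join the silos into one record per aircraft.
single_view = (maintenance.withColumn("tail", F.upper(F.trim("tail_no")))
               .join(flights.withColumn("tail", F.upper(F.trim("tail_number"))), "tail", "outer")
               .join(supply.withColumn("tail", F.upper(F.trim("asset_id"))), "tail", "outer"))

single_view.write.mode("overwrite").parquet("hdfs:///curated/aircraft_single_view")
```

Real entity resolution often also needs fuzzy matching when no clean shared key exists, but the exact-key join above captures the basic idea.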

Domain where Hadoop can be used: PHARMACEUTICAL

Merck Optimizes Vaccine Yields: Striving for the “Golden Batch”

Merck optimized its vaccine yields by analyzing manufacturing data to isolate the most important predictive variables for a “golden batch”. Merck’s leaders had long relied on Lean manufacturing to grow volumes and reduce costs, but it became increasingly difficult to discover incremental ways to enhance yields. They looked into Open Enterprise Hadoop for new insights that could further reduce costs and improve yields. Merck turned to Hortonworks for data discovery into records on 255 batches of one vaccine going back 10 years.

That data had been distributed across 16 maintenance and building management systems and it included precise sensor data on calibration settings, air pressure, temperature, and humidity. After pooling all the data into Hortonworks Data Platform and processing 15 billion calculations, Merck had new answers to questions it had been asking for a decade. Among hundreds of variables, the Merck team was able to spot those that optimized yields. The company proceeded to apply those lessons to their other vaccines, with a focus on providing quality drugs at the lowest possible price. Watch Doug Henschen’s InformationWeek interview with George Llado of Merck.

Minimizing Waste Across the Drug Manufacturing Process

One Hortonworks pharmaceutical customer uses HDP for a single view of its supply chain and its self-declared “War on Waste”. The operations team added up the ingredients going into making their drugs and compared that with the physical product they shipped. They found a big gap between the two and launched their War on Waste, using HDP big data analytics to identify where those valuable resources were going. Once the team identifies those root causes of waste, real-time alerts in HDP notify them when they are at risk of exceeding pre-determined thresholds.

Translational Research: Turning Scientific Studies Into Personalized Medicine

The goal of translational research is to apply the results of laboratory research towards improving human health. Hadoop empowers researchers, clinicians, and analysts to unlock insights from translational data to drive evidence-based medicine programs. The data sources for translational research are complex and typically locked in data siloes, making it difficult for scientists to obtain an integrated, holistic view of their data. Other challenges revolve around data latency (the delay in getting data loaded into traditional data stores), handling unstructured and semi-structured data types, and the lack of collaborative analysis between translational and clinical development groups.

Researchers are turning to Open Enterprise Hadoop as a cost-effective, reliable platform for managing big data in clinical trials and performing advanced analytics on integrated translational data. HDP allows translational and clinical groups to combine key data from sources such as omics (genomics, proteomics, transcription profiling, etc.), preclinical data, electronic lab notebooks, clinical data warehouses, tissue imaging data, medical devices and sensors, file sources (such as Excel and SAS), and medical literature.

Through Hadoop, analysts can build a holistic view that helps them understand biological response and molecular mechanisms for compounds or drugs. They are also able to uncover biomarkers for use in R&D and clinical trials. Finally, they can be assured that all data will be stored forever, in its native format, for analysis with multiple future applications.

Next Generation Sequencing

Traditional IT systems cannot economically store and process next-generation sequencing (NGS) data. For example, primary sequencing results are in large image formats and are too costly to store over the long term. Point solutions have lacked the flexibility to keep up with changing analytical methodologies, and are often expensive to customize and maintain.

Open Enterprise Hadoop overcomes those challenges by helping data scientists and researchers unlock insights from NGS data while preserving the raw results on a reliable, cost-effective platform. NGS scientists are discovering the benefits of large-scale processing and analysis delivered by HDP components such as Apache Spark. Pharmaceutical researchers are using Hadoop to easily ingest diverse data types from external sources of genetic data, such as TCGA, GenBank, and EMBL. Another clear advantage of HDP for NGS is that researchers have access to cutting-edge bioinformatics tools built specifically for Hadoop, which enable analysis of various NGS data formats, sorting of reads, and merging of results. This takes NGS to the next level through batch processing of large NGS data sets, integration of internal data with publicly available external sequence data, permanent storage of large image files in their native format, and substantial cost savings on data processing and storage.

HDP Uses Real-World Data to Deliver Real-World Evidence

Real-World Evidence (RWE) promises to quantify improvements to health outcomes and treatments, but this data must be available at scale. High data storage and processing costs, challenges with merging structured and unstructured data, and an over-reliance on informatics resources for analysis-ready data have all slowed the evolution of RWE. With Hadoop, RWE groups are combining key data sources, including claims, prescriptions, electronic medical records, HIE, and social media, to obtain a full view of RWE. With big data analytics in the pharmaceutical industry, analysts are unlocking real insights and delivering advanced insights via cost-effective and familiar tools such as SAS®, R®, TIBCO™ Spotfire®, or Tableau®.

RWE through Hadoop delivers value with optimal health resource utilization across different patient cohorts, a holistic view of cost/quality tradeoffs, analysis of treatment pathways, competitive pricing studies, concomitant medication analysis, clinical trial targeting based on geographic & demographic prevalence of disease, prioritization of pipelined drug candidates, metrics for performance-based pricing contracts, drug adherence studies, and permanent data storage for compliance audits.

Perpetual Access to Raw Data from Prior Research


Domain where Hadoop can be used: OIL AND GAS


Accelerate Innovation with Well Log Analytics (aka LAS Analytics)

Large, complex datasets and rigid data models limit the pace of innovation for exploration and production, because they force petrophysicists and geoscientists to work with siloed data that requires a manual quality control (QC) process.

LAS log analytics with HDP big data analytics for oil and gas allows scientists to ingest and query their disparate LAS data for use in predictive models. They can do this while leveraging existing statistical tools such as SAS or R to build new models and then rapidly iterate them with billions of measurements. Combining LAS data with production, lease, and treatment data can increase production and margins. Dynamic well logs normalize and merge hundreds or thousands of LAS files, providing a single view of well log curves presented as new LAS files or images. With HDP, those consolidated logs also include much of the sensor data that used to be “out of normal range” because of anomalous readings from power spikes, calibration errors, and other exceptions: an automated QC process can ingest all the data (good and bad), then scrub it to eliminate the anomalous readings and present a clear, single view of the data.

Define Operational Set Points for Each Well & Receive Alerts on Deviations

After identifying the ideal operating parameters (e.g. pump rates or fluid temperatures) that produce oil and gas at the highest margins, that information can go into a set point playbook. Maintaining the best set points for a well in real-time is a job for Apache Storm’s fault-tolerant, real-time oil and gas predictive analytics and alerts. Storm running in Hadoop can monitor variables like pump pressures, RPMs, flow rates, and temperatures, and then take corrective action if any of these set points deviate from pre-determined ranges. This data-rich framework helps the well operator save money and adjust operations as conditions change.
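The paragraph above names Apache Storm for this monitoring. The sketch below expresses the same set-point deviation check with Spark Structured Streaming instead, since both run on HDP; the Kafka topic, field names and set-point ranges are assumptions for illustration only.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("setpoint-alerts").getOrCreate()

schema = StructType([
    StructField("well_id", StringType()),
    StructField("ts", TimestampType()),
    StructField("pump_pressure_psi", DoubleType()),
    StructField("flow_rate_bpd", DoubleType()),
])

# Sensor readings arriving on a Kafka topic (topic name is hypothetical).
readings = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "well-telemetry")
            .load()
            .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
            .select("r.*"))

# Flag any reading that falls outside the playbook's set-point ranges.
alerts = readings.filter(
    (F.col("pump_pressure_psi") < 1800) | (F.col("pump_pressure_psi") > 2400) |
    (F.col("flow_rate_bpd") < 500))

query = (alerts.writeStream.outputMode("append")
         .format("console")        # in production this would feed an alerting sink
         .start())
query.awaitTermination()
```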

Optimize Lease Bidding with Reliable Yield Predictions

Oil and gas companies bid for multi-year leases for exploration and drilling rights on federal or private land. The price paid for the lease is the known present cost of access to a future, unpredictable stream of hydrocarbons. A bidder can outbid its competitors by reducing the uncertainty around that future benefit and more accurately predicting the well’s yield. Apache Hadoop can provide this competitive edge by efficiently storing image files, sensor data and seismic measurements. This adds missing context to any third-party survey of the tract open for bidding. A company that possesses that unique information, combined with predictive analytics, can pass on a lease that it might otherwise have pursued, or find “diamonds in the rough” and lease those at a discount.

Repair Equipment Preventatively with Targeted Maintenance

Traditionally, operators gathered data on the status of pumps and wells through physical inspections (often in remote locations). This meant that inspection data was sparse and difficult to access, particularly considering the high value of the equipment in question and the potential health and safety impacts of accidents. Now, oil and gas IoT sensor data can stream into Hadoop from pumps, wells and other equipment much more frequently—and at lower cost—than collecting the same data manually. This helps guide skilled workers to do what sensors cannot: repair or replace machines. The machine data can be enriched with other data streams on weather, seismic activity or social media sentiment, to paint a more complete picture of what’s happening in the field. Algorithms then parse that large, multifaceted data set in Hadoop to discover subtle patterns and compare expected with actual outcomes. Did a piece of equipment fail sooner than expected, and if so, what similar gear might be at risk of doing the same? Data-driven, preventative upkeep keeps equipment running with less risk of accident and lower maintenance costs.

Slow Decline Curves with Production Parameter Optimization

Oil companies need to manage the decline in production from their existing wells, since new discoveries are harder and harder to come by. Decline Curve Analysis (DCA) uses past production from a well to estimate future output. However, historic data usually shows constant production rates, whereas a well’s decline towards the end of its life follows a non-linear pattern—it usually declines more quickly as it depletes. When it comes to a well near the end of its life, past is not prologue. Production parameter optimization is intelligent management of the parameters that maximize a well’s useful life, such as pressures, flow rates, and thermal characteristics of injected fluid mixtures. Machine learning algorithms can analyze massive volumes of sensor data from multiple wells to determine the best combination of these controllable parameters. HDP’s powerful capabilities for data discovery and subsequent big data analytics for oil and gas analysis can help the well’s owner or lessee make the most of that resource.
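For reference, decline curve analysis is commonly based on the Arps hyperbolic relation q(t) = qi / (1 + b·Di·t)^(1/b). The short sketch below fits those parameters to synthetic monthly production data and forecasts future output; it illustrates the DCA technique in general, not any specific HDP workflow, and the numbers are made up.

```python
import numpy as np
from scipy.optimize import curve_fit

# Arps hyperbolic decline: q(t) = qi / (1 + b * Di * t) ** (1 / b)
def arps(t, qi, di, b):
    return qi / np.power(1.0 + b * di * t, 1.0 / b)

# Synthetic 36 months of production history (barrels/day averages) with noise.
t = np.arange(0, 36)
rng = np.random.default_rng(0)
q_obs = arps(t, qi=950.0, di=0.08, b=0.9) * rng.normal(1.0, 0.03, size=t.size)

# Fit the decline parameters, then forecast the next five years of output.
params, _ = curve_fit(arps, t, q_obs,
                      p0=[900.0, 0.1, 0.5],
                      bounds=([1.0, 1e-4, 0.01], [5000.0, 1.0, 2.0]))
qi_hat, di_hat, b_hat = params
forecast = arps(np.arange(36, 96), qi_hat, di_hat, b_hat)
print("fitted:", qi_hat, di_hat, b_hat)
print("forecast (first 5 months):", forecast[:5])
```

Parameter optimization as described above goes a step further: rather than only fitting the curve, it searches for the controllable operating parameters that flatten it.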

Domain where Hadoop can be used: ENERGY


Smart Meter Data Analytics Improves Grid Reliability

Modern utility companies need to capture, transmit, analyze and store smart meter data in order to meet their business objectives associated with installing Advanced Metering Infrastructures. But their data architectures were designed during a simpler time when a monthly measurement was the norm. Smart meters collect information multiple times every hour, creating a vast, constant, and rich stream of data that utility companies are ill-equipped to process and store efficiently on legacy database platforms.

With 100% open-source Connected Data Platforms from Hortonworks, utilities enhance their grid visibility by orders of magnitude. Hortonworks’ energy big data management solution helps them monitor their data-in-motion from operated assets in real-time and compare that to deep historical analysis on past trends. That data discovery powers actionable intelligence for remote operations support, and also delivers real-time insights to: increase grid reliability, balance loads, reduce outages, and detect fraud.

A Single View of Assets to Optimize Grid Operations

Historically, power and utility companies’ operational technology (OT) and information technology (IT) systems have been developed, maintained, and used by siloed personnel within the organization. This reality has hindered cross-company collaboration and data visibility across business units, resulting in higher operating and energy costs, prolonged outages, inefficient operations, and poor customer service.

From the beginning, Apache™ Hadoop® was architected to combine data from many different sources with highly variable formats. Apache NiFi identifies all of those sources and moves them to a central location for storage and analysis (both in real-time and batch). Now Hortonworks offers both of those technologies in an integrated set of solutions for utilities data management and analytics. Connected Data Platforms integrate data from siloed operational, IT, and external systems, enabling OT/IT convergence to create new dashboards for a single view of assets. This cross-company visibility reduces downtime and optimizes grid operations, potentially saving millions of dollars.

Predictive Equipment Maintenance to Prevent Blackouts

Traditionally, operators gathered data on the health of generation, transmission, distribution and metering equipment through physical inspections. This meant that inspection data was sparse and difficult to access, particularly considering the high value of the hardware in question and the potential health, safety, and convenience impacts of equipment malfunctions.

Predictive maintenance helps utilities determine the condition of in-service equipment and then predict when maintenance should be performed. Rather than sending a maintenance truck based on the time of year, utilities now send them based on the actual need for repair. Hortonworks enables operators to achieve energy efficiency from Internet of Things data by reducing failures and lowering the costs associated with routine or time-based preventive maintenance.

Single View of the Household for World-Class Customer Service

Utility companies built legacy data systems with a one-to-one relationship between the end application and storage platform. For example, the billing team manages a payment system with its database, the customer care team stores call logs in a CRM system, and the field service team stores data on service trucks and work orders.

Hortonworks’ data management and analytics solutions for utilities have helped some of the largest companies create a single view of their data and uncover value that might have been within reach but was scattered across multiple interactions, channels, groups and platforms. With that single view, they create customer personas, rank them by usage, optimize service calls, reduce churn and adjust target marketing for offers on value-added services like budget billing.

Energy Trading Intelligence, One Step Ahead of the Markets

Wholesale energy market participants face similar data challenges as utility operators, with an even more diverse set and volume of data sources that need to be collected, processed, stored, integrated and analyzed in order to reduce risk. Sources include sensor data from operated assets, market and exchange data, ERP data, data from trade and risk management platforms, and other internal and external sources.

Real-time trading solutions powered by Hortonworks enable energy traders to react instantaneously to market opportunities, without exposing their organizations to undue legal or financial risk. For example, one customer leverages our Connected Data Platforms to ingest, process, and analyze real-time electricity market data from a commodity exchange data service to enrich data in their existing trading platforms. This improves forecasting, allows them to identify market irregularities, and helps detect fraudulent trading practices.

Domain where Hadoop can be used: MANUFACTURING


Assure Just-In-Time Delivery of Raw Materials

Manufacturers want to minimize the inventory that they keep on hand and prefer just-in-time delivery of raw materials. On the other hand, stock-outs can cause harmful production delays. Sensors, RFID tags and IoT in manufacturing reduce the cost of capturing supply chain data, but this creates a large, ongoing flow of data. Hadoop can store this unstructured data at a relatively low cost. That means manufacturers have more visibility into the history of their supply chains and can see large patterns that might be invisible in only a few months of data. This intelligence gives manufacturers greater lead time to adjust to supply chain disruptions. It also allows the connected factory to reduce supply chain costs and improve margins on the finished product.

Control Quality with Real-Time & Historical Assembly Line Data

High-tech manufacturers use sensors to capture data at critical steps in the manufacturing process. This data is useful at the time of manufacture, to detect problems while they are occurring. However, some subtle problems—the “unknown unknowns”—may not be detected at time of manufacture, yet may lead to higher rates of malfunction after the product is purchased. When a product is returned with problems, the manufacturer can run forensic tests on the product and combine the forensic data with the original sensor data from when the product was manufactured. This big data in manufacturing adds visibility across a large number of products, helping the manufacturer improve processes and products to levels not possible in a data-scarce environment.

Avoid Stoppages with Proactive Equipment Maintenance

Today’s manufacturing workflows involve sophisticated machines coordinated across pre-defined, precise steps. One machine malfunction can stop the production line. Premature maintenance also has a cost; there is an optimal schedule for maintenance and repairs: not too early, not too late. Machine learning algorithms can compare maintenance events and machine data for each piece of equipment to its history of malfunctions. These algorithms can derive optimal maintenance schedules based on real-time information and historical data. The use of manufacturing predictive analytics can help maximize equipment utilization, minimize P&E expense, and avoid surprise work stoppages.

Increase Yields in Drug Manufacturing

Biopharma manufacturing requires careful monitoring and control of environmental conditions. The goal of any production run is to maximize First Time Yield (FTY), a measure of the number of products that are made correctly the first time they come through the production line. Every percentage-point increase in FTY represents a significant reduction in the costs of production. FTY improvements are often blocked by poor visibility into operations. Sensors can provide raw data for improving that visibility, if the sensor data can be integrated with other existing data stores. A Hadoop data lake makes this integration easier, because Hadoop does not require a schema to be defined prior to ingest. Also, Hadoop’s lower cost of storage means that a cluster can store more data, in more formats, for longer, making it possible to discover new relationships in the data. Read about how Merck Research Laboratories optimized pharmaceutical manufacturing with Hortonworks Data Platform.

Crowdsource Quality Assurance

Thoroughly tested products still have post-sale problems. Customers may not report problems to the manufacturer, but still complain about the product to their friends and family on social media. This social stream of data on product issues can augment product feedback from traditional support channels. Hadoop stores huge volumes of social media sentiment data. Manufacturers can mine this data for early signals on how a product holds up throughout its lifecycle. This ability to learn about issues quickly and take early action to protect a product’s reputation is powerful for winning and maintaining customer loyalty.

Domain where Hadoop can be used: INSURANCE

Build a 360° View of the Customer

Carriers interact with customers across multiple channels, yet customer interaction, policy and claims data is often isolated in data silos. Few insurance carriers can accurately correlate acquisition, cross-sell or upsell success with either their marketing campaigns or customer online browsing behavior.

By collecting and managing data from insurance IoT devices, Apache Hadoop gives the insurance enterprise a 360° view of customer behavior. It lets carriers store data longer and identify distinct phases in their customers’ lifecycles. Better insurance predictive analytics helps them more efficiently acquire, grow and retain the best customers.

Boost Agent Productivity with a Unified Agent Portal

Many carriers sell policies through agents. To prepare for sales calls (or to answer questions from prospects during those calls) those agents may need to look up details across multiple, disjointed platforms or applications. This takes time and decreases sales velocity. Unlike legacy data platforms, HDP stores data from many sources, including insurance IoT, in a “data lake”. This permits a single lookup, without requiring multiple individual queries across different unrelated storage platforms. Agents prepare themselves more thoroughly, and they can make more calls over a given time period, helping grow revenue. Insurance companies can also use the same type of single view to understand which agents are most productive selling their products—offering incentives that promote top performers or de-certifying the chronically unproductive.

Create a High-Speed Cache for Processing Application Documents

Once customers agree to buy a new policy, the agent and/or underwriter still needs to process the application documents. This can be a lengthy manual process that causes leakage. Speed is important, but so is accuracy. One Hortonworks subscriber in the insurance industry built an enterprise document cache on HDP. Apache HBase caches the post-transaction documentation, with meta-tags that speed up processing. And because HDP’s YARN-based architecture supports multi-tenant processing on the same data set, document tracking does not slow down risk assessment or other analytics required before initiating coverage. Efficient document processing reduces costs and improves agent and underwriter productivity.
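As a rough sketch of how such an HBase document cache might look from client code, here is an example using happybase, a common Python client for HBase that talks to the HBase Thrift gateway. The table name, column families and meta-tags are assumptions, not the subscriber’s actual design.

```python
import happybase  # Python HBase client; requires the HBase Thrift gateway to be running

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("policy_documents")   # hypothetical table with 'doc' and 'meta' families

# Cache a post-transaction document along with meta-tags in a separate column family.
doc_id = "app-2018-0415-000123"
table.put(doc_id, {
    b"doc:pdf": open("application.pdf", "rb").read(),
    b"meta:policy_type": b"term_life",
    b"meta:status": b"pending_underwriting",
    b"meta:agent_id": b"A-5521",
})

# Later lookups fetch only the lightweight meta-tags, so routing and tracking
# stay fast while heavier risk analytics run against the same cluster.
row = table.row(doc_id, columns=[b"meta:status", b"meta:agent_id"])
print(row)
```

Keeping meta-tags in their own column family means status checks never have to read the large document blobs.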

Detect Fraud

Insurance fraud is a major challenge in the industry. According to the FBI, “The total cost of insurance fraud (non-health insurance) is estimated to be more than $40 billion per year. That means Insurance Fraud costs the average U.S. family between $400 and $700 per year in the form of increased premiums.” Because there are more than 7,000 insurance companies that collect more than $1 trillion in premiums every year, criminals have a large, lucrative target. They can easily hide their tracks as they perpetrate schemes like premium diversion, fee churning, asset diversion or workers’ compensation fraud. One of the largest insurers in the United States uses HDP for machine-learning and predictive modeling that employs rules-based flags on streaming data to catch more fraudulent or invalid claims. As claims data flows into the system, real-time alerts help special investigation and claims analysts prioritize their investigations of claims with the highest likelihood of fraud.

Launch Risk-Reduction Services

Insurance companies understand risk and—as in other industries—they are moving from reactive to proactive uses of their data. Any claims adjuster has seen accidents, fires or injuries that could’ve been foreseen and maybe prevented, drawing conclusions like: “He shouldn’t have been out driving in that weather,” or “Those wires were long past their replacement age.” Now with insurance predictive analytics, insurers are capturing and sharing that insight with their customers before the losses occur. With these risk-reduction and prevention services, carriers share real-time analytics with policyholders, so they can prevent mishaps. For example, they can establish algorithms to identify emerging high-risk phenomena having to do with foul weather, disease epidemics, or equipment recalls—and provide timely alerts that help their customers protect themselves and their property. One Hortonworks customer that offers car insurance is working on real-time alerts that will notify drivers when a strong storm will affect a particular stretch of road and then also suggest less-risky alternate routes.

Price Risk with Empirical Sensor Data

Moral hazard describes the phenomenon of one person taking more risk because someone else bears the burden of that risk. When a company offers an auto insurance policy, it faces moral hazard because of information asymmetry—policyholders know more about how they actually drive than the carrier does. Drivers may drive a bit faster or watch the road a little less closely because they know that they are covered in the event of a collision. Carriers set prices to cover that moral hazard, and so the safer drivers end up subsidizing those who take more risks on the road. Usage-based insurance (UBI) has the potential to reduce information asymmetry and moral hazard by rewarding safe drivers for their good behavior.

A major insurer runs its UBI products with insurance IoT and telematic sensor data stored in HDP. Prior non-Hadoop processing captured only a subset of UBI data streaming from sensors in policyholders’ cars, and extract-transform-load (ETL) processes delayed availability of that data until the week after capture. With HDP, the company captures and stores all driving data from customers that opt in to UBI, processes the larger dataset in half the time, and uses predictive modeling to reward those drivers for how they actually drive rather than guessing how they might drive based only on their age, type of car, location and prior history.
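As an illustration of how telematics data stored in HDP might be rolled up into a per-driver profile, here is a small PySpark sketch. The field names and scoring weights are hypothetical and are not the insurer’s actual model.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ubi-scoring").getOrCreate()

# Telematics events landed in HDP; path and column names are assumptions.
# Expected columns: driver_id, ts, speed_mph, posted_limit_mph, hard_brake (0/1), night_trip (0/1)
events = spark.read.parquet("hdfs:///telematics/driving_events")

profile = (events.groupBy("driver_id").agg(
    F.avg(F.when(F.col("speed_mph") > F.col("posted_limit_mph") + 10, 1.0).otherwise(0.0))
        .alias("pct_speeding"),
    F.avg("hard_brake").alias("hard_brake_rate"),
    F.avg("night_trip").alias("night_trip_share"),
    F.count("*").alias("events")))

# A simple illustrative score; a production model would be actuarially calibrated.
scored = profile.withColumn(
    "ubi_score",
    100 - 40 * F.col("pct_speeding") - 30 * F.col("hard_brake_rate") - 10 * F.col("night_trip_share"))
scored.orderBy("ubi_score").show()
```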

Domain where Hadoop can be used: TELECOM


Analyze call detail records (CDRs)


Telcos perform forensics on dropped calls and poor sound quality, but call detail records flow in at a rate of millions per second. This high volume makes pattern recognition and root cause analysis difficult, and often those need to happen in real-time, with a customer waiting for answers. Delay causes attrition and harms servicing margins.
Hortonworks DataFlow (HDF™) can ingest millions of CDRs per second into Hortonworks Data Platform, where Apache™ Storm or Apache Spark™ can process them in real-time to identify troubling patterns. HDP facilitates long-term data retention for root cause analysis, even years after the first issue. This CDR analysis can be used to continuously improve call quality, customer satisfaction and servicing margins.
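A hedged sketch of that kind of real-time CDR pattern detection follows, computing a dropped-call ratio per cell over sliding windows with Spark Structured Streaming (one of the engines named above). The Kafka topic, record fields and alert threshold are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, IntegerType

spark = SparkSession.builder.appName("cdr-quality").getOrCreate()

cdr_schema = StructType([
    StructField("cell_id", StringType()),
    StructField("call_start", TimestampType()),
    StructField("duration_sec", IntegerType()),
    StructField("disposition", StringType()),      # e.g. "completed" or "dropped"
])

cdrs = (spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "cdr-topic")           # hypothetical topic fed by HDF
        .load()
        .select(F.from_json(F.col("value").cast("string"), cdr_schema).alias("c"))
        .select("c.*"))

# Dropped-call ratio per cell tower over sliding 5-minute windows.
quality = (cdrs
           .withWatermark("call_start", "10 minutes")
           .groupBy(F.window("call_start", "5 minutes", "1 minute"), "cell_id")
           .agg(F.avg(F.when(F.col("disposition") == "dropped", 1.0).otherwise(0.0))
                .alias("dropped_ratio")))

(quality.filter("dropped_ratio > 0.02")             # illustrative alert threshold
 .writeStream.outputMode("update").format("console").start()
 .awaitTermination())
```

The same raw CDRs, retained in HDP, remain available for batch root cause analysis long after the alert fires.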

Service equipment proactively


Transmission towers and their related connections form the spinal cord of a telecommunications network. Failure of a transmission tower can cause service degradation. Replacement of equipment is usually more expensive than repair. There exists an optimal schedule for maintenance: not too early, nor too late.
HDP stores unstructured, streaming, sensor data from the network. Telcos can derive optimal maintenance schedules by comparing real-time information with historical data. Machine learning algorithms can reduce both maintenance costs and service disruptions by fixing equipment before it breaks.

Rationalize infrastructure investments


Telecom marketing and capacity planning are correlated. Consumption of bandwidth and services can be out of sync with plans for new towers and transmission lines. This mismatch between infrastructure investments and the actual return on investment puts revenue at risk.
Network log data helps telcos understand service consumption in a particular state, county or neighborhood. They can then analyze network loads more intelligently (with data stretching over longer periods of time) and plan infrastructure investments with more precision and confidence.

Recommend next product to buy (NPTB)


Telecom product portfolios are complex. Many cross-sell opportunities exist for the installed customer base, and sales associates use in-person or phone conversations to guess about NPTB recommendations, with little data to support their recommendations.
HDP gives a telco the ability to make confident NPTB recommendations, based on data from all of its customers. Confident NPTB recommendations empower sales associates (or self service) and improve customer interactions. An Apache Hadoop® data lake reduces sales friction and creates NPTB competitive advantage similar to Amazon’s advantage in eCommerce.

Allocate bandwidth in real time


Certain applications hog bandwidth and can reduce service quality for others accessing the network. Network administrators cannot foresee the launch of new hyper-popular apps that cause spikes in bandwidth consumption and then slow performance. Operators must respond to bandwidth spikes quickly, to reallocate resources and maintain SLAs.
Streaming data through HDF into HDP for real-time analysis can help network operators visualize spikes in call center data and nimbly throttle bandwidth. Text-based sentiment analysis on call center notes can also help understand how these spikes impact customer experience. This insight helps maintain service quality and customer satisfaction, and also informs strategic planning to build smarter networks.

Develop new products


Mobile devices produce huge amounts of data about how, why, when and where they are used. This data is extremely valuable for product managers, but its volume and variety make it difficult to ingest, store and analyze at scale. Not all data is stored for conversion into business insight. Even the data that is stored may not be retained for its entire useful life.
Apache Hadoop can put rich product-use data in the hands of product managers, which speeds product innovation. It can capture product insight specific to local geographies and customer segments. Immediate big data feedback on product launches allows PMs to rescue failures and maximize blockbusters.

Domain where Hadoop can be used: HEALTHCARE

Access genomic data for new cancer treatments

If we read that a given drug is “40% effective in treating cancer,” another interpretation could be that the drug is 100% effective for patients with a certain genetic profile. However, genomic data is Big Data. The data in a single human genome includes approximately 20,000 genes. Stored in traditional data platforms, this is the equivalent of several hundred gigabytes. Combining each genome with one million variable DNA locations produces the equivalent of about 20 billion rows of data per person.

Researchers at major universities and teaching hospitals are performing big data analytics in genomics with Hortonworks Data Platform as the cost-effective, reliable platform for storing genomic data and combining that with other data on demographics, trial outcomes, and real-time patient responses. They are adopting Hortonworks DataFlow to stream that data into HDP for real-time decisions and long-term cohort analyses. Connected Data Platforms help those doctors learn which drugs and treatments work best for groups of patients across the genetic spectrum.

Monitor patient vitals in real time

In a typical hospital setting, nurses do rounds and manually monitor patient vital signs. They may visit each bed every few hours to measure and record vital signs but the patient’s condition may decline between the time of scheduled visits. This means that caregivers often respond to problems reactively, in situations where arriving earlier may have made a huge difference in the patient’s wellbeing.

New wireless sensors can capture and transmit patient vitals far more frequently than human beings can visit the bedside, and these measurements can stream into a Hadoop cluster. Caregivers can use these signals for real-time alerts to respond more promptly to unexpected changes. HDP uses this data accumulated over time for healthcare predictive analytics, feeding algorithms that proactively help predict the likelihood of an emergency even before it could be detected with a bedside visit.

Reduce cardiac re-admittance rates

Patients with heart disease can be closely monitored while they are in a hospital, but when those patients go home, they may skip their medications or ignore dietary and self-care instructions given by their doctor when they left the hospital.

Congestive heart failure causes fluid retention, which leads to weight gain. In one innovative program at UC Irvine Health, patients could return home with a wireless scale and weigh themselves at regular intervals. Algorithms running in Hortonworks’ healthcare predictive analytics determined unsafe weight gain thresholds and alerted a physician to see the patient proactively, before an emergency re-admittance was necessary.
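The underlying threshold logic can be quite simple. The sketch below, in plain Python with pandas, flags gains of roughly 3 lb in a day or 5 lb in a week; the UC Irvine Health program’s actual thresholds are not stated here, so these numbers and the readings are illustrative.

```python
import pandas as pd

# Hypothetical daily readings from one patient's wireless scale.
weights = pd.DataFrame({
    "date": pd.date_range("2018-04-01", periods=7, freq="D"),
    "weight_lb": [182.0, 182.4, 183.1, 184.0, 185.2, 186.1, 187.3],
})

# Illustrative heart-failure thresholds: ~3 lb gained in a day or ~5 lb in a week.
weights["gain_1d"] = weights["weight_lb"].diff()
weights["gain_7d"] = weights["weight_lb"] - weights["weight_lb"].shift(6)

alert = (weights["gain_1d"] > 3) | (weights["gain_7d"] > 5)
if alert.any():
    print("Notify physician for dates:", weights.loc[alert, "date"].dt.date.tolist())
```

In production, this check would run continuously against readings streaming into the cluster rather than against a small local frame.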

Machine learning to screen for autism with in-home testing

Autism spectrum disorders affect 1 in 100 children at an annual cost estimated at more than $100 billion. The condition can be detected through behavior at eighteen months, but more than 1 in 4 cases are still undiagnosed at 8 years of age. A small number of clinical testing facilities are oversubscribed, with long wait lists. The most common diagnostic test typically takes 2.5 hours to administer and score.

Dr. Dennis Wall is Director of the Computational Biology Initiative at the Harvard Medical School. In this presentation, he describes a process his team developed for low-cost, mobile screening for autism. It takes less than five minutes and relies on the ability to store large volumes of semi-structured data from brief in-home tests administered and submitted by parents. Wall’s lab also used Facebook to capture user-reported information on autism.

Artificial intelligence running on those huge data sets helps maximize efficiency of diagnosis without loss of accuracy. This approach, in combination with data storage on a Hadoop cluster, can be used for other innovative machine learning diagnostic processes.

Domain where Hadoop can be used: FINANCIAL SERVICES

Screen New Account Applications for Risk of Default


Every day, large retail banks take thousands of applications for new checking and savings accounts. Bankers who accept these applications consult third-party risk scoring services before opening an account. They can (and do) override do-not-open recommendations for applicants with poor banking histories. Many of these high-risk accounts overdraw and charge-off due to mismanagement or fraud, costing banks millions of dollars in losses. Some of this cost is passed on to the customers who responsibly manage their accounts.

Hortonworks Data Platform can store and analyze multiple data streams and help regional bank managers apply predictive analytics to control new financial account risks in their branches. They can match banker decisions with the risk information presented at the time of decision, to control risk by sanctioning individuals, updating policies, and identifying patterns of fraud. Over time, the accumulated data informs algorithms that may detect subtle, high-risk behavior patterns unseen by the bank’s risk analysts.
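One plausible way to build such a model on HDP is with Spark MLlib. The sketch below trains a logistic regression on historical application outcomes; the table path, feature columns and label (all assumed to be numeric) are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("new-account-risk").getOrCreate()

# Hypothetical curated table of past applications with banker overrides and outcomes.
# Assumed numeric columns: risk_score, prior_chargeoffs, overdraft_count,
# override_flag (0/1) and the label charged_off (0/1).
apps = spark.read.parquet("hdfs:///curated/new_account_applications")

features = VectorAssembler(
    inputCols=["risk_score", "prior_chargeoffs", "overdraft_count", "override_flag"],
    outputCol="features").transform(apps)

train, test = features.randomSplit([0.8, 0.2], seed=42)
model = LogisticRegression(featuresCol="features", labelCol="charged_off").fit(train)

# Score held-out applications; high probabilities warrant a second review
# before a do-not-open override is allowed.
scored = model.transform(test).select("risk_score", "override_flag", "probability", "prediction")
scored.show(10, truncate=False)
```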

Monetize Anonymous Banking Data in Secondary Markets


Banks possess massive amounts of operational, transactional and balance data that holds information about macro-economic trends. This information can be valuable for investors and policy-makers outside of the banks, but regulations and internal policies require that these uses strictly protect the anonymity of bank customers.

Retail banks have turned to Hortonworks Data Platform as a common cross-company data lake for data from different LOBs: mortgage, consumer banking, personal credit, wholesale and treasury banking. Both internal managers and consumers in the secondary market derive value from the data. A single point of data management allows the bank to operationalize security and privacy measures such as de-identification, masking, encryption, and user authentication.

Maintain Sub-Second SLAs with a Hadoop “Ticker Plant”


Ticker plants collect and process massive data streams on stock trades, displaying prices for traders and feeding computerized trading systems fast enough to capture opportunities in seconds. Applying predictive analytics to the financial markets is useful for making real-time decisions, and years of historical market data can also be stored for long-term analysis of market trends.

One Hortonworks customer re-architected its ticker plant with HDP as its cornerstone. Before Hadoop, the ticker plant was unable to hold more than ten years of trading data. Now every day gigabytes of data flow in from thousands of server log feeds. This data is queried more than thirty thousand times per second, and Apache HBase enables super-fast queries that meet the client’s SLA targets. All of this comes with a retention horizon extended beyond ten years.

Analyze Trading Logs to Detect Money Laundering


Another Hortonworks customer that provides investment services processes fifteen million transactions and three hundred thousand trades every day. Because of storage limitations, the company used to archive historical trading data, which limited that data’s availability. In the near term, each day’s trading data was not available for risk analysis until after close of business. This created a window of time with unacceptable risk exposure to money laundering or rogue trading.

Now Hortonworks Data Platform supports their AML software and accelerates the firm’s speed-to-analytics and also extends its data retention timeline. A shared data repository across multiple LOBs provides more visibility into all trading activities. The trading risk group accesses this shared data lake to process more position, execution and balance data. They can do this analysis on data from the current workday, and it is highly available for at least five years—much longer than before.

Domain where Hadoop can be used: ADVERTISING


Mine POS data to identify high-value shoppers


One marketing analytics company specializes in gathering insight at the checkout counter, across many grocers and drug stores. They mine this sales information for basket analysis, price sensitivity, and demand forecasts.
Interactive query with the Stinger Initiative and Apache Hive running on YARN in Hadoop helped the company rapidly process terabytes of data to keep pace with a market that changes by the day. Manufacturers, retailers, and ad agencies use the combined analysis to position their brands or improve their retail experiences, particularly for high-value customers.
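Basket analysis at this scale is often done with frequent-pattern mining. The sketch below uses Spark MLlib’s FPGrowth on a tiny in-line dataset to show the shape of the computation; a real workload would read the POS extracts from Hive or HDFS instead, and the items and thresholds here are made up.

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("basket-analysis").getOrCreate()

# Hypothetical POS extract: one row per checkout basket with its list of item SKUs.
baskets = spark.createDataFrame(
    [(1, ["milk", "bread", "diapers"]),
     (2, ["milk", "bread", "butter"]),
     (3, ["diapers", "beer"]),
     (4, ["milk", "diapers", "bread", "beer"])],
    ["basket_id", "items"])

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(baskets)

model.freqItemsets.show()        # frequently co-purchased item sets
model.associationRules.show()    # rules such as bread -> milk, with confidence and lift
```

The resulting association rules feed directly into the kind of brand-positioning and shelf-placement analysis described above.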

Target ads to customers in specific cultural or linguistic segments


Hortonworks customer Luminar is the leading big data analytics and modeling provider uniquely focused on delivering actionable advertising insights on U.S. Latino consumers. Luminar wanted to move beyond sample data on Latino consumers living in the United States, towards empirical analysis of actual data on all US Latinos. Rather than store only some transactions from one or two sources, they wanted to acquire and save as many transactions as possible from as many different sources as possible.
Now Hadoop interacts easily with other components of Luminar’s data and business intelligence ecosystem: Amazon Cloud, R, Talend and Tableau. The company has increased its ingest of transaction data from 300 sources to thousands, and from 2 to 15 terabytes per month. Before, it took Luminar days to ingest and join a new set of raw data; now it takes only hours, even with eight times more data than before. Luminar uses this actionable intelligence to craft marketing strategies for CPG and entertainment companies that want to focus on the US Latino population.

Syndicate videos according to behavior, demographics and channel


A major omni-media company specializes in home improvement and DIY content distributed across television, digital, mobile and publishing channels. One of its divisions is focused on delivering online video ads.
Both content syndicators and publishers want to make sure that video content reaches the right audience. The company analyzes clickstream data stored in Hadoop to understand audiences and then feeds that insight into a recommendation engine that improves advertising click-through.

ETL toy market research data for longer retention and deeper insight


A leading consumer research firm provides consumer intelligence to the toy industry. The market is in a state of flux; there are more new digital options and “real world” forms of children’s play than ever before. The company helps its clients keep pace by delivering weekly point-of-sale (POS) tracking information for competitive insight on toy sales trends. They cover all the major toy retailers for a single view of the marketplace.
The company chose Hadoop to offload much of its data from a more expensive platform, with expected savings of more than $1 million annually. The improved economics allow the company to retain data longer and identify long-range, strategic opportunities for growth. This helps its clients in the toy industry partner more closely with retailers.

Optimize online ad placement for retail websites


An advertising Hortonworks customer provides web analytics services to some of the world’s largest retail websites. For their largest customer, clickstream data pours in at the rate of hundreds of megabytes per hour, which adds up to billions of rows per month. The agency analyzes each ad’s placement and determines click-through and conversion rates. When impression files and click files were stored in a relational database, the agency had no way to intelligently connect impressions to clicks, so they had to make too many guesses.
Now HDP® replaces that guesswork with empirical science and confident analysis by week, by day or by hour. The agency can also filter by the consumer’s OS, browser, device and geographical location. With Hortonworks Data Platform’s economies of scale, data storage costs are significantly lower than before, and data can be retained for longer. Now the agency and its clients all look forward to looking back on years (not weeks) of clickstream data.
The agency’s retail customers can now tell if consumers are clicking on their website while standing in one of their stores. This provides valuable insight to manage “showrooming” behavior where customers visit a store to touch a product and then drive home to buy it online. Retailers can address showrooming without slashing prices, and data in HDP reveals specific tactics for doing so.
