Updates from December 2014

  • admin 9:54 am on December 30, 2014 Permalink
    Tags: Liquid, Trillion

    Tap into a $5 Trillion Market with Liquid Information 

    While big data dominates the headlines, open data is quietly becoming a force in its own right—a force that could reach $5 trillion annually in potential value. Open data, also known as “liquid information,” is data available for anyone to use, reuse and distribute for free. It is typically machine-readable and accessible through an application programming interface (API).

    Performing analytics on this information lets organizations identify anomalies in costs and variations in performance that can lead to new and improved products and operations. For example, one company integrated 30 years of weather data, 60 years of crop information and 14 terabytes of soil data—free government information—to help farmers around the globe make data-based decisions for their operations.

    With approximately 60 nations releasing 1 million data sets, companies can capitalize on the information to improve benchmarking, planning and other business functions. Organizations can also release their own data, which can be shared and integrated with other information to identify new revenue sources, support new business practices and promote innovation.

    Brett Martin
    Editor in Chief
    Teradata Magazine

    Teradata Blogs Feed

  • admin 9:52 am on December 24, 2014 Permalink

    Data-Driven Success 

    There’s no question about it—when it comes to putting data-driven innovation to work across the business, Teradata customers do it best.

    Telefónica U.K., one of the largest telecommunications companies in the U.K., is a good example. The company is on a mission to expand its customers’ mobile capabilities and provide a better experience across all channels. By integrating geospatial and customer data, it’s able to effectively communicate with customers in real time and make better data-driven decisions.

    Raiffeisen Bank International, which operates in 17 countries across Central and Eastern Europe, also uses data to spark innovation. The bank embarked on a journey to build an enterprise-wide data warehouse based on a harmonized risk and financial data model. Now that risk and finance speak the same language, everyone can share and benefit from the same information, which leads to deeper insights.

    This type of innovation can deliver myriad benefits for any organization in any industry, including improved products and services, enhanced business processes, better business models, and huge revenue opportunities.

    Carly Schramm
    Assistant Editor
    Teradata Magazine

    Teradata Blogs Feed

  • admin 9:46 am on December 24, 2014 Permalink
    Tags: Direct

    Teradata and Qlik Direct Discovery: Hybrid In-Memory and In-Database 


    Teradata Brochures

  • admin 9:51 am on December 23, 2014 Permalink

    A Single Data Fabric to Tame Big Data 

    By now you probably know I’m an advocate of the data lake approach for expanding the possibilities around big data.  But, I’ve also made the point that data inside the lake isn’t much use if it isn’t accessible and connected to other data sources.  You can dump all the information you want in a data lake, but co-location is not good enough.  You also need coherence!  All the capacity and data architecture in the world will get you little more than chaos without an orchestration layer to “weave” data together across multiple systems and processes in usable ways.

    The textile analogy is not a trivial one.  The idea of tightly woven nodes of connection, redistribution and communication points is behind the revolutionary “data fabric” concept for powerful, diverse and closely interconnected computing assets that are scalable and can be dynamically reconfigured as an agile analytic environment.  Throughout, the ability to orchestrate these assets makes the difference between a simple tangle of disjointed information and a cohesive data fabric that you can leverage for insight and value.


    We took a big step toward fabric computing at scale with the April 2014 release of Teradata QueryGrid, a set of intelligent connectors and product capabilities that allows you to submit a single query across your entire multi-system architecture, sidestepping the limitations of data movement and system boundaries.  Now I’m pleased to say we’ve announced new Data Fabric capabilities, enabled by QueryGrid, that further enhance the orchestration of multiple enterprise systems to the point where they essentially behave as one data fabric.

    We announced these QueryGrid enhancements at our Partners Conference & Expo in Nashville, Tennessee, and you can read much more about them in our news release. But even a cursory look at these capabilities makes the value clear. Teradata QueryGrid now enables a Data Fabric that can scale economically, with architectural flexibility, across the customer’s data and analytical environment.  These various layers of capability all use the same familiar SQL interface to submit powerful queries to the Teradata Database, Teradata Aster, Hadoop or other data management platforms. The most powerful bond is between two Teradata Databases, and between the Teradata Database and the Aster Database. The next layer binds the Teradata Database and Hadoop systems within the Teradata Unified Data Architecture (UDA).
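    As a rough illustration of what “a single query across your entire multi-system architecture” can look like, here is a minimal sketch, assuming a QueryGrid foreign-server definition (called hadoop_cluster here) is already in place; the syntax shape and every table and column name are illustrative assumptions rather than quotes from the announcement.

    -- Join warehouse customer data with clickstream data that lives in Hadoop,
    -- reached through a hypothetical QueryGrid foreign server named hadoop_cluster.
    SELECT c.customer_id,
           c.customer_segment,
           COUNT(*) AS page_views
    FROM   customer_dim c,
           web_clicks@hadoop_cluster w   -- remote table reference via QueryGrid
    WHERE  c.customer_id = w.customer_id
    GROUP  BY c.customer_id, c.customer_segment;

    The point of the sketch is simply that the remote data is addressed from ordinary SQL in the warehouse, so no separate export and reload step is needed.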

    A third, more universal set of features and capabilities comes in the fabric extensions beyond the Teradata UDA to additional relational databases, NoSQL databases and other data repositories. An Adaptive Optimizer capability ensures optimal query performance across even the most diverse multi-system, multi-vendor and mixed-technology ecosystems. Taken together, these improvements help Teradata QueryGrid turn a best-of-breed environment into what I call an Orchestrated Analytical Ecosystem. And a lot of the problems inherent in the data lake are solved in the process.

    It’s not hard to think of the use cases that are emerging as we continue to bring these capabilities to market over the next few months. Airlines can now use a single layer of analysis to examine sentiment expressed in social media data on Hadoop alongside high-value customer data from transaction history details located in the data warehouse. In a single step, a manufacturer can leverage serial numbers from the data warehouse along with sensor data in a MongoDB database to examine reports of product failures. And insurers can avoid duplicating data while using multiple systems to look at both recent and historical information when examining claims history for a particular medical condition over a decade or more.

    I think it’s cause for celebration that we’re at a point in the analytic revolution where the data fabric has moved from concept to reality, as a means for enterprises to weave together a truly cohesive, flexible, efficient and economical framework for building and integrating a data lake. These latest Teradata QueryGrid enhancements give companies the flexibility to pick their file systems, operating systems, data types, analytic engines and system design characteristics to reap the most insight and value from their complex and customized analytic environments.  This lets everyone focus even more on getting answers to business questions, not the headaches of underlying IT process or infrastructure.

    Scott Gnau

    Teradata Blogs Feed

  • admin 9:45 am on December 23, 2014 Permalink

    DEMAND FORECASTING 


    Teradata Brochures

  • admin 9:51 am on December 22, 2014 Permalink
    Tags: bestselling, read

    Is a bestselling book the best read? 

    When buying a book, whether as a present or for a holiday read, it’s tempting to use the bestseller lists as a guide to what’s the best read. However, buying based only on the titles with the highest unit sales doesn’t always provide the most enjoyable and entertaining read. The eBook has opened up a whole new level of insight into reader engagement – not just what people buy, but whether they read what they buy, and whether they finish the book or abandon it.

    This week has seen further analysis, using data aggregated from eBook readers, suggesting that the bestseller list may not necessarily identify the books with the most appeal. In the summer, the Wall Street Journal reported on research that used data from Kindle highlights to estimate how far through a book readers typically reached: if highlights tended to diminish two-thirds of the way through a title, that provided a reasonable guide to the typical completion point for that title.

    Teradata’s focus on the benefits of always considering the lowest granularity of data is highlighted by a more recent study based on data from Kobo. Kobo looks at reader engagement using open rates, completion rates, reading speed and number of sessions: a raft of data that gives publishers much better insight into publishing trends: who and what is hot, who’s not, and predictions of how this will change over time. This research showed that the actual best-read list differed significantly from the top-selling list – much to the delight of Casey Kelleher, whose self-published thriller “Rotten To The Core” had the highest completion rate, at 83%, and yet didn’t appear in the best-selling list.
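    To give a feel for the kind of low-granularity analysis this involves, the sketch below derives per-title open counts and completion rates from raw reading-session records. The reading_sessions table, its columns and the 95% completion threshold are hypothetical stand-ins, not Kobo’s actual schema or methodology.

    -- Hypothetical session log: reading_sessions(reader_id, title, session_start, pct_of_book_reached)
    SELECT title,
           COUNT(*)                                                   AS readers_who_opened,
           SUM(CASE WHEN max_pct >= 95 THEN 1 ELSE 0 END)             AS readers_who_finished,
           CAST(SUM(CASE WHEN max_pct >= 95 THEN 1 ELSE 0 END) AS FLOAT)
             / COUNT(*)                                               AS completion_rate
    FROM (
        -- Furthest point each reader reached in each title, across all of their sessions
        SELECT reader_id, title, MAX(pct_of_book_reached) AS max_pct
        FROM   reading_sessions
        GROUP  BY reader_id, title
    ) per_reader
    GROUP BY title
    ORDER BY completion_rate DESC;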

    So before you select your next read, or a gift book, you might want to look beyond the best-sellers to ensure your selection is as entertaining as you hope it to be; that it is opened and gets read from end to end.

    Teradata Blogs Feed

  • admin 9:46 am on December 22, 2014 Permalink

    The Digital Trends for Data-Driven Markets in 2015 & Beyond 

    Teradata Press Mentions

  • admin 9:51 am on December 21, 2014 Permalink

    Graph Processing Inside an Analytic DBMS 

    Although the Bulk Synchronous Parallel (BSP) model for scalable parallel processing was invented by Leslie Valiant in the 1980s (and was cited as part of the reason for Valiant’s recent Turing award), it became a popular model for scalable processing of graph data in 2010 when Grzegorz Malewicz et al. from Google published their seminal paper on Pregel at SIGMOD 2010 (http://dl.acm.org/citation.cfm?id=1807184).

    Scaling the processing of graph data (such as social network data, telecom data, and public health disease outbreak data) is notoriously challenging. Unlike set-oriented relational database processing, which is fairly straightforward to parallelize, graph processing typically requires traversing paths in a graph. Given the high branching factors of most real-world graphs, this results in non-local access patterns, and thus a high dependence on vertices and edges being stored or processed by remote processors. The reliance on remote data makes parallel processing much more difficult, as each processor may have to wait for arbitrary lengths of time for remote data.

    Pregel’s use of BSP for processing graphs does not eliminate the non-local access pattern problem. However, by eschewing asynchronous communication models that can lead to deadlock and other types of race conditions that make programming scalable algorithms extremely complex, BSP makes it much easier for programmers to reason about the semantics of their graph algorithms. In particular, the programmer of a graph algorithm only needs to focus on what happens during a “super-step”. In each super-step, each vertex, v, in the graph can apply an arbitrary function on its own local state, based on messages sent to v from any other vertex in the graph in the previous super-step. The output of this function is a new state for v (and also potentially new states for its outgoing edges), and a set of messages that can be sent to any arbitrary vertex in the graph that will be processed in the next super-step.

    In other words, processing proceeds in a BSP graph system via an alternating series of completely local function applications done in parallel on each vertex in the graph, interspersed with an arbitrary amount of message passing between vertices. The system alternates between these two steps “synchronously” — the next round of per-vertex function applications cannot begin until the previous round of message passing is complete. Although in distributed systems “synchronizing” is usually synonymous with “performance slow-downs due to waiting for other nodes to catch up,” these synchronization barriers are actually relatively cheap in graph processing systems since there are typically many more vertices than there are processors, so a significant amount of work occurs per processor before synchronization, thereby allowing the cost of synchronization to be amortized over the larger amount of useful work.

    It turns out that many popular graph algorithms can be elegantly expressed in terms of these “per-vertex” functions that get applied in each super-step of the BSP algorithm. The Pregel paper itself describes implementations of PageRank, Shortest Paths, Bipartite Matching, and a Semi-Clustering algorithm using this model. Thus, the combination of programmer friendliness and the relatively small cost of the synchronization barriers has led the BSP model for graph processing to gain enormous popularity in the four years since the Pregel paper was published. For example, the original Pregel paper has already received close to 1,000 citations, and several new graph processing systems that adopt the Pregel BSP model have recently been developed, including Apache Giraph, Apache Hama, GPS, Mizan, and Teradata Aster’s SQL-GR framework.

    However, Teradata Aster’s SQL-GR framework differs from the other BSP graph systems mentioned above in that it is implemented inside an analytical database system. This greatly increases both the power and efficiency of the system, which the rest of this blog post will discuss.

    For example, let’s say a Telecom company has a data warehouse with a “fact table” consisting of a history of call detail records (CDRs) containing the phone numbers of the originator and receiver of each call, the start time and duration of each call, and various other routing and failure information associated with each call. Furthermore, it contains various dimension tables containing customer address and billing information, router location information, etc.

    For most of what the company wants to do with this data (e.g. billing, fault analysis, statistic generation, population of customer-viewable information about their call history, etc.), the above-described relational layout with a CDR fact table and dimension tables with more detailed information is the right way to go. Dimension tables are joined with the fact table as needed, and some may even be pre-joined with the fact table if a column-oriented layout is used (since large, wide tables are not problematic for column-stores).
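    For instance, a routine billing question is a plain fact/dimension join of exactly this kind. The sketch below is hypothetical: the cdr fact table, the customer_dim dimension table and their columns are illustrative stand-ins for the schema described above.

    -- Billable minutes per customer for November 2014: a standard star-schema join
    SELECT cu.customer_id,
           cu.billing_plan,
           SUM(f.call_duration_sec) / 60.0 AS billable_minutes
    FROM   cdr f
    JOIN   customer_dim cu
      ON   f.originating_number = cu.phone_number
    WHERE  f.call_start_time >= DATE '2014-11-01'
      AND  f.call_start_time <  DATE '2014-12-01'
    GROUP  BY cu.customer_id, cu.billing_plan;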

    However CDR data is also a gold-mine for doing customer churn analysis because it contains information about real connections between people — who is calling whom, which numbers are popular, who are the bridges between different communities of customers, etc. For this type of analysis, it is helpful to think of CDR data as a graph, where each customer (or phone number) is a vertex in the graph, and there are edges in the graph for each CDR record corresponding to calls and SMS texting between two customers (phone numbers).

    Once a graph has been constructed out of the CDR data, graph algorithms that calculate the “eigen centrality” of each vertex and the “betweenness” of each vertex are extremely useful for understanding which vertices are key influencers and which control information flow — two characteristics that are critical to any kind of customer churn analysis. The eigen centrality of a vertex is a direct measure of its relative importance within the graph — if there are many edges that connect to it, especially if those edges are of high “quality” (i.e. they originate from other highly important vertices), then that vertex will have a high eigen centrality. For example, if many important people call Mr. Graham, the “Mr. Graham” vertex will have a high eigen centrality, and we can thus infer that he is important even if he receives very few other phone calls. Measuring the eigen centrality of a vertex is necessarily an iterative algorithm — in each iteration we get a new estimate of the importance of each vertex and can use this to refine the quality values of each outgoing edge in order to create a new importance estimate for each vertex in the next iteration.

    “Betweenness” is another measure of the centrality of a vertex in a graph. It counts the number of times a vertex appears on the shortest path between any two other nodes in the graph. If this number is large, the vertex often serves as a bridge between multiple clusters of nodes in the graph, and it has a large amount of power over the flow of information through the graph. In our example, these clusters of vertices are communities of people.

    Centrality of a Vertex Between Clusters of Vertices

    Since both betweenness and eigen centrality are complex, iterative algorithms on potentially large graphs, BSP graph engines of the type discussed above are a great fit for performing these calculations. Trying to perform the same operations in a standard relational database system is close to impossible — each iteration would require a self-join of the table containing all the edges in the graph; but without knowing in advance how many iterations will be required, it cannot be determined how many joins to include in the SQL query. Furthermore, even if the required number of iterations could somehow be deduced in advance, trying to write these complicated graph algorithms in SQL would be highly unpleasant and nearly impossible for other humans to decipher.
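    To see why, here is a sketch of what just one iteration of an unnormalized, eigen-centrality-style update would look like in plain SQL, assuming hypothetical edges(src_id, dst_id) and scores(vertex_id, score) tables. Every further iteration needs another self-join like this one, and the number of iterations is not known up front.

    -- One iteration only: each vertex's new score is the sum of its callers' current scores.
    SELECT e.dst_id      AS vertex_id,
           SUM(s.score)  AS new_score
    FROM   edges e
    JOIN   scores s
      ON   s.vertex_id = e.src_id
    GROUP  BY e.dst_id;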

    Therefore, our example telecom company is a little stuck — on one hand a relational database system is a great fit for most of what needs to be done with the data (billing, fault analysis, stat generation, etc.), but a BSP engine is the right fit for doing graph-oriented customer churn analysis. Since traditional database systems do not process graph algorithms well, the telecom company may be tempted to put the entire dataset in the BSP engine. Unfortunately, while graph engines are great for running graph algorithms, they are not optimized for traditional relational analysis, where standard star schemas and table-oriented (especially column-oriented) processing are known to perform extremely well. Therefore, the company will see major reductions in performance on its non-graph-oriented queries, and will also have fewer analytical tools to use in conjunction with its data, since many third-party tools are designed to interface only with relational database systems.

    Another option is to simply duplicate the same CDR data in two different systems — one relational system for doing standard analysis and one BSP graph system devoted to analyzing customer churn. Unfortunately, this option is also suboptimal. Aside from the problem of duplicating data in two systems (and the additional associated hardware, administrative, and maintenance costs), there is also a class of queries that requires both relational and graph processing techniques.

    For example, let’s say the telecom company wants to calculate betweenness metrics not on the original graph, but rather on a version of the graph containing all customers in New Jersey who recently received an overage fee and whose contract is about to end. The segmenting of the original dataset by these dimensions to get the new graph is exactly what relational database systems are designed for, while the calculation of betweenness metrics is what BSP graph systems are designed for. Performing this query in two separate systems would require the relational database system to perform the first part of the query and then export the new graph before it gets parsed and reloaded into the BSP system for the second part of the query. If further relational processing needs to be performed on the betweenness metrics, another export and reload back into the relational database system has to occur.

    Performing these “hybrid queries” in a single system is a far better solution. The entire query can be expressed as a single request, and processed entirely by one system. The SQL foundation adds easy data filtering, preparation, sorting, and reporting options to the BSP processing. For example, given the following two tables with the relevant attributes listed below:
    customer_connections table:
    source_customer_id: varchar, target_customer_id: varchar

    customers table:
    customer_id: varchar, customer_location: varchar, customer_overage_fee: boolean, end_of_contract_date: date

    The query described in the previous paragraph can be expressed in Teradata Aster as:

    SELECT * FROM BETWEENNESS(
        ON (SELECT * FROM customers
            WHERE customer_location = 'New Jersey'
              AND customer_overage_fee = 't'
              AND (end_of_contract_date - now()::date < 60)) AS vertices
        PARTITION BY customer_id
        ON customer_connections AS edges
        PARTITION BY source_customer_id
        TARGETKEY('target_customer_id')
    ) ORDER BY end_of_contract_date;

    Not only does performing the query in a single system eliminate the need to export, parse, and reload data in the middle of queries, but the query can also be optimized by a single optimizer that knows the statistics of the generated graphs and can eliminate redundant operations, apply predicates and projections in the right places in query plans, and so on. In fact, Teradata Aster actually has an elegant “collaborative planning” optimization technique that allows hybrid relational-graph queries to be planned and optimized in a unified framework. Since this post is already quite long, we will not discuss this optimization technique here, but I will likely describe it in detail in a future post.

    In summary, graph analysis is an increasingly important technique that data scientists will need to apply to many real-world datasets. BSP graph engines are extremely well suited to performing a large class of graph analysis queries; however, they do not replace traditional relational database systems. Since modern workloads will contain both relational and graph queries, including some queries that require both relational and graph techniques within the same query, a two system approach will lead to headaches, slowdowns, and a general lack of optimality. Thus, I believe that many relational database systems will eventually incorporate a BSP execution engine to enable hybrid workloads and analysis. Teradata Aster is at the forefront of this trend.


    Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high-performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi

    Teradata Blogs Feed

  • admin 9:46 am on December 21, 2014 Permalink
    Tags: Chris, Orchestration, Q&A, Twogood

    Big Data Analytics Requires Orchestration, not Federation: A Q&A with Chris Twogood of Teradata 

    Teradata Press Mentions

  • admin 10:33 am on December 20, 2014 Permalink
    Tags: ODTUG, Kaleidoscope

    Oracle Developers Tools User Group (ODTUG) “Kaleidoscope 2015” Conference 

    Teradata Corporation is a Silver Sponsor of the Oracle Developers Tools User Group (ODTUG) “Kaleidoscope 2015” Conference. Join us at the Diplomat Resort and Spa in Hollywood, Florida, where Kscope15 will feature more than 300 technical sessions, five symposiums, deep-dive sessions, and hands-on labs over the course of five days. Teradata will be exhibiting and presenting on the topic “Finance, What’s Your Analytics IQ? Critical Capabilities for the Evolving CFO Organization.” ODTUG is an independent, not-for-profit, global organization whose sole purpose is to keep you on the cutting edge of the constantly changing technology landscape.
    Teradata Events
