Tagged: analysis

  • admin 9:51 am on November 3, 2017 Permalink
    Tags: analysis

    Behavioral segmentation through path analysis 

    Latest imported feed items on Analytics Matters

  • admin 10:33 am on July 18, 2017 Permalink
    Tags: analysis

    Introducing the Path Analysis Interface for Teradata 

    Latest imported feed items on Analytics Matters

  • admin 9:46 am on May 12, 2016 Permalink
    Tags: analysis, Tackles

    Analysis: Teradata Tackles Complexity with New IoT Analytics Team 

    Teradata Press Mentions

  • admin 9:49 am on August 31, 2015 Permalink
    Tags: analysis, JapanData

    Japan-Data Analysis Using Teradata QueryGrid 

    Teradata White Papers

  • admin 9:45 am on August 27, 2015 Permalink
    Tags: analysis, JapanAccelerate

    Japan-Accelerate The Value of Big Data Analysis and Discovery 

    Teradata Brochures

  • admin 9:47 am on July 15, 2015 Permalink
    Tags: analysis

    Teradata’s Big Data Appliance: Analysis For Everyone 

    Teradata Press Mentions

  • admin 9:52 am on April 6, 2015 Permalink
    Tags: analysis, Funnel, Playbook

    Funnel Analysis: an Approach from the Power Marketer Playbook 

    Power marketers are always interested in the most effective ways to track, measure, and analyze customer experiences for more relevant engagement. I’d like to share an approach that is less well known yet potentially quite powerful.

    Businesses across global markets are re-thinking data, analytics, platforms, and research methods to better understand their customers. Event analytics offers a new view of the customer, leveraging the best technologies and diverse data sources to obtain actionable insights in real time. Traditional methods help us understand consumers in terms of who, what, when, and where. Yet the two most important questions for understanding consumers, "why" and "how," remain unanswered. Their answers are key to obtaining business value because they reveal why and how consumers interact with a company.

    Traditional approaches focus on how the customer looks to the business. For example, what do you buy? What segments are you in? When was your last visit? However, the more important question should be “how does the business look to the customer?” How do our customers experience our products and brands? How do customers feel at each touch point?

    One major advantage of event analytics over traditional methods is that it can improve our understanding of the customer’s view of the business. Traditional systems are not designed to solicit, extract and stitch together customer experience data well. Event analytics obtains information about the entire customer experience in detail, threading together many sources of information from different applications that combine to deliver the full view of customer experience.

    To conduct event analytics, businesses need to create a “customer experience universe” that stitches customers’ experiences together, allows for easy behavior pattern recognition and facilitates visualizations of customer behaviors. This universe includes social media, customer experience, marketing channels, mobile apps, and devices. Then, machine learning algorithms are used to run through all the data to identify patterns.

    Event analytics is an ecosystem that includes, for example, streaming ingestion of events, an event repository, event metadata, a guided user interface for business analysts, and machine learning algorithms. One category of use cases, funnel analytics, helps us understand customer behavioral patterns and what triggers their experiences.

    Funnel analysis provides visibility across a series of customer experience events that lead towards a defined goal, say, from user engagement in a mobile app to a sale in an eCommerce platform. Funnel analyses are an effective way to calculate conversion rates on specific user behaviors, yet funnel analytics can be complex due to the difficulty in source categorization, visitor identification, pathing, attribution and conversion.
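
    To make the conversion-rate idea concrete, here is a minimal Python sketch of a funnel computation over an ordered event log. The event names, user ids, and the funnel_counts helper are all illustrative assumptions, not part of any Teradata interface.

```python
# Hypothetical event log: (user_id, event) pairs, ordered by time per user.
events = [
    ("u1", "app_open"), ("u1", "view_product"), ("u1", "add_to_cart"), ("u1", "purchase"),
    ("u2", "app_open"), ("u2", "view_product"),
    ("u3", "app_open"), ("u3", "view_product"), ("u3", "add_to_cart"),
]
funnel = ["app_open", "view_product", "add_to_cart", "purchase"]

def funnel_counts(events, funnel):
    """Count how many users reach each step of the funnel, in order."""
    progress = {}  # user_id -> index of the next expected funnel step
    for user, event in events:
        step = progress.get(user, 0)
        if step < len(funnel) and event == funnel[step]:
            progress[user] = step + 1
    counts = [0] * len(funnel)
    for reached in progress.values():
        for i in range(reached):
            counts[i] += 1
    return counts

counts = funnel_counts(events, funnel)                            # [3, 3, 2, 1]
rates = [counts[i + 1] / counts[i] for i in range(len(counts) - 1)]
```

    With these toy events, all three users open the app and view a product, two add to cart, and one purchases, so the step-to-step conversion rates drop along the funnel exactly as a funnel visualization would show.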

    Funnels can be built using a single guided user interface without needing to write code or move data. As a result, event analytics can scale at the speed of business. It is a smart analytic approach because it helps create visibility to the path that users are most likely to follow to achieve their goals.

    This insight is significant because it gives marketers a deep, data-driven line of sight into the customer experience universe.

    James Semenak

    James Semenak is a Principal Consultant with Teradata – known as an evangelist and architect for Event Analytics as well as Big Data Analytics and strategies.  James consults in all things related to data and analytics around the internet, and has worked with Shutterfly, Expedia, eBay Enterprise, Charles Schwab, Nokia, eBay, PayPal, Real Networks, Overstock.com, Electronic Arts, and Meredith Corp.



    The post Funnel Analysis: an Approach from the Power Marketer Playbook appeared first on Data Points.

    Teradata Blogs Feed

  • admin 9:52 am on November 15, 2014 Permalink
    Tags: analysis, Disk, Vector

    Optimizing Disk IO and Memory for Big Data Vector Analysis 

    At Yale, I usually teach an advanced database systems implementation class every spring that covers both traditional and more modern database system architectures. I often like to test my students with questions like the following:

    Let’s say the following SQL query is issued to a data warehouse for a retail store (the query requests the total revenue generated by a particular store on October 20th, 2014):
    SELECT SUM(transaction_amount)
    FROM transactions
    WHERE store_id = 37
    AND date='2014-10-20'

    Assume that we either do not have indexes on store_id and date, or that the query optimizer chooses not to use them to process this query (e.g. because the predicates are not restrictive enough). The query processor thus executes this query by scanning every record in the transactions table, performing two predicate checks (one to see if store_id is 37, and one to see if date is 2014-10-20), and, if both checks pass, extracting the transaction amount and adding it to a cumulative sum. What is the performance bottleneck of this query?
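
    As a toy model of that scan, the following Python sketch checks both predicates on every record and accumulates the matching amounts. The records are invented for illustration.

```python
# Full-table scan: two predicate checks per record, then accumulate.
transactions = [
    {"store_id": 37, "date": "2014-10-20", "transaction_amount": 19.99},
    {"store_id": 12, "date": "2014-10-20", "transaction_amount": 5.00},
    {"store_id": 37, "date": "2014-10-21", "transaction_amount": 7.50},
    {"store_id": 37, "date": "2014-10-20", "transaction_amount": 3.01},
]

total = 0.0
for rec in transactions:                                   # scan every record
    if rec["store_id"] == 37 and rec["date"] == "2014-10-20":
        total += rec["transaction_amount"]                 # extract and sum
# total now holds 19.99 + 3.01
```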


    I obviously have not given the students enough information about the hardware configuration of the machine (or cluster of machines) on which this query is executed, nor about the software implementation of the database system, for students to be able to answer the above question definitively. But you can tell a lot about how much students understand the principles of database system architecture by how they go about describing what the answer depends on.

    Before describing some potential bottlenecks, let’s eliminate one possibility: network bandwidth between query processing servers. It doesn’t matter whether the transactions table is small and fits easily on a single machine or is very large and partitioned across thousands of machines — this query is almost perfectly partitionable. Each machine can calculate its own cumulative sum for the partition of transaction records located locally, and the only communication across query processing servers happens at the very end of the query, when the subtotals for each partition are added together.
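
    The partitionable aggregation can be sketched in a few lines of Python; the per-machine values below are made up.

```python
# Each machine sums its local partition independently; only the small
# per-partition subtotals ever cross the network.
partitions = [
    [19.99, 5.00],   # matching amounts local to machine 1
    [7.50],          # machine 2
    [3.01, 1.00],    # machine 3
]

subtotals = [sum(p) for p in partitions]  # computed locally, in parallel
total = sum(subtotals)                    # the only cross-machine step
```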

    However, several other bottlenecks may exist. To understand where they may come from, let’s examine the path of a single record from the transactions table through the database query processing engine. Let’s assume for now that the transactions table is larger than the size of memory, and the record we will track starts off on stable storage (e.g. on magnetic disk or SSD).

    First, the record needs to be read from stable storage into the memory of the server that will process it. Second, it needs to be transferred from memory to the local cache/registers of the processing core that will process it. Third and finally, the processing core needs to perform two predicate evaluations on the record (on store_id and date), extract (if necessary) the transaction amount, and add it to the cumulative sum.

    Each of these three steps will have to be performed on each record of the transactions table, and each may become a bottleneck for query processing. The first step is a potential “disk bandwidth bottleneck” — the rate at which data can be transferred from stable storage to memory is simply too slow to keep up with subsequent parts of query processing. The second step is a potential “memory bandwidth bottleneck”, and the third step is a potential “CPU processing bottleneck”.

    For “big data” datasets where the size of data is significantly larger than the size of memory, the most common bottleneck is disk bandwidth. Disk bandwidth of even the highest-end disks remains on the order of hundreds of megabytes per second, while memory bandwidth is usually at least an order of magnitude faster. Furthermore, very little work is required of the CPU per record (just two predicate evaluations and a sum) — database queries tend to be far less CPU-intensive than other domains (such as graphics rendering or scientific simulations). Hence, step 1 is often the bottleneck.

    There are three techniques commonly used to eliminate the disk-bandwidth bottleneck in database systems. First, you can increase the memory available to store data on each machine, and thus decrease the amount of data that must be read from disk at query time. Second, you can use storage arrays of large numbers of disks and a fat pipe to transfer data from this array of disks to the processing server. In other words, even if each disk can only transfer data at 100MB/sec, 20 disks combined can transfer data at an aggregate rate of 2GB/sec. Third, you can leverage software to increase the efficiency of data transfer. For example, data compression allows for an effectively larger number of records to be packed per bit transferred. Column-store database systems allow only the columns accessed by the query (in our example — store_id, date, and transaction_amount) to be read off of disk (so disk bandwidth need not be wasted reading in irrelevant data for a query). With some amount of effort, it is usually possible to eliminate the disk-bandwidth bottleneck through some combination of the three techniques described above.
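
    A back-of-the-envelope Python sketch can combine the three techniques above. The 3x compression ratio and the 12-of-300-byte column fraction are illustrative assumptions, not measured numbers.

```python
def effective_scan_rate(disk_mb_s, num_disks, compression_ratio, column_fraction):
    """Logical table-scan rate (MB of raw table per second) after striping
    across disks, compressing, and reading only the needed columns."""
    raw = disk_mb_s * num_disks               # aggregate physical bandwidth
    return raw * compression_ratio / column_fraction

# 20 disks at 100 MB/s each is 2 GB/s of physical bandwidth; with 3x
# compression and only 12 of 300 bytes per record read, the logical rate
# is far higher.
rate = effective_scan_rate(100, 20, 3, 12 / 300)
```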

    Figure 1. Optimizing Disk IO, Memory, and CPU Usage

    The next bottleneck to present itself is usually memory bandwidth. Unlike the disk bandwidth bottleneck, the memory bandwidth bottleneck cannot typically be solved via modifying the hardware configuration. Instead, the database system software needs to be intelligent about how to efficiently use memory bandwidth, so that every drop of memory bandwidth is utilized to the maximum possible extent. The two main techniques used for this are the same two software techniques mentioned above: compression and column-orientation.

    In order for compression to alleviate the memory bandwidth bottleneck, data must remain compressed in memory. Many database systems decompress data in the buffer pool after it is brought in from disk — this simplifies the system code and makes data modifications easier to handle. Unfortunately, this exacerbates the memory bandwidth bottleneck, as the full, uncompressed data is sent to the CPU whenever it needs to be processed. In contrast, well-optimized systems (especially systems optimized for high-performance data analysis) will keep data compressed in memory and either operate directly on the compressed data or only decompress it in the CPU immediately prior to processing.
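
    One common way to operate directly on compressed data is dictionary encoding, sketched below in Python: the column is encoded once, the predicate is translated into code space, and only small integer codes are compared. This is a generic illustration, not a description of any specific system's encoding.

```python
# Dictionary-encode a string column, then evaluate the predicate on the
# codes without ever decompressing the strings.
dates = ["2014-10-20", "2014-10-21", "2014-10-20", "2014-10-22"]
dictionary = {v: i for i, v in enumerate(sorted(set(dates)))}
codes = [dictionary[d] for d in dates]        # compressed column: small ints

target_code = dictionary["2014-10-20"]        # translate the predicate once
matches = [c == target_code for c in codes]   # compare codes, not strings
```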

    Column-oriented page layout is a particularly important technique for eliminating the memory bandwidth bottleneck. Assume that store_id, date, and transaction_amount each take up 4 bytes per record (12 bytes in total), and that each record in the transactions table is 300 bytes. Furthermore, assume a CPU core cache line is 128 bytes. In the best-case scenario for a traditional row-oriented data layout, store_id, date, and transaction_amount all fall within the same 128-byte subset of the 300-byte record, and the 128-byte cache line containing these three attributes is sent from memory to the cache of the processing core that will process the record. In such a scenario, 12 / 128 = 9.4% of the data transferred from memory to cache will be accessed by the CPU. In the worst-case scenario, these three attributes are located within different cache lines of the 300-byte record, and as a result, only 12 / 300 = 4% of the data transferred from memory to cache will be accessed by the CPU.
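
    The cache-line arithmetic above, spelled out:

```python
# Fraction of transferred bytes the CPU actually uses, row-oriented layout.
needed = 3 * 4                   # store_id, date, transaction_amount: 4 bytes each
cache_line = 128
record = 300

best_case = needed / cache_line  # all three attributes share one cache line
worst_case = needed / record     # attributes scattered across the record
# best_case = 0.09375 (~9.4%), worst_case = 0.04 (4%)
```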

    In contrast, if data is laid out on a page column-by-column rather than row-by-row, each cache line consists entirely of data from a single attribute of the table. Only cache lines corresponding to the three relevant attributes are sent from memory to the processing core cache, and hence nearly 100% of the data transferred from memory to cache will be accessed by the CPU. Hence, column-orientation usually improves the memory bandwidth efficiency by an order of magnitude, significantly alleviating the memory-bandwidth bottleneck.

    In summary, only with effort and intelligence (especially around compression and column-orientation) can the first two bottlenecks — disk-bandwidth and memory-bandwidth — be eliminated, and the bottleneck shifted to the CPU.

    To alleviate the CPU bottleneck, techniques must be used to maximize the efficiency of the processor. One clever way to accomplish this is to leverage the SIMD instruction set on modern CPUs, such as the Intel Haswell with AVX2 vector processing. To understand how SIMD works, let’s first examine how the predicate of our example query (store_id = 37) would be evaluated normally (without leveraging SIMD instructions). For each record, the store_id would be extracted from the cache line and sent to a 128-bit CPU register to be compared with the value “37”. This process happens sequentially — each store_id is extracted and compared to “37” one after the other.

    Figure 2. Single CPU addition versus four with 128-bit vector registers

    Note that a 128-bit register can actually hold 128/8 = 16 bytes of data. Since each store_id takes up 4 bytes, you can store 4 store_ids in a single register. In such a situation, where you can fit a “vector” of values within a single CPU register, the SIMD instruction set allows a programmer to perform the exact same operation in parallel on each element of the vector. In other words, it is possible to compare 4 different store_ids to 37 in parallel in a single CPU step. This effectively quadruples the CPU efficiency.
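
    As a rough Python stand-in for this idea, the sketch below processes store_ids in chunks of four; real AVX2 hardware would evaluate each chunk with a single compare instruction rather than a loop. The helper name and values are hypothetical.

```python
def vectorized_equals(values, target, width=4):
    """Compare values against target in register-width chunks. Each chunk
    models one 128-bit register holding four 4-byte store_ids; SIMD would
    compare the whole chunk in one instruction."""
    out = []
    for i in range(0, len(values), width):
        chunk = values[i:i + width]              # one "register" worth of ids
        out.extend(v == target for v in chunk)   # one SIMD compare per chunk
    return out

flags = vectorized_equals([37, 12, 37, 99, 37], 37)
```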

    We will perhaps go into more detail on how to leverage SIMD features for database operations in future blog posts. The main takeaway I want readers to get from this post is that leveraging SIMD instructions on modern CPUs (also known as “vectorized processing”) is entirely a CPU optimization. For such an optimization to make a difference in actual query performance, the disk-bandwidth and memory-bandwidth bottlenecks have to be removed first. Teradata Database builds cache-line-friendly columnar table structures in memory to exploit Intel AVX2 vector processing.

    Teradata recently announced support for vectorized processing. While this is great news by itself, what is more interesting to me is the announcement that vectorized processing improves performance on real-life Teradata queries by 10-20%. Using columnar arrays in memory, individual steps within a query run up to 50% faster. A straightforward reading of these performance claims would suggest that the improvement is a testament to the high quality of this particular new feature. I hope that anyone who has read this post carefully will understand that it means much more than that: it is a testament to the quality of the rest of the Teradata product in removing the disk-bandwidth and memory-bandwidth bottlenecks so that Intel CPU optimizations can actually make a bottom-line difference. It is an indication that Teradata has a legitimate column-store — one that improves the efficiency of disk transfer by reading only the relevant columns, improves the efficiency of memory transfer by avoiding cache pollution from irrelevant attributes, and keeps data in columns inside the CPU registers for parallel processing at the hardware level.

    Learn more about Teradata’s Data Warehousing Solutions.

    Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, commercialized by Vertica), high-performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

    Teradata Blogs Feed
