Tagged: Modeling

  • admin 9:51 am on May 24, 2017 Permalink
    Tags: Modeling

    The Future of Health and Human Services Data Modeling (Part 2) 

    Latest imported feed items on Analytics Matters

     
  • admin 9:57 am on August 20, 2016 Permalink
    Tags: Modeling

    The Future of Health and Human Services Data Modeling (Part 1) 

    Latest imported feed items on Analytics Matters

     
  • admin 9:51 am on April 25, 2016 Permalink
    Tags: Deaths, Modeling

    Who’s next? Predicting Deaths in Game of Thrones – Part 2: Event-based survival modeling 

    Latest imported feed items on Analytics Matters

     
  • admin 9:53 am on December 6, 2015 Permalink
    Tags: Modeling

    Data Modeling Requires Detailed Mapping — Learn Why 

    Mapping is an important step to understanding your data and where the data resides in your ecosystem. Mapping takes us from the known to the unknown and is effectively accomplished by using mapping tools, adopting best practices, and having a common understanding of how the mappings will be used. But mapping does take a considerable amount of time and requires a person with extensive knowledge of the source system, the target system, or both.

    Our team maps from an industry logical data model (the core model) to access path building blocks, and then to semantic structures such as dimensions and lower- to higher-level facts.

    Access path building blocks (APBBs) are designed to help the semantic modeler develop dimensions and facts (which are specific and denormalized) for the semantic data model from the core data model (which is normalized and generalized). APBBs bridge the gap between the dimensional and normalized logical data models. To help the modeler use APBBs, construction maps are included in the Solution Modeling Building Blocks (SMBBs) to illustrate how a dimension can be built from the core data model (the appropriate Teradata industry data model) using the access path building blocks. The following picture is an example of an APBB construction map: the green boxes represent tables in the core data model (the Teradata Financial Services Data Model, in this example), the orange boxes are the APBBs, and the white boxes are the resulting dimensions.

    In this example the construction map visually lays out the data needed to identify a person who is insured by a policy (the join path), and the diagram can also be understood by business analysts who may not model or write SQL.
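
    To make the join path concrete, here is a minimal sketch of the kind of semantic-layer view such a construction map might translate into. The schema names, tables, and columns (core.party, core.agreement_party, core.insurance_agreement, and the role code) are hypothetical illustrations, not actual FSDM or APBB objects.

        -- Hypothetical sketch only: an "insured person" dimension built by
        -- following the person-insured-by-policy join path over simplified,
        -- made-up core-model tables.
        REPLACE VIEW semantic.insured_person_dim AS
        SELECT
            ia.agreement_id      AS policy_id
          , p.party_id           AS insured_party_id
          , p.full_name          AS insured_name
          , p.birth_date         AS insured_birth_date
        FROM core.party p
        JOIN core.agreement_party ap
          ON ap.party_id = p.party_id
         AND ap.party_role_cd = 'INSURED'   -- role qualifier captured in the join notes
        JOIN core.insurance_agreement ia
          ON ia.agreement_id = ap.agreement_id;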
    Performing mappings helps our team identify gaps in both models. In one case, core data may need to be added to the semantic data model; in another, the core data model may need to be expanded with data required by BI reports. The gap analysis identifies the missing pieces of data needed to address the business requirements.

    Initially, our group performed detailed attribute-to-attribute mapping, but then switched to higher-level, entity-to-entity mappings to save time. The time saved on detailed attribute-to-attribute mappings, many of which are obvious, was instead spent describing the purpose of the mappings through filter and join notes. In the picture above, Gender Type is an entity while Gender Type Description (in Gender Type) is an attribute.

    We found that we could save a considerable amount of time by reusing the mappings from areas such as address or product within the same industry or across industries. Using a tool to perform the mappings makes them easily reusable and adaptable, and helps standardize around best practices.
    Mappings help with many things, including:
    • Focusing on specific areas in an ecosystem
    • Finding and resolving information gaps as well as design gaps
    • Creating the core layer views for the semantic layer
    • Establishing a reusable base for common content such as location or product
    • Supporting a more precise way of communicating and refining details during design and implementation

    Our team finds mapping to be useful, reusable, and educational, and a worthwhile investment of our time.
    For mapping we use Teradata Mapping Manager (TMM) found on Teradata Developer Exchange. Information on our Industry Data Models (iDMs) and Solution Modeling Building Blocks is on http://www.teradata.com.

    Karen Papierniak is a Product Manager responsible for development of Teradata’s Industry Solution Modeling Building Blocks and Data Integration Roadmap portfolio, which spans eight major industries and is used by customers worldwide. Karen’s roles at Teradata have included software development, systems architecture, and visual modeling across a variety of industries, including retail and communications.

    The post Data Modeling Requires Detailed Mapping — Learn Why appeared first on Data Points.

    Teradata Blogs Feed

     
  • admin 9:52 am on July 18, 2015 Permalink
    Tags: Modeling, Primary, Selection

    Optimization in Data Modeling 1 – Primary Index Selection 

    In my last blog I spoke about the decisions that must be made when transforming an Industry Data Model (iDM) from Logical Data Model (LDM) to an implementable Physical Data Model (PDM). However, being able to generate DDL (Data Definition Language) that will run on a Teradata platform is not enough – you also want it to perform well. While it is possible to generate DDL almost immediately from a Teradata iDM, each customer’s needs mandate that existing structures be reviewed against data and access demographics, so that optimal performance can be achieved.

    Having detailed data and access path demographics during PDM design is critical to achieving great performance immediately, otherwise it’s simply guesswork. Alas, these are almost never available at the beginning of an installation, but that doesn’t mean you can’t make “excellent guesses.”

    The single most influential factor in achieving PDM performance is proper Primary Index (PI) selection for warehouse tables. Data modelers are focused on entity/table Primary Keys (PKs), since the PK is what defines uniqueness at the row level. Because of this, a lot of physical modelers tend to implement the PK as a Unique Primary Index (UPI) on each table by default. But one of the keys to Teradata’s great performance is that it utilizes the PI to physically distribute data within a table across the entire platform to optimize parallelism. Each processor gets a piece of the table based on the PI, so rows from different tables with the same PI value are co-resident and do not need to be moved when two tables are joined.

    In a Third Normal Form (3NF) model, no two entities (outside of super/subtypes and rare exceptions) will have the same PK. So if the PK is chosen as the PI, it stands to reason that no two tables share a PI, and every table join will require data from at least one table to be moved before the join can be completed, which is not a solid performance decision, to say the least.

    The iDMs have preselected PIs largely based on identifiers common across subject areas (e.g., Party Id), so that all information regarding that ID will be co-resident and joins will be AMP-local. These non-unique PIs (NUPIs) are a great starting point for your PDM, but again they need to be evaluated against customer data and access plans to ensure that both performance and reasonably even data distribution are achieved.
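
    As a rough DDL sketch of that starting point, the two hypothetical tables below share a NUPI on Party_Id rather than carrying their own PKs as UPIs, so rows for the same party are co-resident and the join between them stays AMP-local. Table and column names are illustrative, not iDM structures.

        -- Hypothetical sketch: co-locating related tables on a shared NUPI.
        CREATE TABLE party_address
        ( party_id      INTEGER     NOT NULL
        , address_id    INTEGER     NOT NULL
        , address_type  VARCHAR(20)
        )
        PRIMARY INDEX (party_id);      -- NUPI: non-unique, chosen for co-location

        CREATE TABLE party_account
        ( party_id      INTEGER     NOT NULL
        , account_id    INTEGER     NOT NULL
        , party_role_cd VARCHAR(10)
        )
        PRIMARY INDEX (party_id);      -- same NUPI, so a join on party_id needs no redistribution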

    Even data distribution across the Teradata platform is important, since skewed data can contribute both to poor performance and to space allocation problems (run out of space on one AMP and you run out of space on all). However, even distribution can be overemphasized to the detriment of performance.

    Say, for example, a table has a PI of PRODUCT_ID, and there are a disproportionate number of rows for several products, causing skewed distribution. Altering the PI to the table’s PK instead will provide perfectly even distribution. But remember, when joining to that table, if all elements of the PK are not available, then the rows of the table will need to be redistributed, most likely by PRODUCT_ID.

    This puts them back on the AMPs where they were in the skewed scenario. This time, instead of a “rest state” skew, the rows will skew during redistribution, and this will happen every time the table is joined, which again is not a solid performance decision. Optimum performance can therefore be achieved with sub-optimum distribution.
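
    When data and access demographics do become available, one common way to see how a candidate PI actually distributes is to count rows per AMP with the hash functions. A hedged sketch, with a placeholder table and column:

        -- Rows per AMP for a candidate PI column; a heavily skewed candidate
        -- shows a few AMPs carrying far more rows than the rest.
        SELECT  HASHAMP(HASHBUCKET(HASHROW(product_id)))  AS amp_no
             ,  COUNT(*)                                  AS row_cnt
        FROM    sales_detail
        GROUP BY 1
        ORDER BY 2 DESC;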

    iDM tables relating two common identifiers will usually have one of the IDs pre-selected as a NUPI. In some installations the access demographics will show that the other ID may be the better choice. If so, change it! Or the demographics may leave you with no clear choice, in which case picking one is almost assuredly better than changing the PI to a composite index consisting of both IDs, as this will only result in a table that is no longer co-resident with any table indexed by either of the IDs alone.

    There are many other factors that contribute to achieving optimal performance of your physical model, but they all pale in comparison to a well-chosen PI. In my next blog we’ll look at some more of these and discuss when and how best to implement them.


    Jake Kurdsjuk is Product Manager for the Teradata Communications Industry Data Model, purchased by more than one hundred Communications Service Providers worldwide. Jake has been with Teradata since 2001 and has 25 years of experience working with Teradata within the Communications Industry, as a programmer, DBA, Data Architect and Modeler.

    The post Optimization in Data Modeling 1 – Primary Index Selection appeared first on Data Points.

    Teradata Blogs Feed

     
  • admin 9:51 am on May 14, 2015 Permalink
    Tags: Modeling

    Your Big Data Initiative may not Require Logical Modeling 

    By: Don Tonner

    Logical Modeling may not be required on your next big data initiative.  From experience, I know when building things from scratch that a model reduces development costs, improves quality, and gets me to market quicker.  So why would I say you may not require logical modeling?

    Most data modelers are employed in forward engineering activities in which the ultimate goal is to create a database or an application used by companies to manage their businesses.  The process is generally:

    • Obtain an understanding of the business concepts that the database will serve.
    • Organize the business information into structured data components and constraints—a logical model.
    • Create data stores based on the logical model and let the data population and manipulation begin.

    Forward engineering is the act of going from requirements to a finished product. For databases that means starting with a detailed understanding of the information of the business, which is found largely in the minds and practices of the employees of the enterprise. This detailed understanding may be thought of as a conceptual model. Various methods have evolved to document this conceptual richness; one example is the Object Role Model.

    The conceptual model (detailed understanding of the enterprise; not to be confused with a conceptual high level E/R diagram) is transformed into a logical data model, which organizes data into structures upon which relational algebra may be performed. The thinking here is very mathematical. Data can be manipulated mathematically the same way we can manipulate anything else mathematically. Just like you may write an equation that expresses how much material it might take for a 3D printer to create a lamp, you may write an equation to show the difference between the employee populations of two different corporate regions.
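
    For instance, the regional headcount comparison above is just an aggregate operation over the Employee structure of the logical model. A hypothetical sketch, with made-up table and column names:

        -- Difference in employee headcount between two corporate regions.
        SELECT
            SUM(CASE WHEN region_cd = 'EAST' THEN 1 ELSE 0 END)
          - SUM(CASE WHEN region_cd = 'WEST' THEN 1 ELSE 0 END) AS headcount_difference
        FROM employee;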

    The image that most of us have of a data model is not equations, variables, or valid operations, but the visual representation of the structures that represent the variables. Below you can see structures as well as relationships, which are a kind of constraint.

    Data Structures and Relationships

    Ultimately these structures and constraints will be converted into data stores, such as tables, columns, indexes and data types, which will be populated with data that may be constrained by some business rules.

    Massively parallel data storage architectures are becoming increasingly popular as they address the challenges of storing and manipulating almost unimaginable amounts of data. The ability to ingest data quickly is critical as volumes increase. One approach is receiving the data without prior verification of the structure. HDFS files or JSON datatypes are examples of storage that do not require knowledge of the structure prior to loading.
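
    As a hedged illustration of the JSON-datatype case, a landing table can accept documents whose internal structure is never declared up front, with that structure imposed only when the data is queried. The table, columns, and dot-notation paths below are assumptions for illustration, not a specific product schema.

        -- Hypothetical landing table: the payload structure is not declared in the DDL.
        CREATE TABLE sensor_landing
        ( reading_id   BIGINT NOT NULL
        , machine_id   INTEGER
        , reading_ts   TIMESTAMP(0)
        , payload      JSON(16000)     -- raw sensor document, schema applied on read
        )
        PRIMARY INDEX (machine_id);

        -- Structure is interpreted only at query time.
        SELECT machine_id
             , payload.temperature AS temperature
             , payload.vibration   AS vibration
        FROM   sensor_landing
        WHERE  reading_ts > CURRENT_TIMESTAMP - INTERVAL '8' HOUR;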

    OK, imagine a project where millions of readings from hundreds of sensors from scores of machines are collected every shift, possibly into a data lake. Engineers discover that certain analytics performed on the machine data can potentially alert us to conditions that may warrant operator intervention. Data scientists will create several analytic metrics based on hourly aggregates of the sensor data. What’s the modeler’s role in all this?

    The models you are going to use on your big data initiative likely already exist.  All you have to do is find them.

    One approach would be to reverse engineer a model of the structures of the big data, which can provide visual clues to the meaning of the data. Keep in mind that big data sources may have rapidly changing schemas, so reverse engineering may have to occur periodically on the same source to gather potential new attributes. Also remember that a database of any kind is an imperfect representation of the logical model, which is itself an imperfect representation of the business. So there is much interpretation required to go from the reverse engineered model to a business understanding of the data.

    One would also start reviewing an enterprise data model or the forward engineered data warehouse model. After all, while the big data analytic can help point out which engines are experiencing conditions that need attention, when you can match those engine analytics to the workload that day, the experience level of the operator, and the time since the last maintenance, you greatly expand the value of that analytic.

    So how do you combine data from disparate platforms? A logical modeler in a forward engineering environment assures that all the common things have the same identifiers and data types, and this is built into the system. That same skill set needs to be leveraged if there is going to be any success performing cross-platform analytics. The identifiers of the same things on the different platforms need to be cross-validated in order to make apples-to-apples comparisons. If analytics are going to be captured and stored in the existing Equipment Scores section of the warehouse, the data will need to be transformed to the appropriate identifiers and data types. If the data is going to be joined on the fly via Teradata QueryGrid™, knowledge of these IDs and datatypes is essential for success and performance.
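
    A minimal sketch of the identifier-alignment point: before the warehouse’s equipment data can be matched to analytics landed from another platform, the shared key has to be reconciled to one identifier and data type. All names below are hypothetical, and in practice the remote table might be reached through a QueryGrid foreign-server reference rather than a local copy.

        -- Hypothetical: engine analytics from the data lake joined to warehouse
        -- equipment data; the join only works, and performs, when the equipment
        -- identifier is reconciled to a common type on both sides.
        SELECT  e.equipment_id
             ,  e.last_maintenance_dt
             ,  a.vibration_score
        FROM    dw.equipment          e
        JOIN    lake.engine_analytics a
          ON    e.equipment_id = CAST(a.engine_id AS INTEGER)   -- align identifier types
        WHERE   a.score_dt = CURRENT_DATE - 1;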

    There are many other modern modeling challenges; let me know what has your attention.

    Don Tonner is a member of the Architecture and Modeling Solutions team, and has worked on several cool projects such as Teradata Mapping Manager, the unification modules, and Solution Modeling Building Blocks. He is currently creating an Industry Dimensions development kit and working out how models might be useful when combining information from disparate platforms. You can also reach him on Twitter, @BigDataDon.

    The post Your Big Data Initiative may not Require Logical Modeling appeared first on Data Points.

    Teradata Blogs Feed

     
  • admin 9:51 am on March 3, 2015 Permalink
    Tags: Modeling

    Data-Driven Design: Smart Modeling in the Fast Lane 

    In this blog, I would like to discuss a different way of modeling data, regardless of the method, whether Third Normal Form, Dimensional, or analytical datasets. This way of data modeling cuts down development cycles by avoiding rework, keeps the process agile, and produces higher quality solutions. It’s a discipline that looks at both requirements and data as inputs to the design.

    A lot of organizations have struggled to get the data model correct, especially for applications, and this has a big impact on different phases of the system development lifecycle. Generally, we elicit requirements first, with the IT team and business users together creating a business requirements document (BRD).

    Business users explain business rules and how source data should be transformed into something they can use and understand. We then create a data model using the BRD and produce technical requirements documentation, which is then used to develop the code. Sometimes it takes over nine months before we start looking at the source data. This delay in engaging the data almost always causes rework, since the design was based only on requirements. The other extreme is a design based only on data.

    We have almost always based the design solely on either requirements or data, hardly ever on both. We should give the business users what they want while being mindful of the realities of the data.

    It has been almost impossible to employ both methods, for different reasons. One is the traditional waterfall method, where BDUF (Big Design Up Front) is produced without ever looking at the data. In other cases we do work with data, but the data was created for a proof of concept or for testing and is far from the realities of production data. To do this correctly, we need JIT (Just in Time) or good-enough requirements, and then we need to get into the data quickly and mold our design based on both the requirements and the data.

    The idea is to get into the data quickly and validate the business rules and assumptions made by business users. Data-driven design is about engaging the data early. It is more than data profiling, as data-driven design inspects and adapts in the context of the target design. As we model our design, we immediately begin loading data into it, often by day one or two of the sprint. That is the key.

    Early in the sprint, data-driven design marries the perspective of the source data to the perspective of the business requirements to identify gaps, transformation needs, quality issues, and opportunities to expand our design. End users generally know the day-to-day business but are not aware of the data.

    The data-driven design concept can be used whether an organization is practicing a waterfall or agile methodology. It obviously fits very nicely with agile methodologies and Scrum principles such as inspect and adapt: we inspect the data and adapt the design accordingly. Using data-driven design (DDD) we can test the coverage and fit of the target schema from the analytical user’s perspective. By encouraging the design and testing of the target schema using real data in quick, iterative cycles, the development team can ensure that the target schema designed for implementation has been thoroughly reviewed, tested, and approved by end users before the project build begins.

    Case Study: While working with a mega-retailer on one of the projects, I was decomposing business questions. We were working with the promotions and discounts subject area, and we had two metrics: Promotion Sales Amount and Commercial Sales Amount. Any item sold as part of a promotion counts toward Promotion Sales, and any item sold at the regular price counts toward Commercial Sales. Please note that Discount Amount and Promotion Sales Amount are two very different metrics. While decomposing, the business user described how each line item within a transaction (header) would have the discount amount evenly proportioned across it.

    For example, let’s say there is a promotion where if you buy 3 bottles of wine you get 2 bottles free. In this case, according to the business user, the discount amount would be evenly proportioned across the 5 line items, indicating that these 5 line items are on promotion and we can count their sales toward Promotion Sales Amount.

    This wasn’t the case when the team validated the scenario against the data. We discovered that the discount amount was only present for the “get” items and not for the “buy” items. Using our example, the discount amount was provided for the 2 free bottles (get) but not for the 3 purchased bottles (buy). This makes it hard to calculate Promotion Sales Amount for the 3 “buy” items, since it wasn’t known whether the customer bought just 3 items or 5 items unless we looked at all the records, which numbered in the millions every day.

    What if the customer bought 6 bottles of wine, so that ideally 5 lines are on promotion and the 6th line (diagram above) is commercial, or regular, sales? Looking at the source data, there was no way of knowing which transaction lines were part of the promotion and which weren’t.
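
    A hedged sketch of why the stated rule breaks against the data: if the discount really were proportioned across every promoted line, Promotion Sales Amount could be summed directly from the discounted lines, but with discounts present only on the “get” lines that same query misses the “buy” bottles. The transaction-line table and columns are illustrative only.

        -- What the stated business rule implied: every promoted line carries a
        -- discount, so promotion sales is simply the sales on discounted lines.
        SELECT SUM(line_sales_amt) AS promotion_sales_amt
        FROM   txn_line
        WHERE  discount_amt <> 0;

        -- What the data showed: discount_amt is populated only on the "get" lines
        -- (the 2 free bottles), so the 3 "buy" bottles fall through to commercial
        -- sales and Promotion Sales Amount is understated.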

    After this discovery, we had to let the business users know about the inaccuracy in calculating Promotion Sales Amount. Proactively, we designed a new fact to accommodate the reality of the data. The team also discovered more complicated scenarios that the business user hadn’t thought of.

    In the example above, the “buy” and “get” items were the same product: wine. We also found a scenario where a customer bought a 6-pack of beer and got a glass free, which adds further complexity. After validating the business rules against the source data, we had to request additional “buy” and “get” list data to properly calculate Promotion Sales Amount.

    Imagine finding out that you need additional source data to satisfy business requirements nine months into the project. Think about the change requests for the data model, development, testing, and so on. With DDD, we found this out within days and adapted to the “data realities” within the same week. The team also discovered that the person at the POS system could either scan one wine bottle and multiply it by 7 or “beep” each bottle one by one. This inconsistency makes a big difference: one record versus 7 records in the source feed.

    There were other discoveries we made along the way as we got into the data and designed the target schema while keeping the reality of the data in mind. We were also able to ensure that the source system had the grain that the business users required.


    Sachin Grover leads the Agile group within Teradata. He has been with Teradata for 5 years and has worked on development of Solution Modeling Building Blocks and helped define best practices for semantic data models on Teradata. He has over 10 years of experience in the IT industry as a BI / DW architect, modeler, designer, analyst, developer and tester.

    The post Data-Driven Design: Smart Modeling in the Fast Lane appeared first on Data Points.

    Teradata Blogs Feed

     
  • admin 9:51 am on January 11, 2015 Permalink
    Tags: Brave, Modeling

    Brave New World: A Primer for the Evolving Practice of Data Modeling 

    The term “data-driven” is gaining huge momentum, and for good reason: never before have an organization’s data assets been so useful, valuable, and complicated. That’s exactly why data modeling, along with data normalization, is such a hot topic today. In this blog we will talk about “modeling the data.” My approach here is directed toward my general readers: business people or IT executives who are involved, or anticipate becoming involved, in a data warehouse or big data project.

    Organize Your Data

    What is meant by “modeling the data?” Modeling the data is about data and how to organize it. An enterprise has data – lots of it. Airlines have passenger records, flight schedules, reservation transactions, and marketing promotions. Telecommunication companies have call detail records, contracts and payment history. Banks have accounts, customers, transactions, and channels. Retailers have point of sale transactions, inventory and reward programs. And now it is not just data, but it is BIG DATA. There are clickstream, e-mail and sensor data to decipher and add to the pot.

    But it is not one pot – it is lots of little pots. And these little pots of data are spread all over the enterprise and they keep multiplying like Mickey Mouse’s buckets of water from Fantasia. If you are a bank, you may have one bucket full of loan accounts, another bucket full of deposit accounts and another bucket full of ATM transactions. You need a way to bring together and manage these pots but you have to do it in a systematic manner – after all you need to send reports to the government regulators and the information needs to be correct.

    Walt Disney’s Fantasia – The Sorcerer’s Apprentice 1940

    Whatever the business reason is, you know you need to integrate your data because the only way to produce regulatory reports for risk management, or to understand the many ways your customers interact with you and why, or to measure customer profitability, or to market to your customer as if they were the only one, is to bring the data together into one happy pot. This is the business purpose, and a business purpose drives the scope of your data warehouse.

    Model the Data

    Once you decide your business purpose, one of the early next steps is to “model the data” based on that purpose. Business purposes could include regulatory reporting or integrated channel marketing. Modeling the data involves both business people and IT people. If you think about all the kinds of data you have in the enterprise, or even in one department, you could come up with hundreds if not thousands of kinds of data that you need to understand and run your business.

    What do I mean by “kinds of data”? In this context I mean labels or descriptors of data (or metadata) such as first name, middle name, last name, residence street address, residence city, cell phone number, etc. Each of these kinds of data needs to be organized into groups. In a relational database world these groups are tables. Think of it as a lot of Excel spreadsheets where the column labels are the kinds of data and the rows are the data itself. And you will have not just one spreadsheet but hundreds. You need to manage all these tables because they all relate to each other in some manner; they are interconnected.

    The business people need to be involved in this step because the data model (how the data is organized into tables) is based on the rules of the business – how the business works. The IT people are involved because they know the intricacies of data modeling and what questions to ask of the business people.

    When we “model the data” we create a graphic with boxes and lines to show each table and how it relates to other tables. A table of loan accounts needs to relate to the table of customers so that you know who the account holders are. In the example below we had to add a third table (INDIVIDUAL ACCOUNT) to allow for the fact that one account can have many individual account holders (e.g. joint accounts) and an individual can have many accounts. This is based on a business rule. This graphic acts as a communication tool (similar to an architectural blueprint) between and within the business and IT.
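
    In relational terms the third table is a simple associative structure carrying the account-to-individual pairings; a minimal sketch, with illustrative column choices:

        -- One account can have many holders, and one individual can hold many
        -- accounts, so INDIVIDUAL ACCOUNT resolves the many-to-many relationship.
        CREATE TABLE account
        ( account_id    INTEGER NOT NULL PRIMARY KEY
        , account_type  VARCHAR(20)
        );

        CREATE TABLE individual
        ( individual_id INTEGER NOT NULL PRIMARY KEY
        , full_name     VARCHAR(100)
        );

        CREATE TABLE individual_account
        ( account_id    INTEGER NOT NULL REFERENCES account (account_id)
        , individual_id INTEGER NOT NULL REFERENCES individual (individual_id)
        , holder_role   VARCHAR(20)      -- e.g. primary vs. joint holder
        , PRIMARY KEY (account_id, individual_id)
        );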

    Modeling the data does not have to be a huge step. Pick a business scope that is small and manageable as your first project and which will give you business value. You can also use predefined industry data models as a reference to jump start this activity and avoid reinventing the wheel.

    Data Model Example – Accounts and Individuals Reflect Business Rules

    Data Modeling Summarized

    Once you decide that you need to integrate your data based on a business purpose, then modeling the data is one of the first steps you will do in organizing your data. It does not have to be a huge step but can be driven by a finite business purpose and can be accelerated by using existing industry data models as a reference. In my next blog I will discuss normalization.


    About the Author: Nancy Kalthoff is the product manager and original data architect for the Teradata financial services data model (FSDM), for which she received a patent. She has been in IT for over 30 years and with Teradata for over 20 years, concentrating on data architecture, business analysis, and data modeling.

    Teradata Blogs Feed

     