Hadoop and the Data Warehouse: Hadoop 101 for Enterprise Data Professionals – Dr. Ralph Kimball Answers Your Questions
Thank you for the phenomenal response to our recent webinar with Dr. Ralph Kimball: Hadoop and the Data Warehouse: Hadoop 101 for Enterprise Data Professionals. Many of you submitted chat questions that we weren’t able to answer live. Happily, Dr. Kimball took the time to address the majority of these questions. Grab something to drink, sit back, and check out his answers below (there are quite a few!).
Next up: Join Dr. Ralph Kimball and our own Eli Collins from Cloudera on 5/29/14 for the next topic in the series: Best Practices for Hadoop and the Data Warehouse: EDW 101 for Hadoop Professionals.
If you weren’t able to join us live, check out the replay.
- Is DW in the future still going to target “decision makers” or far beyond the “elite” group?
To be honest, I have never considered decision makers to be part of an elite group. Rather, I think anyone using information is a decision maker. This goes back to the very earliest pre-data warehouse systems, which were called Decision Support Systems.
- Outside of transactional systems (OLTP) do you think RDBMS will die out soon?
I took some pains in the webinar to state my belief that RDBMSs are a huge, permanent tool for organizations. They will always maintain their place for querying, reporting, and analyzing relational data. But I think Hadoop takes a lot of pressure off these systems to operate on non-relational text and number data.
- Together, storage and processing is obviously of value. What happens when one wants to use storage only, or processing only? Does the value remain? What is it?
One early use case for Hadoop and HDFS has been as an inexpensive archive for all forms of data. As I pointed out in the webinar, this story is even more interesting because this archive is “active”. When you say storage only, I assume that when you decide to use the data, you will transfer the data to some other platform. That works for me. Not quite sure what you mean by processing only. Maybe in this case, you temporarily transfer the data into Hadoop for processing either to offload other computational resources, or to take advantage of unique analytic applications that run in the Hadoop ecosystem. Then when you are done, you delete the data from Hadoop. Again, I could see that happening, but in this case, given the storage advantages of Hadoop, seems like you might not want to do the delete…
- What would be the implications for the Power Users Experience?
I assume you mean analysts who have considerable computer experience, either with sophisticated Excel spreadsheets, or high end analytic tools such as SAS, R, or analytic extensions of RDBMSs such as MADlib in Impala or PostgreSQL. I think the short answer is that Hadoop is an ideal framework for supporting all these approaches. SAS, R, and MADlib have all been available in Hadoop for a while.
- What is meant by “Query engines can access HDFS files before ETL”? How is that possible?
To expand that sentence, what I mean is that “Query engines can access HDFS files via their metadata descriptions in HCatalog without requiring the data to be transferred to a relational table as in a conventional DBMS”. Thus the step of physically transferring the data can be avoided, in this exploratory BI scenario. You still have to interpret the HCatalog description of the file in your query engine, which is basically defining a view. You could say that this view declaration is a form of ETL, and I would grant you that.
- How about ELT?
Also: We are spearheading SQL-push down ELT Data Integration on Hadoop, how do you see that going in future?
ELT simply means performing data transformations using SQL through the query engine. Thus the SQL UPDATE and INSERT INTO statements are your main tools. This is the point where you sit down with the system reference manual for Hive, or Impala, or your relational query engine of choice, and carefully study how these commands are supported.
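As a minimal sketch of the ELT pattern (the table and column names here are hypothetical, and the statement is written in the Impala/Hive style):

```sql
-- Raw data lands in Hadoop as-is; the transformation is then expressed
-- in SQL and executed by the query engine itself -- that is the "T after L".
INSERT INTO page_views_clean
SELECT
  user_id,
  CAST(event_time AS TIMESTAMP)  AS event_ts,
  lower(trim(page_url))          AS page_url
FROM page_views_raw
WHERE user_id IS NOT NULL;
```

The transformation logic (casting, trimming, filtering) runs where the data already lives, rather than in a separate ETL engine before loading.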
- ‘Familiar SQL’ Does this mean ANSI SQL?
Hive and Impala support the bulk of ANSI standard SQL, but not all of the advanced commands. For example, I believe that the full semantics of common table expressions (CTEs) coupled with UNION ALL is not completely supported at this time. But each release of SQL support from these engines chips away at these edge cases with the goal of full ANSI support in the foreseeable future. By the way, not all of the legacy RDBMSs support the full semantics of ANSI SQL.
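To make the edge case concrete, here is the kind of statement to test against your engine's current release (the table names are invented for illustration; check the engine's documentation for what its release actually supports):

```sql
-- Two common table expressions combined with UNION ALL, then aggregated.
WITH q1 AS (SELECT region, sales FROM sales_2013),
     q2 AS (SELECT region, sales FROM sales_2014)
SELECT region, SUM(sales) AS total_sales
FROM (SELECT * FROM q1
      UNION ALL
      SELECT * FROM q2) combined
GROUP BY region;
```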
- “Exploratory DW/BI” sounds pretty much data-driven. Does this capture the nature of the big data analytics?
The distinction between the two threads I described (exploratory DW/BI and high performance DW/BI) is in the performance of the queries, not the semantics of the analysis. In fact, in some cases the original unmodified data may be required for certain NoSQL or SemiSQL tools to analyze.
- In brief, what can a traditional DW do that a Hadoop-DW can not do?
Also: He is saying this feeds EDW’s and DW’s, but not replace so what does the EDW and DW give you that Hadoop cannot easily?
I have been asked many “can do versus cannot do” questions over the years, whether it is a particular modeling approach, or a row versus column DBMS, or a specific choice of hardware. And the answer, perhaps surprisingly, is that there is nothing that one environment can do that another cannot, in the strong sense. Certainly at this point in time, legacy EDWs are massively tuned and optimized, and have existing ETL pipelines for standard, familiar data. A Hadoop data warehouse cannot compete directly on performance for these legacy applications, although when cost is added to the equation, Hadoop may turn a lot of people’s heads. I did make the point in the webinar that Hadoop right now is doing a lot of things that conventional EDWs have not figured out how to do.
- Is it a good idea to store the data warehouse (facts, dimensions) in Hadoop and create OLAP/cubes directly with Impala?
Most OLAP/Cube deployments are (or should be) based on a star schema implementing dimensions and facts anyway. Your question is interesting and will be answered in detail in the second webinar from Kimball and Cloudera on May 29.
- Parquet columnar files – is this part of Hadoop? – haven’t seen it before…
Yes indeed. Try Googling “Parquet Hadoop” for a number of good sources.
- How can you describe Hadoop as more than just a distributed data warehouse?
Stepping back from the webinar where we deliberately narrowed our focus to see familiar data warehouse entities, actually Hadoop is a comprehensive applications ecosystem that supports much more than the data warehouse mission. Many of the use cases such as real time data streaming and specialty applications are architected with the data warehouse in mind. You might want to come to one of the big Hadoop conferences to get an idea of Hadoop’s breadth.
- Are the parquet columnar files separate physical files or just indexes?
They are real physical files, not indexes. Within each Parquet file, the data is organized into row groups, and within each row group the values for each column are stored contiguously, which is what makes column-oriented scans so efficient.
- Will the concept of the Slowly Changing Dimension go away in a Hadoop DW?
Also: Does Dimensional modeling still play a role in BigData?
Also: What is the future of powerful dimensional modeling with Hadoop on board?
Slowly Changing Dimensions (SCDs) are a fundamental approach to handling time variance in your entities (dimensions). Hadoop does not affect either the requirement to track time variance or the specific approaches, such as Type 1 and Type 2. Please tune in to the next Kimball-Cloudera webinar on May 29, where we will show you how to implement SCDs in Hadoop.
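To make the Type 2 idea concrete, here is a minimal sketch in Python (the entity and attribute names are invented for illustration; a real implementation would of course run as an ETL job against dimension tables):

```python
from datetime import date

def apply_scd_type2(dimension, incoming, today):
    """Type 2 SCD logic: when a tracked attribute changes, expire the
    current row and append a new row, preserving full history.
    Rows are dicts; end_date of None marks the current row."""
    current = {row["customer_id"]: row for row in dimension
               if row["end_date"] is None}
    for rec in incoming:
        cur = current.get(rec["customer_id"])
        if cur is None:
            # Brand-new entity: insert it as the current row.
            dimension.append({**rec, "start_date": today, "end_date": None})
        elif cur["city"] != rec["city"]:
            # Tracked attribute changed: expire old row, append new one.
            cur["end_date"] = today
            dimension.append({**rec, "start_date": today, "end_date": None})
    return dimension
```

Note that nothing here requires in-place updates of history rows other than closing out an end date, which is why the technique carries over to Hadoop.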
- Is this disruptive to the SDLC of a DW/BI system that involves semantic modeling and facts/dims?
The system development lifecycle (SDLC) of a DW/BI system is unaffected at the block diagram and logical levels. Obviously differences show up at the physical and programming levels. Again, please tune in to the next Kimball-Cloudera webinar on May 29 where we will take you through some of these details.
- Is performance a key driver for Parquet files?
Yes, performance is the primary motivator for Parquet. Parquet is also more efficient in terms of disk space, and supports nested structures, which “raw” file formats typically do not (JSON being the exception). Because Parquet is supported throughout the Hadoop ecosystem it may be more portable than the particular raw format that is used.
- I know HDFS is immutable and no updates are supported, but is there any way that we can expect ACID transactions to become applicable to HDFS? Is that something in Impala that will come up with Parquet?
I don’t understand all of this question. Do you mean ACID transaction requirements which principally apply to OLTP systems and only indirectly to DW/BI systems? As to your second question, Impala works beautifully with Parquet files.
- Good morning Doctor Kimball, should we (DWH architect) forget about Star schema designs and jump on the Hadoop infrastructure?
No! You will hear a powerful argument in the second Kimball-Cloudera webinar as to why star schemas and dimensional modeling are as important as ever in a Hadoop data warehouse.
- ETL tools that talk to traditional EDWs understand the Metadata and the Data Model underneath. With Hadoop integration, aren’t we pushing a structure / data model upfront around a layer which can have a data model defined later at data retrieval? In other words, how easy or complex is the ETL from a process perspective in the Hadoop environment?
ETL tools can move data from point A to point B, or they can invoke complex processes to transform the data, or they can do both at once. The ETL tools would read the HCatalog metadata and use that as what you call the schema. There is a subtle point in all of this, whether it is a Hadoop DW or a conventional DW, and that is that the final query schemas are actually defined in the query layer after the ETL is all done, not the system table/metadata layer.
- Should we use ETL to copy data from HDFS to columnar files?
Yes, this transformation step, which is a basic and integral part of Hadoop, is part of what I call ETL. Not sure that this is anything more than labeling the step as ETL.
- Can Dr. Kimball present a use case of Hadoop in the banking sector, or any other sector that doesn’t see this huge array of data types (audio/video/GPS, etc.)?
Banking tends to be more data aware than perhaps you might realize. Hadoop invites a bank to store not only every transaction (trillions over time in a large bank) but a lot of “sub-transactional” data involving their customer’s life stages and behaviors and preferences. Additionally, banks and other financial institutions look broadly at real time data to detect check kiting, broker collusion, fraud, identity violations, and hacking. Although some of these applications may be considered to be “production”, it then takes a data warehouse to make many of the decisions about what to do. Finally, I am sure you are aware that all the banks keep digital images of every check, and what do you suppose they do with the output of 10 or more surveillance cameras in every bank? And what about all the documentation supporting a mortgage? This list goes on…
- Is data in Parquet a duplicate/replication of the original HDFS files?
Parquet files can be created in two different basic ways. Either the Parquet file is a transformation of a separate original HDFS file, or the Parquet file itself is an original HDFS file because the data loader is commanded to create it directly on initial load.
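Both paths can be sketched in Impala-style DDL (the table names here are hypothetical):

```sql
-- Way 1: transform an existing text-format table into a new Parquet table.
CREATE TABLE sales_parquet STORED AS PARQUET
AS SELECT * FROM sales_text;

-- Way 2: declare the table as Parquet up front, so loaded data is
-- written directly in Parquet with no separate original copy.
CREATE TABLE sales (sale_id BIGINT, amount DECIMAL(9,2))
STORED AS PARQUET;
```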
- Can Hadoop totally replace usage of ETL tool like Informatica for loading data warehouse?
No! Hadoop is not an ETL tool, it is an application ecosystem. If you are lucky enough to already have an ETL tool, then you can use your familiarity with it to move and transform data within Hadoop in ways similar to the conventional world.
- Are you planning a new edition of the ‘toolkit’ – can’t wait to get it if you are.
Thanks for the encouragement!
- About ETL process: how can “merge statements” be realized (update existing data, insert new data), what are alternative recipes for this topic?
Hive and Impala do not currently support the MERGE statement introduced in SQL 2003 though I expect they will add it. In the mean time the same (more verbose) techniques from the pre-SQL 2003 world work. You can also use a programming language other than SQL to implement arbitrarily sophisticated merges.
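One of those more verbose pre-MERGE techniques is to rebuild the target as the union of updated and unchanged rows. A sketch, with hypothetical table and column names (in practice you would often write the result to a fresh table and then swap names, rather than overwriting a table you are reading from):

```sql
INSERT OVERWRITE TABLE customer_dim
SELECT s.customer_id, s.name, s.city   -- incoming version wins
FROM staging s
UNION ALL
SELECT d.customer_id, d.name, d.city   -- keep rows not present in staging
FROM customer_dim d
LEFT JOIN staging s ON d.customer_id = s.customer_id
WHERE s.customer_id IS NULL;
```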
- Where do we do dimensional modeling with Hadoop? If source data is going to come in as it is, then are we saying that we are doing away with dimensional modeling?
Dimensional modeling is as necessary in Hadoop as it is in the conventional RDBMS environment. Please tune into the second Kimball-Cloudera webinar on May 29, where we will address exactly this point.
- The loading utilities are going to take care of converting to the desired format and storing in HDFS file systems?
Also: How can HDFS integrate with CDC for incremental load from sources?
As I teach in my ETL classes, change data capture (CDC) can take place at the original source of the data before it is transferred to HDFS, or it can be a rather typical ETL job entirely within HDFS where yesterday’s data is compared to today’s data and the differences detected. Maybe the short answer to this question is that the logic of using CDC techniques is not materially affected by having HDFS as the destination.
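The compare-yesterday-to-today job can be sketched in a few lines of Python (field names are illustrative; a real job would operate on files or tables rather than in-memory lists):

```python
def detect_changes(yesterday, today, key="id"):
    """Diff-based CDC: compare yesterday's and today's extracts
    (lists of dicts) and classify rows as inserts, updates, or deletes."""
    prev = {row[key]: row for row in yesterday}
    curr = {row[key]: row for row in today}
    inserts = [curr[k] for k in curr.keys() - prev.keys()]
    deletes = [prev[k] for k in prev.keys() - curr.keys()]
    updates = [curr[k] for k in curr.keys() & prev.keys()
               if curr[k] != prev[k]]
    return inserts, updates, deletes
```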
- Hi, Sir. In DWH analysis, end users need stable data for reporting, but in Hadoop data flows in continuously, so how do we overcome this?
Actually, Hadoop is in some ways more stable than other environments because once data is written to an HDFS file, it is immutable. HDFS files are write once. Yes, Hadoop is purpose-built to sustain very large transfer rates into HDFS, but it’s all additive: nothing gets deleted.
- Dr. Kimball, I am still a little bit confused on the ETL perspective. You mention Informatica would read through Sqoop. Would SSIS be able to leverage the same approach? In addition, typically in the ETL world we start by building our dimensions and related data. How is the world doing that here?
The benefit of Cloudera’s open source and open standards approach is that popular ETL tools can integrate over simple ODBC, extending existing ETL investments and skills to Hadoop. Cloudera provides a standard ODBC driver that tools such as SSIS can use. Please tune in to our next Kimball-Cloudera webinar on May 29 where we will describe building dimension tables and fact tables in detail.
- Is Sqoop relational DW to HDFS, or the reverse, or both ways?
If I understand your question, yes, Sqoop is used to transfer data both into and out of HDFS to or from a remote RDBMS.
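As a sketch of both directions (the connection string, table names, and HDFS paths here are hypothetical placeholders for your environment):

```shell
# Import a table from an RDBMS into HDFS:
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table orders \
  --target-dir /warehouse/orders

# Export an HDFS directory back out to an RDBMS table:
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --table order_summary \
  --export-dir /warehouse/order_summary
```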
- Do you think from feasibility perspective that traditional EDW compared to Hadoop is more focused on historical perspective and analysis. In other words are the best practices of implementing these different techniques/methods different in the way you want to handle/view the data?
Frankly I think that the 1990’s view that the data warehouse is only a library of historical data is way behind us. Shortly after the year 2000, operational users came flooding into the data warehouse with urgent mostly real time needs for watching processes and making decisions. Having said that, one of the side benefits of Hadoop and HDFS is that they have been built for extreme I/O performance. Rather than changing the whole paradigm of data warehousing, Hadoop will encourage some even more aggressive hybrid data warehouse applications where real time data is compared to historical data. We do that a fair amount today in conventional data warehouses but maybe not as aggressively as we will be able to do it in Hadoop.
- An RDBMS approach typically needs upfront data analysis before proceeding with design or development. How different is it for Hadoop, can you speak to whether data can get registered in HCatalog without analyzing data structure, unique keys/identifiers, foreign keys, etc?
You do not need to do upfront analysis – this is the idea behind “schema on read”. The best way to understand this is via the tutorial on how to create a table in Impala: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref_sql.html?scroll=create_table_unique_1
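A minimal sketch of schema on read, with hypothetical names: the table definition simply points at files already sitting in HDFS, and no keys, indexes, or upfront profiling are required, because the schema is applied only when queries run.

```sql
CREATE EXTERNAL TABLE web_logs (
  log_time  STRING,
  user_id   STRING,
  url       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/web_logs';
```

Dropping such a table removes only the metadata; the underlying HDFS files are untouched and remain available to other applications.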
- Dr. Kimball, Do you see Hadoop framework as part of Staging area?
Yes, possibly. One use case for Hadoop is to treat it entirely as a staging area where the data is made ready for ultimate export to the conventional EDW. And of course, we usually describe the staging area also as an archive which we keep around for various purposes like system recovery. Since Hadoop is an ideal “active archive” it meets this requirement as well.
- We have a lot of business rules in in our ETL tools today. So data is transformed before moving into a DW. IF the data is raw coming into Hadoop (write once read many), where do we do the “transformation work” and store rules? Will this be the BI tools now?
Your ETL tools should continue to maintain the business rules. The only non-negotiable difference between HDFS and other file systems is that HDFS files are, as you point out, write once and read many. So you will need to tweak the low level logic within your ETL pipelines to replace table updates with HDFS file appends or replacements.
- Can we actually use Hadoop for Exploratory BI and Big Data Analytics and have the data outputs / results pushed into the EDW for traditional reports / Self service, so that we leverage the capabilities of Hadoop and Traditional EDW frameworks?
Yes. In the second Kimball-Cloudera webinar, we also hope to show a hybrid report created from combining queries from Hadoop and a conventional EDW.
- Does Dr. Kimball foresee a situation where there is a RDBMS between Hadoop and a BI tool?
I don’t see the architecture that way. Hadoop is a general applications ecosystem, which contains a number of RDBMSs, like Hive, Impala, and others. But maybe you mean that Hadoop is being used as an ETL staging area, and the transformed tables are exported from Hadoop into an external RDBMS. That would fit your description.
- If a text file is brought into HDFS and then turned into a Parquet file, and the text file is later updated (say more records are appended to the original CSV file), is the Parquet file also updated (via a pointer)? Or do we need to figure out how to ensure that updates to the text file trigger the respective Parquet file to be updated?
To the best of my knowledge there is no such automatic trigger intrinsic to the Parquet file that would detect new data appended to the original text file. You would need to have a separate ETL application for this.
- What role do Canonical models play in in Hadoop?
A canonical model is a standard model independent of a particular storage technology that allows data to be transferred between otherwise incompatible systems. Dimensional models are a type of canonical model because they can be deployed in quite different ways in specific RDBMSs. But more generally, XML and JSON payloads are widely used as canonical models, and Hadoop has extensive facilities for importing data in these formats.
- Don’t you think EDWs are obsolete with the volume, variety and velocity of data, especially if the value of the data is short-lived?
As I said in the webinar, conventional RDBMSs will be with us forever as they are superbly good at being OLTP engines and query targets for text and number data. Also, there are billions of lines of code in the ETL processes and BI tools that work quite well now. Although the bigness of Big Data is impressive, it is less interesting than the variety. That is where Hadoop really makes a sustainable difference as I argued in the webinar. Not sure what to make of “short lived” data. I have found that if the archiving cost of such data drops toward zero, reasons arise for keeping it.
- What are BI recommended tools for Hadoop/Cloudera env?
Most of your favorite BI tools have signed on to Hadoop a while ago, for the reasons I described in the webinar.
- Seems like a complete replacement for the Enterprise DW. Only thing I saw left for Enterprise DW is perhaps conforming dimensions to push back into Hadoop DW. Unless you are suggesting federation.
Also: Eventually why can’t EDH replace EDW?
Also: If the Cloudera solution utilizing Impala addresses the typical high performance concerns that a traditional enterprise DW/BI solution would require, why do we still say that the Hadoop-Cloudera stack is complementary to traditional systems?
I really and truly believe that the Enterprise DW will be with us forever, operating jointly with the Hadoop DW. As I stated in previous answers here, the Enterprise DW has very deep roots, and it is superbly good at being a query target for text and number data, especially for that data that is produced by an associated OLTP application.
- Can someone please define “Serious BI” ?
It is the opposite of “Frivolous BI.”
- What’s the point of introducing a DW on top of Hadoop? Doesn’t that create an extra barrier between the data and the analysis?
Hadoop is an applications ecosystem. One of the applications is the data warehouse, which serves a certain range of query and analysis functions. There is nothing about the Hadoop DW that gets in the way of doing analysis with other kinds of tools. Actually I made the point in the webinar that the data warehouse and other types of applications can simultaneously access the same data files, which is pretty mind boggling compared to other environments.
- When would the EDW 101 for Hadoop Professional be coming out? Also would it be offered online (since I am currently based out of India)?
May 29, and of course it will be on-line, and will immediately be available on the Cloudera website for viewing at a convenient time.
- Is the Kimball group considering offering a new Hadoop 4 or 5 day class (like the classic Data Warehouse Toolkit class)? Would this be with or through Cloudera?
Also: Hi, Excellent webinar. Will there Ralph Kimball Training on how to take advantage of the Hadoop technologies?
A little early to answer these questions, but thanks for asking!
- What are the migration costs and implications?
I assume you mean migration of the whole data warehouse infrastructure, rather than the much narrower question of just migrating data itself. The answer to the big question depends on whether at one extreme you are bringing up a new data source or at the other extreme you are attempting a plug compatible replacement of an Enterprise DW subject area with a Hadoop DW. You’ll need some consulting help to scope this answer.
- How are the goals of the mission/vision of data warehousing at the beginning of the session actually accomplished?
The best reference (in my humble opinion) for achieving all these goals is the book “The Data Warehouse Lifecycle Toolkit, Second Edition” (Wiley, 2008), Kimball, Ross et al. Please see the book section on www.kimballgroup.com. If your question is the specific implementation of these goals in the Hadoop environment, please contact Cloudera.
- Can anyone provide a simple explanation of the difference between MPP and MapReduce?
MapReduce is a specific programming framework that sits on a distributed set of processing nodes that can be described as a kind of massively parallel processing (MPP) system. But there are MPP relational databases, like Teradata, where the processors “share nothing” and are not based on the MapReduce framework. As the name suggests, MapReduce applications involve a Map phase, where the job is divided into chunks of work that run in parallel against the data, and then a Reduce phase, where the partial results are collected and combined into the final answer.
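The classic word-count example shows the two phases in miniature (this Python sketch runs the phases serially on one machine purely to illustrate the division of labor; the framework's value is running the Map work in parallel across many nodes):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word, independently per
    document, so this work can be spread across many nodes."""
    return [(word, 1) for doc in documents for word in doc.split()]

def reduce_phase(pairs):
    """Reduce: group the pairs by key (the 'shuffle') and combine the
    values for each key into a final count."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {word: sum(v for _, v in group)
            for word, group in groupby(pairs, key=itemgetter(0))}
```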
- Is there value in Data Virtualization and Hadoop to expose data on Hadoop to the business?
Good comment. In some ways, Exploratory BI as I described it in the webinar is an exercise in data virtualization. The data is exposed at the query level as if it were housed in relational tables, but the data is still in the original possibly raw files.
- Can we have integration with an existing RDBMS, running a conventional engine like Oracle?
Yes, specifically at the BI layer where the BI tool separately fetches back answer sets from the Hadoop DW and the Oracle engine. We intend to show this in action with one of the major legacy RDBMSs in the second webinar on May 29.
- Is it possible to conform the data?
Yes, we will show how this is done in Hadoop in the second Kimball-Cloudera webinar on May 29.
- Dimensional modeling typically leads to more joins. Since disk is not at a premium, do you think more de-normalized structures make sense in Hadoop?
I interpret your question as suggesting that dimension content be denormalized into the fact tables in order to eliminate the joins. As a general approach this is impractical because all the wide verbose dimension content would need to be replicated endlessly in every fact record. Even in the Parquet columnar file format this would not be effective in my opinion. Also, think of the implications of fully denormalizing all the dimensions into all the target fact tables. Now the same data is distributed everywhere, and administration of such dimensions would break down.
- Is HADOOP more compatible with a Kimball style or Inmon type DW?
I am somewhat partial to the Kimball approach myself. :) Bill Inmon supports a centralized notion of data warehousing in which all data is brought to a normalized data warehouse that is generally off limits to business users until IT publishes an aggregated “data mart” in response to a user request. Seems like this would be hard to do in a Hadoop environment where HDFS files, metadata, and query engines are mixed and matched at analysis time by the particular analysts. The Hadoop DW must coexist and share files with other non-DW applications outside of either Kimball’s or Inmon’s purview. Just my opinion.
- Do you think the need for aggregate fact tables goes away with Hadoop and ability to query detail level data quickly?
In some sort of perfect world where all computations ran infinitely fast, there would be no reason to build aggregate fact tables. But when an aggregate fact table that is 100 times smaller than the base table can answer a request, then in almost any system, it will be 100 times as fast. With the advent of in-memory processing, and with fact tables that are not too large (say less than a terabyte), it may not be worth the administrative overhead to build an aggregate fact table. Remember that certain changes in a dimension will force the aggregate fact table to be rebuilt.
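An aggregate fact table is just a pre-computed summary at a coarser grain. A sketch in Impala-style SQL (the table and column names are hypothetical, and the month-truncation function may vary by engine):

```sql
-- Pre-summarize the daily-grain fact table to the monthly grain.
-- A query at the monthly grain then scans a table roughly 30x smaller.
CREATE TABLE sales_fact_monthly STORED AS PARQUET AS
SELECT product_key,
       trunc(sale_date, 'MM') AS sale_month,
       SUM(sales_amount)      AS sales_amount
FROM sales_fact_daily
GROUP BY product_key, trunc(sale_date, 'MM');
```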
- Please, I really need an answer: is it a good idea to store the DW (facts, dimensions) in Hadoop and use BI tools with Impala?
Yes, it’s a great idea. Please tune in to the next Kimball-Cloudera webinar where that is exactly what we will talk about.