What is a Data
Warehouse?
A Data Warehouse is not something shrink-wrapped i.e. you
take a set of CDs and install into a box and soon you have a Data Warehouse up
and running. A Data Warehouse evolves over time, you don’t buy it. Basically it
is about taking/collecting data from different heterogeneous sources.
Heterogeneous means not only the operating system is different but so is the underlying
file format, different databases, and even with same database systems different
representations for the same entity. This could be anything from different
columns names to different data types for the same entity.
Companies collect and record their own operational data, but
at the same time they also use reference data obtained from external sources
such as codes, prices etc. This is not the only external data, but customer
lists with their contact information are also obtained from external sources.
Therefore, all this external data is also added to the data warehouse.
As mentioned earlier, even the data collected and obtained
from within the company is not standard for a host of different reasons. For
example, different operational systems being used in the company were developed
by different vendors over a period of time, and there is no or minimal evenness
in data representation etc. When that is the state of affairs (and is normal)
within a company, then there is no control on the quality of data obtained from
external sources. Hence all the data has to be transformed into a uniform
format, standardized and integrated before it can go into the data
warehouse. In a decision support
environment, the end user i.e. the decision maker is interested in the big
picture. Typical DSS queries do not involve using a primary key or asking
questions about a particular customer or account. DSS queries deal with number
of variables spanning across number of tables (i.e. join operations) and
looking at lots of historical data. As a result large number of records are
processed and retrieved. For such a case, specialized or different database
architectures/topologies are required, such as the star schema. We will cover
this in detail in the relevant lecture.
Typical Queries
OLTP (On Line Transaction Processing) specific query
Select tx_date, balance from tx_table
Where account_ID = 23876;
D DW WH H s sp pe ec
ci if fi ic c q qu ue er ry y
Select balance, age, sal, gender from customer_table and
tx_table
Where age between (30 and 40) and
Education = ‘graduate’ and
CustID.customer_table = Customer_ID.tx_table;
0 Comments
Please do not enter any spam link in the comment box.