CONTINUOUS ASSESSMENT 5: BUSINESS INTELLIGENCE

Business Intelligence  BI

BUSINESS INTELLIGENCE

Business intelligence (BI) is a technology-driven process which encompasses a variety of tools, applications and methodologies to collect data from internal systems and external sources, prepare it for analysis and interpretation towards making sound decision for business development.

OBJECTIVE OF BUSINESS DEVELOPMENT

The ultimate objective of business intelligence is to improve the timeliness and quality of

information.

MAJOR COMPONENTS OF BUSINESS INTELLIGENCE

On-Line Analytical Package (OLAP)

OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modelling.

Real time BI

Real time BI allows for the real time distribution of metrics through email, messaging systems and/or interactive displays.

Advanced Analytics or Corporate Performance Management (CPM)

This set of tools allows business leaders to look at the statistics of certain products or services.

Data Warehousing

The data warehouse supports the physical propagation of data by handling the numerous enterprise records for integration, cleansing, aggregation and query tasks.

Data Sources

This component of BI involves various forms of stored data. It’s about taking the raw data and using software applications to create meaningful data sources that each division can use to positively impact business.

BENEFITS OF BUSINESS INTELLIGENCE

·         Facilitates fact-based decisions because the management of the organization would be able to see detailed, current data and reports on all aspects of the business.

·         Helps to improve sales and negotiations because it provides access to up-to-the-minute reports  which can be used to monitor sales and backup to negotiations with suppliers or other vendors.

·         Helps to eliminate wastes because  it can analyze transactions between subsidiaries and departments to identify areas of overlap or inefficiency business intelligence system works as a single, unified whole

·         Used by companies to assess own capabilities and identify opportunities

  • Provided good and standard organizational structure, created excellent multi-tasking abilities for the people and enhanced the flow of communication network in every organization.

LIMITATION TO BUSINESS INTELLIGENCE USAGE

  • Business intelligence system is very expensive and complex to implement

CONCLUSION

Powerful transaction -oriented information systems are now commonplace in every major industry, effectively levelling the playing field for corporations around the world. However, the business intelligence is still limited in use because most of the companies could not afford business intelligence system, and it’s very expensive and complex to implement

CA 4: DATA QUALITY AS PART OF DATA GOVERNANCE IN ORGANIZATIONS.

Data quality itself can be defined as “fitness for use”, a very broad definition that entails many aspects of quality.
publication1

COMPLETENESS
 
This is the extent to which the expected attributes of data are provided. However, Data Completeness definition is the ‘expected completeness’

 

Data Completeness

A customer data, for example, is considered as complete if the following conditions are satisfied:

 

  • All customer addresses, contact details and other information are available.
  • Data of all customers is available.
  • When it satisfies the ‘mandatory’ requirements

 

Data can be complete, but inaccurate

 

This occurs when:

 

  • All necessary details are available, but many of them are not correct.
  • The health records of all patients have ‘last visit’ date, but some of it contains the future dates.

 

TIMELINESS

 

This is the ability to release data or information to meet with users’ expectations. Thus, `’Data delayed’ is ‘Data Denied’.

 

TIMELY DATA

 

Data is timely when:

 

  • It is released within a given time frame e.g Companies are required to publish their quarterly results within a given frame of time.

 

  • When it is up to date e.g when Customers service is providing up-to date information to the customers.

 

Example of Data not being timely:

 

  • The courier package status is delivered, but it will be updated in the system only in the night batch run. This means that online status will not be available.

 

  • The financial statements of a company are published one month after the year-end.

 

  • The census data is available two years after the census is done.

 

CONSISTENCY

 

Consistency of Data means that data across the enterprise should be in synch with each other.

 

Data In-consistency: This is a situation whereby data is in synch in the narrow domain of an organization, but not in synch across the organization. For example:

 

  • A situation whereby a credit card is cancelled, and inactive, but the card billing status shows ‘due’.
  • A situation whereby an Airline promotion campaign closure date is Jan 31, and there is a passenger ticket booked under the campaign on Feb. 2.
  • Collection management system has the Cheque status as ‘cleared’, but in the accounting system, the money is not shown being credited to the bank account. Reason for this kind of inconsistency is that system interfaces are synchronized during the end-of-day batch runs.

 

VALIDITY AND INTEGRITY

 

Validity is the presentation of data that are devoid of errors while Integrity is associated with probity and coherence in presented data. All characteristics of the data including business rules, rules for how pieces of data relate, dates, definitions and lineage must be correct for data to be complete.

 

 

Data Validity and Example

A phone number is invalid when it contains extraneous items (errors) like symbols. For example a phone number is not expected to contain symbols like #, @, or % characters

 

Two types of Data Integrity and Example

 

  1. Entity Integrity: This is an integrity rule which states that every table must have a primary key and that the column or columns chosen to be the primary key should be unique and not null.

 

  1. Referential Integrity: The foreign key value refers to a primary key value of some table in the database. Depending on the rules of the business, a foreign key value can be null. In this case we are explicitly saying that either there is no relationship between the objects represented in the database or that this relationship is unknown.

 

ACCURACY

 

This refers to the correctness of the values stored for an object.

 

Example of Data Accuracy

 

If a personnel database has a BIRTH_DATE data element that expects dates in USA format, a date of 12/13/1941 would be correct. A date of 12/14/1941 would be inaccurate because it is the wrong value. A date of 13/12/1941 would be wrong because it is a European representation instead of a USA representation.

 

 

IMPACT OF DATA QUALITY

Impact of Data Quality on Business

  • Organizations that did put a focus on the quality of their data, saw a 15%-20% increase in revenue.
  • 77% of companies believe that Business Bottom line is affected by inaccurate and incomplete data

 

  • Due to poor data quality Businesses have seen 40% of their initiatives fail to achieve targeted benefits. This has significant effects on operational efficiency

 

  • Good implementation of Data quality initiative leads to 20%-40% increase in sales for a business
  • Poor data quality often breeds mistrust among internal departments and externally

 

 

Risk and Compliance impacts of Data Quality

  • Associated with credit assessment, investment risks, competitive risk, capital investment and/or development, fraud, and leakage, and compliance with government regulations, industry expectations, or self -imposed policies (such as privacy policies)

Financial Impact of Data Quality

  • Increased operating costs, decreased revenues, missed opportunities, reduction or delays in cash flow, or increased penalties, fines, or other charges.

Productivity Impacts of Data Quality

  • Increased workloads, decreased throughput, increased processing time, or decreased end -product quality.

 

HOW TO ENHANCE THE QUALITY OF ENTERPRISE DATA AS WELL AS STAKEHOLDERS’ CONFIDENCE

Because data quality improvement is a process and not an event, the following enterprise-wide disciplines should be phased in and improved upon over time:

  • A stronger personal involvement by management
  • High-level leadership for data quality
  • New incentives
  • New performance evaluation measures
  • Data quality enforcement policies
  • Data quality audits
  • Additional training for data owners and data stewards about their responsibilities
  • Data standardization rules
  • Metadata and data inventory management techniques
  • A common data-driven methodology

Specific Data Quality Improvement Practices

Data Profiling

This is done to uncover your data defects through data profiling or data archaeology. This is the process of analyzing the data for correctness, completeness, uniqueness, consistency, and reasonability.

Data Cleansing

After the extent of “dirty data” is known, the easiest place to start the data quality improvement process is by cleansing operational data at the time it is moved into DW databases where it is used for cross-organizational reporting.

Data Defect Prevention

To prevent future “dirty data” from being entered identifying the root causes for the data defects is expedient. The owners of the operational systems should plan to improve their programs and edit checks, unless the effort is unreasonably high.

Establishing Data Governance Group

A data governance group should be established at the enterprise level, which should be staffed with data administrators, metadata administrators, and data quality stewards:

CONCLUSION

• Poor data breeds mistrust internally and externally, enabling bad decisions and wastes precious financial resources and human efforts.

RECOMMENDATIONS
• Organizations need to be fully educated on the impacts that data quality and how to implement a data management strategy to avoid suffering from information crisis.

• A data governance group should be established at the enterprise level, which should be staffed with data administrators, metadata administrators, and data quality stewards.

CA 3: ASSOCIATION ANALYSIS

Q1: Lift Analysis
Please calculate the following lift values for the table correlating burger and chips below:
• Lift(Burger, Chips)
• Lift(Burgers, ^Chips)
• Lift(^Burgers, Chips)
• Lift(^Burgers, ^Chips)
• Please also indicate if each of your answer would suggest independent, positive correlation, or negative correlation?
Chips        ^Chips       Total Row
Burgers         600            400           1000
^Burgers       200            200            400
Total Column 800           600          1400

SOLUTION TO QUESTION 1
1a. Lift (Burgers, Chips)
s(Burgers u Chips) = 600/1400 = 0.428
s(Chips) = 800/1400 = 0.571
s(Burgers) = 1000/1400 = 0.714
LIFT(Burgers, Chips) = 0.428/(0.714 * 0.571) = 1.049
DECISION: Since Lift Correlation between (Burger, Chips) > 1
Therefore, Burgers and Chips are Positively Correlated

1b. Lift (Burgers, ^Chips)
s(Burgers u ^Chips) = 400/1400 = 0.285
s(^Chips) = 600/1400 = 0.428
s(Burgers) = 1000/1400 = 0.714
LIFT(Burgers, ^Chips) = 0.285/(0.428* 0.714) = 0.932

DECISION: Since Lift Correlation between Burger and ^Chips < 1
Therefore, Burgers and ^Chips are Negatively Correlated

1c. Lift(^Burgers, Chips)
s(^Burgers u Chips) = 200/1400 = 0.142
s(Chips) = 800/1400 = 0.571
s(^Burgers) = 400/1400 = 0.285
Lift (^Burgers, Chips) = 0.142/(0.571 * 0.285) = 0.872

DECISION: Since Lift Correlation between ^Burger and Chips 1
Therefore, ^Burgers and ^Chips are Positively Correlated

Q2:
Please calculate the following lift values for the table correlating shampoo and ketchup below:
• Lift(Ketchup, Shampoo)
• Lift(Ketchup, ^Shampoo)
• Lift(^Ketchup, Shampoo)
• Lift(^Ketchup, ^Shampoo)
Please also indicate if each of your answers would suggest independent, positive correlation, or negative correlation?

Shampoo                ^Shampoo                Total Row
Ketchup                       100                           200                         300
^Ketchup                    200                           400                         600
Total Column              300                           600                        900

SOLUTION TO QUESTION 2
2a. Lift (Ketchup, Shampoo)
s(Ketchup u Shampoo) = 100/900 = 0.111
s(Ketchup) = 300/900 = 0.333
s(Shampoo) = 300/900 = 0.333
Lift (Ketchup, Shampoo) = 0.111/(0.333*0.333) = 1.001

DECISION: Since Lift Correlation between Ketchup and Shampoo = 1
Therefore, Ketchup and Shampoo are Independent

2b. Lift(Ketchup, ^Shampoo)
s(Ketchup u ^Shampoo) = 200/900 = 0.222
s(Ketchup) = 300/900 = 0.333
s(^Shampoo) = 600/900 = 0.666
Lift (Ketchup, ^Shampoo) = 0.222/(0.333*0.666) = 1.001

DECISION: Since Lift Correlation between Ketchup and ^Shampoo = 1
Therefore, Ketchup and ^Shampoo are Independent

2c. Lift (^Ketchup, Shampoo)
s(^Ketchup u Shampoo) = 200/900 = 0.222
s(^Ketchup) = 600/900 = 0.666
s(Shampoo) = 300/900 = 0.333
Lift (^Ketchup, Shampoo) = 0.222/(0.666*0.333) = 1.001

DECISION: Since Lift Correlation between ^Ketchup and Shampoo =1
Therefore, ^Ketchup and Shampoo are Independent

2d. Lift(^Ketchup, ^Shampoo)
s(^Ketchup u ^Shampoo) = 400/900 = 0.444
s(^Ketchup) = 600/900 = 0.666
s(^Shampoo) = 600/900 = 0.666
Lift (^Ketchup, ^Shampoo) = 0.444/(0.666*0.666) = 1.001

DECISION: Since Lift Correlation between ^Ketchup and ^Shampoo = 1
Therefore, ^Ketchup and ^Shampoo are Independent

Q3: Chi Squared Analysis
Please calculate the following chi squared values for the table correlating burger and chips below (Expected values in brackets).
• Burgers & Chips
• Burgers & Not Chips
• Chips & Not Burgers
• Not Burgers and Not Chips
For the above options, please also indicate if each of your answer would suggest independent, positive correlation, or negative correlation?

Chips                  ^Chips                Total Row
Burgers                       900 (800)             100 (200)                  1000
^Burgers                     300 (400)             200 (100)                    500
Total Column            1200                       300                            1500

SOLUTION TO QUESTION 3
3a Burgers & Chips
Using formulae Chi-squared (χ2 )= ∑ (observed-expected) 2/ (expected)
Therefore, for Chi-square (Burger and Chips) = ∑ (900-800) 2/ (800)
Chi-square (Burger and Chips) = (100) 2/800
= 12.5

DECISION: Burgers & Chips are correlated because χ2 > 0;
Since Observed value (900) > Expected Value (800), Burgers & Chips are Positively Correlated

3b. Burgers & Not Chips
Using formulae Chi-squared (χ2 )= ∑ (observed-expected) 2/ (expected)
Therefore, for Chi-square (Burger and Not Chips) = ∑ (100-200) 2/ (200)
Chi-square (Burger and Not Chips) = (-100) 2/200
= 50

DECISION: Burgers & Not Chips are correlated because χ2 > 0
Since Observed value (100) < Expected Value (200), Burgers & Not Chips are Negatively Correlated

 

3c. Chips & Not Burgers Using formulae Chi-squared = ∑ (observed-expected) 2/ (expected) Therefore, for Chi-square (Chips and Not Burgers) = ∑ (300-400) 2/ (400) Chi-square (Chips and Not Burgers) = (-100) 2/400 = 25

DECISION: Chips and Not Burgers are correlated because χ2 > 0
Since Observed value (300) < Expected Value (400), Chips and Not Burgers are Negatively Correlated

 

3d. Not Burgers and Not Chips Using formulae Chi-squared = ∑ (observed-expected) 2/ (expected) Therefore, for Chi-square (Not Burgers and Not Chips) = ∑ (200-100) 2/ (100) Chi-square (Not Burgers and Not Chips) = (100) 2/100 = 100

DECISION: Not Burgers and Not Chips are correlated because χ2 > 0
Since Observed value (200) > Expected Value (100), Not Burgers and Not Chips are Positively Correlated

Q4: Chi Squared Analysis
Please calculate the following chi squared values for the table correlating burger and sausages below (Expected values in brackets).
• Burgers & Sausages
• Burgers & Not Sausages
• Sausages & Not Burgers
• Not Burgers and Not Sausages
For the above options, please also indicate if each of your answer would suggest independent, positive correlation, or negative correlation?

Sausages                  ^Sausages                         Total Row
Burgers                800 (800)                  200 (200)                             1000
^Burgers              400 (400)                   100 (100)                              500
Total Column     1200                             300                                      1500

SOLUTION TO QUESTION 4
4a. Burgers & Sausages
Using formulae Chi-squared ( χ2) = ∑ (observed-expected) 2/ (expected)
Therefore, for Chi-square (Burgers and Sausage) = ∑ (800-800) 2/ (800)
Chi-square (Burgers and Sausage) = (0) 2/800
= 0

DECISION: Burgers and Sausage are not correlated because χ2 = 0
Since Observed value (800) = Expected Value (800), Therefore, Burgers and Sausage are Independent

4b. Burgers & Not Sausages
Using formulae Chi-squared ( χ2) = ∑ (observed-expected) 2/ (expected)
Therefore, for Chi-square (Burgers and Not Sausage) = ∑ (200-200) 2/ (200)
Chi-square (Burgers and Not Sausage) = (0) 2/200
= 0

DECISION: Burgers and Not Sausage are not correlated because χ2 = 0
Since Observed value (200) = Expected Value (200), Therefore, Burgers and ot Sausage are Independent

4c. Sausages & Not Burgers
Using formulae Chi-squared ( χ2) = ∑ (observed-expected) 2/ (expected)
Therefore, for Chi-square (Sausage and Not Burger) = ∑ (400-400) 2/ (400)
Chi-square (Sausage and Not Burger) = (0) 2/400
= 0

DECISION: Sausage and Not Burger are not correlated because χ2 = 0
Since Observed value (400) = Expected Value (400), Therefore, Sausage and Not Burger are Independent

4d. Not Burgers and Not Sausages
Using formulae Chi-squared ( χ2) = ∑ (observed-expected) 2/ (expected)
Therefore, for Chi-square (Not Burgers and Not Sausages) = ∑ (100-100) 2/ (100)
Chi-square (Not Burgers and Not Sausages) = (0) 2/100
= 0

DECISION: Not Burgers and Not Sausage are not correlated because χ2 = 0
Since Observed value (100) = Expected Value (100), Therefore, Not Burgers and Not Sausage are Independent

Q5:
Under what conditions would Lift and Chi Squared analysis prove to be a poor algorithm to evaluate correlation/dependency between two events?

ANSWER TO QUESTION 5A
Lift and Chi Squared analysis are not the best algorithms to use when there are Null transactions.

Please suggest another algorithm that could be used to rectify the flaw in Lift and Chi Squared?

ANSWER TO QUESTION 5B
To rectify flaws in Lift and Chi Square, algorithms that can be used are – Kulczynski, AllConf, Jaccard, Cosine, MaxConf.

CONTINUOUS ASSESSMENT 2. TRY R AND SPEED DATING ATTRIBUTES

TRY R
Tutorial on R was done on the link: http://tryr.codeschool.com/
On completion of the tutorial a congratulatory message was displayed to me on: http://www.oreilly.com/data/try-r/congrats.html
Message displayed is as below:
Congratulations on the completion of Try R!
Continue learning about R with these books from O’Reilly.
Save 50% on ebooks, and 40% on print books from O’Reilly. Use discount code TRYR45 in the shopping cart. Offer does not apply “Print & Ebook” bundle pricing.

The badge below was also e mailed to me as evidence of completion. The link is https://www.codeschool.com/users/wajutope

publication3

SPEED DATING ATTRIBUTES
Step 1: Packages were installed on RStudio as:
# This is a code block
install.packages(‘ggplot2’) # Data visualization
install.packages(‘readr’) # CSV
install.packages(‘corrplot’)
#library(rattle)
install.packages(‘rpart’)
install.packages(‘rpart.plot’)
install.packages(‘RColorBrewer’)

The packages installed were loaded as:
# This is a code block
library(‘ggplot2’) # Data visualization
library(‘readr’) # CSV
library(‘corrplot’)
#library(‘rattle’)
library(‘rpart’)
library(‘rpart.plot’)

Step2
The data set for Dating was downloaded from the Speed Dating Experiment. The file was extracted from the Zipped folder and placed inside another folder on the desktop of my PC.
Kernel for Speed-Dating Attributes was used
Using the query below, data for Dating was run and uploaded on RStudio with sub-divisions as Male and Female,
dating # dividing between men and women
fem mal

Note: To run and upload the data on R Console, the above query was selected and “session” on the tool bar was clicked, “set working directory” was clicked. Then the directory for the data was chosen- in this case, the folder was extracted from zipped folder and saved on my desktop.

TESTING WHETHER THE ATTRIBUTES GO TOGETHER
To test whether the attributes go together, the following query was run on RStudio:
# Do the attributes go together?
kor corrplot(kor, method = “color”, order = “hclust”)
The result for the attributes is as below:

publication1

TESTING WHETHER THE ATTRIBUTES DIFFER DEPENDING ON GENDER
To know whether the attributes differ depending on gender, the following was run on RStudio:

publication1

The above shows that there is no difference in the attributes based on gender.

KNOWING THE MAXIMUM
To know the maximum, the following was run on RStudio:
max(abs(korM – korF))
The result shows that maximum is: [1] 0.07305632

DECISION TREE
To know what attributes explain the decision of a participant if he/she wants to see their partner again. The following was run on RStudio:

test_dat train_dat

correct print(table(prediction, corr_answer))

cat(“\n Correctly predicted: \n”)
(table(prediction, corr_answer)[1,1] +
table(prediction, corr_answer)[2,2])/length(corr_answer)
}

fit_both #fancyRpartPlot(fit_both)
plot(fit_both, mar = c(0,0.1,0,0))
text(fit_both)
The result shows the diagram below:

publication1

To predict both, the following was run on RStudio:
pred_both correct(pred_both, test_dat$dec_o)
The following result was generated:
corr_answer
prediction 0 1
0 949 263
1 213 575

Correctly predicted:
[1] 0.762

KNOWING WHETHER MEN AND WOMEN DIFFER AND WHETHER THE DIFFERENT ATTRIBUTES PLAY A ROLE
To know whether men and women differ and whether the different attributes play a role, the following was run on RStudio:
test_fem train_fem

test_mal train_mal

fit #fancyRpartPlot(fit)
plot(fit, mar = c(0,0.1,0,0))
text(fit)

The result below was generated

publication1

The following prediction was run on R Console:
pred

correct(pred, test_fem$dec_o)

The result shows that:
prediction 0 1
0 405 110
1 107 378

Correctly predicted:
[1] 0.783

The following was also run on RStudio:

fit2 #fancyRpartPlot(fit2)
plot(fit2, mar = c(0,0.1,0,0))
text(fit2)

The result generated is as below:

publication1

PREDICTION 2:

The following was run on R:

pred2 correct(pred2, test_mal$dec_o)
The below result was generated:

prediction 0 1
0 564 158
1 100 178

Correctly predicted:
[1] 0.742

INFORMATION GLEANED FROM THE DATA SET
There was no difference in the attributes based on gender. However, some attributes of male and female play a role in dating.
OTHER IDEAS THAT COULD BE REPRESENTED VIA R GRAPHICS
More data can be included to ascertain whether the correlation would be same or similar to the speed dating data
Researchers in marriage psychology can use the data to decide on the patterns of dating among male and female based on other attributes like ethnic group, occupation, age etc

CA 1: Fusion Table and Irish Population Heatmap Using 2011 Census data

Irish Population Heatmap Using 2011 Census data
Google drive is required for creating the fusion table. The first step was to find and import the raw data on Irish Population in the 2011 census. This was found at CSO websitehttp://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/

Data in the CSO link (above) was copied and pasted into Microsoft Excel table to make the raw data conform to the format for use in Fusion Tables. Also, to create the heat map, Irish KMZ data file was used. This was found at the Independent website: http://www.independent.ie/editorial/test/map_lead.kml

Raw data on Irish population in 2011 was cleaned by:

    • deleting empty rows,
    • Counties with total population were used while rows comprising of sub-countiese.g Dublin City, Finglas, Dun Laoghaire and South Dublin were deleted. Also for Galway, Cork and Limeric. North and South Tipperary were merged as a single county (Tipperary).
    • Ulster was removed because it comprises of places like Cavan, Donegal and Monaghan which are Counties in Ireland.
    • Connacht was deleted because it comprises of counties like Galway, Leitrim, Mayo, Roscommon and Sligo.
    • Munster was deleted.
    • The last data on “State” was deleted since it is the sum of the population. Therefore, only 26 Counties were used.

The cleaned table was imported to the Fusion table using the link : https://support.google.com/fusiontables/answer/2571232. The table was filtered to reduce the four (4) columns Province, Male, Female and Total to two (2), namely; Total Persons and County
Also, KML data was generated through http://www.independent.ie/editorial/test/map_lead.kml and imorted into Google Fusion Table.
Data table for Irish population in 2011 was then merged with the geographic KML information and the new table was Visualized on a map. The generated map was converted to the heat map of Ireland. The map was styled by clicking on “Change feature styles”, “Fill colour” and adapting the colours and population number ranges in “Buckets” to show a proper heatmap Legend.

PIE CHART REPRESENTATION OF TOTAL POPULATION OF IRELAND BASED ON COUNTY

It can be deduced from the above pie chart that Co.Dublin has the highest population (43.9%), followed by Cork (17.9%). The high population in Dublin can be attributed to the fact that Dublin and Cork have facilities like sea port, international airport and educational institutions and tourism makes which attracts the high population. However, Dublin is the capital city of Ireland and the commercial hub where most offices are located. Thus it has the highest population.

POPUULATION GRAPH OF IRELAND BASED ON SEX AND TOTAL PERSONS

Based on sex, Number of females in Ireland are slightly higher than their male counterpart. However, it can still be confirmed from the graph above that the highest population of people live in Dublin, followoed by Cork, then Co.Kildare.