
Big Data Analysis Study Notes (Z-test, Classifiers, and Association Rules)

Task 1 – Hypothesis Testing
To improve student learning performance, a teacher developed two new learning approaches, called “approach1” and “approach2” in short. To analyze the effectiveness of these approaches, the teacher randomly selected N students. For N1 of them, he applied “approach1” and for N2 of them, he applied “approach2”. For the rest (N- N1- N2) students, he applied nothing. After a period of time, the teacher conducted a test on all the N students and evaluated the performance of each student with a performance score. The evaluation result is stored in “A1_performance_test.csv”, which is provided with this assignment. In this task, you will use hypothesis testing to help this teacher to answer the following questions:
1. Whether the two new learning approaches can effectively improve student learning performance?
2. In terms of improving student learning performance, whether the two approaches are significantly different from each other?

To answer these two questions, hypothesis testing is necessary, and a Z-test will be applied in this task.
Can the two approaches effectively improve student learning outcomes?
For this question, approach1 and approach2 can be treated as a single group.
I set the null hypothesis:
H0: approach1 and approach2 do not effectively improve student learning performance.
and the alternative hypothesis:
H1: approach1 and approach2 do effectively improve student learning performance.
The first step is loading the data. In RStudio, I read the csv file to get the original data and split it into two groups: the students with no approach (no_approach) and the students taught with either approach (approach1 | approach2), denoted x and y respectively. For both questions I use the usual confidence level of 0.95.

summary(datas)
        approach     performance
 approach1  :197   Min.   :-23.39
 approach2  :219   1st Qu.: 43.04
 no_approach:184   Median : 68.52
                   Mean   : 68.35
                   3rd Qu.: 91.40
                   Max.   :161.37
The summary command gives a quick overview of the dataset: 197 students studied with approach1, 219 with approach2, and the remaining 184 students with no approach.
library(BSDA)  # the two-sample z.test() used below comes from a package such as BSDA
noapproach <- subset(datas, approach == "no_approach")
Here we get the first group: the scores of the students with no approach.
x <- noapproach[, 2]  # performance scores of the no_approach group
approach12 <- subset(datas, approach == "approach1" | approach == "approach2")
y <- approach12[, 2]  # performance scores of the students taught with either approach
Here we get the second group: the students taught with one of the approaches. Now we can run the Z-test:
z.test(x, y, alternative = "less", mu = mean(x), sigma.x = sd(x),
       sigma.y = sd(y), conf.level = 0.95)  # run the Z-test
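For reference, the statistic that z.test computes here is the usual two-sample Z statistic (a sketch of the formula; the sample standard deviations stand in for the population sigmas, and the value passed as mu plays the role of the hypothesized difference μ0):
$$z = \frac{(\bar{x} - \bar{y}) - \mu_0}{\sqrt{\frac{\sigma_x^2}{n_1} + \frac{\sigma_y^2}{n_2}}}$$
where n1 and n2 are the sizes of the two groups.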
With x and y, the Z-test results as below:
Two-sample z-Test
data: x and y
z = -30.445, p-value < 2.2e-16
alternative hypothesis: true difference in means is less than 40.93637
95 percent confidence interval:
NA -35.19727
sample estimates:
mean of x mean of y
40.93637 80.48179
Mean(x) = 40.93637 is much less than mean(y) = 80.48179. The p-value < 2.2e-16 < 0.05, so we can reject H0.
That means approach1 and approach2 do significantly improve student learning performance.

In terms of improving student learning performance, are the two approaches significantly different from each other?
I set the null hypothesis:
H0: There is no significant difference between approach1 and approach2 in terms of improving student learning outcomes.
and the alternative hypothesis:
H1: Approach1 is significantly less effective than approach2.
The data is read from the file and separated into two groups: x = approach1 and y = approach2.
The code is similar to question 1 and is shown at the end of the report.
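A minimal sketch of that preparation (my own variable names; the full code appears at the end of the report):
approach1 <- subset(datas, approach == "approach1")  # students taught with approach1
approach2 <- subset(datas, approach == "approach2")  # students taught with approach2
x <- approach1[, 2]  # performance scores under approach1
y <- approach2[, 2]  # performance scores under approach2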
The result is shown below:

z.test(x, y, alternative = "less", mu = mean(x), sigma.x = sd(x),
       sigma.y = sd(y), conf.level = 0.95)

Two-sample z-Test
data: x and y
z = -27.979, p-value < 2.2e-16
alternative hypothesis: true difference in means is less than 77.34459
95 percent confidence interval:
NA -1.061961
sample estimates:
mean of x mean of y
 77.34459  83.30384
Mean(x) = 77.34459 < mean(y) = 83.30384.
The p-value < 2.2e-16 < 0.05, which means we can reject the null hypothesis.
Hence, approach1 is less effective than approach2 in terms of improving student outcomes.
In conclusion, the two new learning approaches do significantly improve students' scores. Comparing approach1 and approach2, approach2 appears more effective than approach1. It is therefore suggested that the teacher improve students' scores by using approach2.

Task 2 – Clustering
    Iris dataset was collected by Sir Ronald Aylmer Fisher, a great mathematician and statistician, in 1936. This dataset has been provided with standard R distribution. Load this dataset into your R workspace and study it. In this task, you will perform clustering on this dataset based on its four attributes of “Sepal.Length”, “Sepal.Width”, “Petal.Length”, and “Petal.Width”.
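A minimal sketch of loading the dataset and taking a first look at it:
data(iris)  # load the built-in iris dataset into the workspace
str(iris)   # 150 observations of four numeric attributes plus the Species factor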

The iris data has two parts.
The first part consists of four numeric features: Sepal.Length, Sepal.Width, Petal.Length and Petal.Width. The second part is the Species label with three classes: setosa, versicolor and virginica, each with 50 records in the dataset. A summary is shown below.

summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500

1.
iris_v <- iris[,-5]  # keep only the four numeric attributes, dropping Species
plot(iris_v)         # draw the scatterplot matrix between the four attributes

Plot2.1: the scatterplot matrix between the four attributes
3. To perform the K-means analysis, I first create a new variable "new" to store the iris data, which helps me prepare it:
new <- iris          # create a new variable to hold the data
new$Species <- NULL  # remove the Species column so the dataset contains only the four numeric attributes
wss <- numeric(15)   # storage for the within-cluster sum of squares for k = 1..15
for(k in 1:15) wss[k] <- sum(kmeans(new, centers = k, nstart = 25)$withinss)  # k is the candidate K value
plot(1:15, wss, type = 'b', xlab = "number of clusters", ylab = "within sum of squares")  # this plot helps find the K value (elbow)
Then the plot can be presented:

Plot2.2: Determine the Number of Clusters
As this plot shows, the x-axis is the number of clusters and the y-axis is the within-cluster sum of squares.
The elbow of the line is what we are looking for: the K value. We pick the elbow at K = 3. Then I run the kmeans algorithm, again with 25 random starts.
km <- kmeans(new, 3, nstart = 25)  # run the k-means algorithm with 25 random starts
km  # show the result
The result is:
K-means clustering with 3 clusters of sizes 62, 38, 50

Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.901613 2.748387 4.393548 1.433871
2 6.850000 3.073684 5.742105 2.071053
3 5.006000 3.428000 1.462000 0.246000

Clustering vector:
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1
[59] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2 2 2 1 1 2
[117] 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2 1

Within cluster sum of squares by cluster:
[1] 39.82097 23.87947 15.15100
(between_SS / total_SS = 88.4 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"
[8] "iter"         "ifault"

It shows that (between_SS / total_SS = 88.4 %)
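This ratio can also be read directly from the fitted object (a small sketch using the km object above):
km$betweenss / km$totss  # proportion of the total sum of squares explained by the clustering, about 0.884 here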
Then I draw the plot with:
plot(new[c("Sepal.Length", "Sepal.Width")], col = km$cluster)                        # plot of sepal, points coloured by cluster
points(km$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 2)  # add the cluster centres

Plot2.3: K-Means clustering analysis on Sepal
plot(new[c("Petal.Length", "Petal.Width")], col = km$cluster)                        # plot of petal, points coloured by cluster
points(km$centers[, c("Petal.Length", "Petal.Width")], col = 1:3, pch = 8, cex = 2)  # add the cluster centres

Plot2.4: K-Means clustering analysis on Petal
However, Plot2.3 and Plot2.4 show that the boundary between the 'black' class and the 'red' class is not very clear. To solve this, I use the k-means algorithm again, but this time I separate the sepal and petal attributes and cluster them independently:
km <- kmeans(new[, 1:2], 3, nstart = 25)  # cluster on Sepal.Length and Sepal.Width only
km <- kmeans(new[, 3:4], 3, nstart = 25)  # cluster on Petal.Length and Petal.Width only
Each run therefore classifies only two attributes at a time, for example Petal.Length and Petal.Width.
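For example, the petal-only run and its plot could be produced as below (a sketch; km_petal is my own name for the second fit):
km_petal <- kmeans(new[, 3:4], 3, nstart = 25)                        # cluster on the two petal attributes only
plot(new[c("Petal.Length", "Petal.Width")], col = km_petal$cluster)   # colour the points by cluster
points(km_petal$centers, col = 1:3, pch = 8, cex = 2)                 # mark the three centroids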
Here are the plots:

Plot2.5: K-Means clustering analysis on Sepal

Plot2.6: K-Means clustering analysis on Petal
As we can see in the plot, the result seems better than before.

(Plots have been shown in Q3.) As these plots show, the clusters are now separated from each other in a better way. Initially, two of the clusters in Plot2.3 did not have a definite boundary between them. After this optimization, the separation improves: the points are spread fairly evenly across the clusters, no cluster has only a few points, and the centroids are far away from each other.
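A quick numeric check of this separation (a sketch of my own, assuming km holds the petal-only fit) is to cross-tabulate the cluster labels against the known species:
table(cluster = km$cluster, species = iris$Species)  # each row should be dominated by a single species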
5.
For hierarchical agglomerative clustering, a two-dimensional array (here: Sepal.Length & Sepal.Width) is created first. Then I use the dist() function to convert the values into Euclidean distances. I use the "mcquitty" linkage to merge the points whose distances in the array are most similar. The hierarchical agglomerative clustering is then done by:
distancearray <- cbind(new$Sepal.Length, new$Sepal.Width)  # preparation for the HAC algorithm: create a 2-dimensional array
out.dist <- dist(distancearray, method = "euclidean")      # convert the values into Euclidean distances
out.hclust <- hclust(out.dist, method = "mcquitty")        # hierarchical clustering with mcquitty linkage ("median", "single", "complete", ... can be used as well)
plot(out.hclust)                                           # draw the dendrogram
rect.hclust(out.hclust, k = 3)                             # based on the dendrogram, separate the classes with rectangles and show them
The result:

Plot2.7: Hierarchical Agglomerative Clustering on the Iris Dataset

This clustering does not look very clear, although it has a definite boundary.
The algorithm classifies the items in a bottom-up fashion. Compared to the K-means algorithm, it takes much more time and computation to finalize the clustering, but the resulting dendrogram presents the cluster structure more clearly.
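For a concrete comparison with the K-means result, flat cluster assignments can also be extracted from the dendrogram (a small sketch using the out.hclust object above):
groups <- cutree(out.hclust, k = 3)  # cut the tree into 3 clusters
table(groups)                        # number of points assigned to each cluster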

Task 3 – Association Rules
Students of different grade, gender, and enrolment took part in a test. The test result “Success” or “Not Success” is recorded for each student and saved in “A1_success_data.csv” provided in this assignment. In this task, you will use association rule to mine interesting relationships between these four attributes.
1.
In this task I need to use the "arules" library (library(arules)).
After reading the csv, I set support = 0.001 as the threshold and confidence = 0.5 to generate frequent itemsets and rules. The Apriori algorithm is then used.
In terms of Association rule, there are four significant parameters to measure rules which are support, confidence, lift and leverage.
Rules:
A rule has the form X -> Y, meaning that when itemset X is observed, Y tends to be observed as well. Here X is the left-hand side (LHS) and Y is the right-hand side (RHS). For example, a rule X -> Y (80%) means that when X occurs, 80% of the time Y occurs as well.
Support:
Each itemset X in the database gets a support value, denoted support(X): support(X) = number of transactions containing X / total number of transactions. To make sure the itemsets we find are meaningful, we set a minimum support parameter as a threshold: an itemset is frequent only when its support >= the minimum support. Note that when an itemset is frequent, every subset of it must be frequent as well. For example, if support({A, B}) = 0.5, then Support(A) >= 0.5 and Support(B) >= 0.5. This downward-closure property is the basis of the Apriori algorithm we shall use.
Confidence:
Confidence is a measure of the certainty or trustworthiness of a rule: Confidence(X -> Y) = support(X ∧ Y) / Support(X).
Lift:
Lift measures how many times more often X and Y occur together than expected if they were statistically independent of each other:
Lift(X -> Y) = support(X ∧ Y) / (Support(X) × Support(Y))
For example, suppose Support(X → Y) = 90%, Confidence(X → Y) = 90% and
Lift(X -> Y) = 1.0.
Although support and confidence of 90% look like a strong relationship, a lift of 1 means that whether X happens has no effect on the appearance of Y (leverage = 0): X and Y are independent of each other, so X -> Y is not a useful rule.
Leverage:
Leverage is another value for judging the relationship: Leverage(X → Y) = support(X ∧ Y) − Support(X) × Support(Y). Leverage = 0 (equivalently, lift = 1) means X and Y are statistically independent of each other.
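As a small worked example with made-up numbers: suppose Support(X) = 0.4, Support(Y) = 0.5 and Support(X ∧ Y) = 0.3. Then Confidence(X → Y) = 0.3 / 0.4 = 0.75, Lift(X → Y) = 0.3 / (0.4 × 0.5) = 1.5, and Leverage(X → Y) = 0.3 − 0.4 × 0.5 = 0.1. Since lift > 1 and leverage > 0, X and Y occur together more often than expected under independence.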
With this background knowledge, the solution to Task 3 is shown below:
library(arules)
library(arulesViz)
data <- read.csv("C:/Users/52441/Desktop/CSCI946/A1_release_2018/A1_success_data.csv")  # read the data
summary(data)  # summarize it
model <- apriori(data, parameter = list(support = 0.001, confidence = 0.5, target = "rules"))  # run the Apriori algorithm
inspect(model)  # show the rules in detail
Output:
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
0.5 0.1 1 none FALSE TRUE 5 0.001 1 10 rules FALSE

Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE

Absolute minimum support count: 2

set item appearances …[0 item(s)] done [0.00s].
set transactions …[10 item(s), 2201 transaction(s)] done [0.00s].
sorting and recoding items … [10 item(s)] done [0.00s].
creating transaction tree … done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing … [139 rule(s)] done [0.00s].
creating S4 object … done [0.00s].
2.
model_rhssuccess <- subset(model, rhs %in% paste0("Success=", unique(data$Success)))  # keep the rules whose RHS is a Success item
inspect(model_rhssuccess)  # show them in detail
Since the result is too long, it will be shown at the end of this report.
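To focus on the strongest of these Success rules first, they can also be sorted by lift before inspecting (a small sketch):
inspect(head(sort(model_rhssuccess, by = "lift"), 10))  # top 10 Success rules by lift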
3.

plot(model@quality)  # scatterplot matrix of support, confidence, lift and count for all rules
This shows the relationships among support, confidence and lift.

Plot3.1: Scatterplot Matrix on The Support, Confidence and Lift.

model_visual_lift <- head(sort(model, by = "lift"), 5)  # find the 5 strongest rules sorted by lift
plot(model_visual_lift, method = "graph")               # show the 5 strongest rules as a graph

Plot3.2: Graph visualization of the top 5 Rules sorted by lift

Based on the above, we can see that there are some useful rules (we pick lift > 3.0 and support > 0.03):
      Left-hand side                             Right-hand side  Support      Confidence  Lift       Count
[43]  {Grade=2nd,Success=Yes}                 => {Sex=Female}     0.042253521  0.7881356   3.6908222   93
[53]  {Grade=1st,Success=Yes}                 => {Sex=Female}     0.064061790  0.6945813   3.2527094  141
[107] {Grade=2nd,Enrol=Undergrad,Success=Yes} => {Sex=Female}     0.036347115  0.8510638   3.9855138   80
[115] {Grade=1st,Enrol=Undergrad,Success=Yes} => {Sex=Female}     0.063607451  0.7106599   3.3280052  140
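These rules can also be extracted programmatically instead of picking them by eye (a sketch using the model object fitted above):
strong_rules <- subset(model, subset = lift > 3.0 & support > 0.03)  # keep only rules with lift > 3 and support > 0.03
inspect(sort(strong_rules, by = "lift"))                             # list them, strongest lift first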