AD699 - Data Mining for Business Analytics

Task 1: Association rules

For this portion of the assignment, we will be using data from Groceries, a dataset that can be found with the arules package. Each row in the file represents one buyer’s purchases. This link provides some helpful templated examples for generating association rules.

1. Describe “Groceries” by answering the following questions:

● What is the class of “Groceries”?

● How many rows and columns does Groceries contain?
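
One way to inspect the object is sketched below. It assumes only that the arules package is installed; class() and dim() are base R functions.

```r
# Load arules and the Groceries data that ships with it
library(arules)
data("Groceries")

# Class of the object
class(Groceries)

# Dimensions: first value is rows (transactions), second is columns (items)
dim(Groceries)
```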

2. Generate an item frequency barplot for the grocery items that depicts the 12 most common grocery items from the dataset. Include a screenshot of your results, along with the code you used to do this. Fill the bars with any color of your choice. This plot should be oriented vertically (the default way).
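
A possible starting point for the barplot, using itemFrequencyPlot() from arules. The fill color "steelblue" and the plot title are arbitrary choices for illustration, not part of the assignment.

```r
library(arules)
data("Groceries")

# Vertical bars are the default (horiz = FALSE); topN keeps the 12 most frequent items
itemFrequencyPlot(Groceries, topN = 12, col = "steelblue",
                  main = "12 most frequent grocery items")

# Optional cross-check of the same 12 items as relative frequencies
top12 <- head(sort(itemFrequency(Groceries), decreasing = TRUE), 12)
top12
```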

3. Now, create a subset of rules that contain your grocery item (you can find your item in the spreadsheet in Blackboard). Select any one rule with your item on the left-hand side, and any one rule with your item on the right-hand side, and explain them the way you would explain them to your roommate (I’m assuming your roommate is a smart person who is unfamiliar with data mining). Remember, every rule has four components: support, coverage, confidence, and lift.

For each of your chosen rules (your grocery item on the left-hand side, and your grocery item on the right-hand side), include a screenshot of your rules, along with the code you used to generate the rules.
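
A minimal sketch of generating and filtering the rules. The item "yogurt" and the support and confidence thresholds (0.005 and 0.2) are assumptions chosen for illustration; substitute your assigned item and adjust the thresholds if too few or too many rules come back.

```r
library(arules)
data("Groceries")

my_item <- "yogurt"  # hypothetical placeholder for the assigned item

# Mine rules; minlen = 2 excludes rules with an empty left-hand side
rules <- apriori(Groceries,
                 parameter = list(support = 0.005, confidence = 0.2, minlen = 2))

# Rules with the item on the left-hand side, and on the right-hand side
lhs_rules <- rules[lhs(rules) %in% my_item]
rhs_rules <- rules[rhs(rules) %in% my_item]

# Show the strongest few of each by lift, with their quality measures
inspect(head(sort(lhs_rules, by = "lift"), 3))
inspect(head(sort(rhs_rules, by = "lift"), 3))
```

As a reading aid: support is the share of all transactions containing every item in the rule, coverage is the share containing the left-hand side, confidence is support divided by coverage, and lift compares confidence with how often the right-hand side occurs on its own.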

4. In a sentence or two, explain what meaning these rules might have for a store like Star Market. What could it do with this information?

5. Using the plot() function in the arulesViz package, generate a scatter plot of any three rules involving your grocery item. Include a screenshot of your plot, along with the code you used to generate the plot. Describe your results in a sentence or two.
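
The scatter plot could be produced along these lines. As an assumption for illustration, the item is "yogurt", the thresholds are 0.005 and 0.2, and the three rules are simply the top three by lift; any three rules involving your item would do.

```r
library(arules)
library(arulesViz)
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(support = 0.005, confidence = 0.2, minlen = 2))

# Keep rules that mention the item on either side, then take three of them
item_rules  <- rules[items(rules) %in% "yogurt"]  # "yogurt" is a placeholder
three_rules <- head(sort(item_rules, by = "lift"), 3)

# Scatter plot: support on x, confidence on y, lift as shading
plot(three_rules, method = "scatterplot",
     measure = c("support", "confidence"), shading = "lift")
```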

6. Again using the plot() function in the arulesViz package, generate a plot for any three of your rules. This time, add two more arguments to the function: method="graph", engine="htmlwidget". What do you see now? Include a screenshot of your plot, along with the code you used to generate the plot. Describe your results in a sentence or two.
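
A sketch of the interactive graph version, under the same illustrative assumptions (item "yogurt", thresholds 0.005 and 0.2, top three rules by lift). The htmlwidget engine is assumed here to need the visNetwork package installed alongside arulesViz.

```r
library(arules)
library(arulesViz)
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(support = 0.005, confidence = 0.2, minlen = 2))

item_rules  <- rules[items(rules) %in% "yogurt"]  # "yogurt" is a placeholder
three_rules <- head(sort(item_rules, by = "lift"), 3)

# Items and rules become nodes; arrows run from LHS items into a rule
# and from the rule to its RHS item
graph_widget <- plot(three_rules, method = "graph", engine = "htmlwidget")
graph_widget
```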

Task 2: Classification Tree

1. Bring the dataset Participation from the Ecdat package into your R environment. Use the ? or help() function to learn more about its variables. What does lfp mean?
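
Loading the data and opening its documentation might look like this, assuming the Ecdat package is installed.

```r
library(Ecdat)
data("Participation")

# Open the help page that documents each variable (same as ?Participation)
help("Participation", package = "Ecdat")

# Quick look at variable names and types, and at the outcome's distribution
str(Participation)
table(Participation$lfp)
```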

2. Using your assigned seed value (from Assignment 2), partition your data into training (60%) and validation (40%) sets. Show the step(s) that you used to do this.
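
One common way to make the 60/40 split is to sample row indices, sketched below. The seed 699 is a hypothetical placeholder; use the seed value you were assigned.

```r
library(Ecdat)
data("Participation")

seed_value <- 699  # hypothetical placeholder for the assigned seed
set.seed(seed_value)

# Draw 60% of the row indices for training; the rest become validation
n         <- nrow(Participation)
train_idx <- sample(n, size = floor(0.6 * n))

train <- Participation[train_idx, ]
valid <- Participation[-train_idx, ]

c(training = nrow(train), validation = nrow(valid))
```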

3. Build a tree model with this dataset, using lfp as your outcome variable.

4. Use rpart.plot to display a classification tree that depicts your model. (If the tree model is hard to see or read, that’s okay...but if you wish to change it, you can try changing some settings, such as cex, fallen.leaves, and varlen).

a. Then, adjust the way your model looks. Don’t change anything about the model itself, but use a new combination of values for ‘type’ and/or ‘extra’ in rpart.plot to change the appearance of the tree.

b. Try yet another alternative way of viewing your model. Show your results.

c. Now, write a couple of sentences about what you saw with each of the three graphical versions of your model. Which one do you like best, and why?
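
The model and its three views could be sketched as follows. The seed and the particular type and extra combinations are illustrative assumptions; any valid combination from the rpart.plot documentation works, and none of them changes the fitted model.

```r
library(Ecdat)
library(rpart)
library(rpart.plot)
data("Participation")

set.seed(699)  # hypothetical placeholder for the assigned seed
train_idx <- sample(nrow(Participation), size = floor(0.6 * nrow(Participation)))
train <- Participation[train_idx, ]

# Classification tree with lfp as the outcome and all other variables as inputs
tree <- rpart(lfp ~ ., data = train, method = "class")

# View 1: default appearance
rpart.plot(tree)

# View 2: split labels under the nodes, class probabilities plus % of observations
rpart.plot(tree, type = 2, extra = 104)

# View 3: labels on every branch, class counts plus % of observations
rpart.plot(tree, type = 4, extra = 101, fallen.leaves = FALSE)
```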

6. Describe the split that’s created at your tree’s root node (what variable did it split on, and what rule did it use?). Why is the root node significant?

7. Did all the input variables from the dataset appear in your model diagram? If not, why not?

8. Describe any one rule that your tree generates regarding whether a worker in Switzerland will participate in the labor force. To describe a rule, just trace any path along your tree from the root node to a terminal node.

9. Now, build another tree model. This time, set a complexity parameter of 0, and use minsplit = 2, to make the tree as large as possible. Show what your overfit tree looks like, using rpart.plot. Don’t worry about interpreting this tree; just show it.
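
A sketch of the deliberately overfit tree. The cp and minsplit values come from the task; the seed is a hypothetical placeholder.

```r
library(Ecdat)
library(rpart)
library(rpart.plot)
data("Participation")

set.seed(699)  # hypothetical placeholder for the assigned seed
train_idx <- sample(nrow(Participation), size = floor(0.6 * nrow(Participation)))
train <- Participation[train_idx, ]

# cp = 0 removes the complexity penalty; minsplit = 2 lets any node
# with at least two records be split again
big_tree <- rpart(lfp ~ ., data = train, method = "class",
                  control = rpart.control(cp = 0, minsplit = 2))

# The plot will be crowded; that is expected for a tree of this size
rpart.plot(big_tree)
```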

10. Using five-fold cross-validation, determine the optimal complexity parameter (cp) for a tree model built with your training data. Demonstrate this by showing your cptable and stating which cp value you chose.
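
One way to run the cross-validation is through rpart's own xval setting, as sketched here. Growing the candidate tree with cp = 0 and the default minsplit is an assumption, as is choosing the cp with the lowest cross-validated error (xerror); other selection rules exist.

```r
library(Ecdat)
library(rpart)
data("Participation")

set.seed(699)  # hypothetical placeholder; also fixes the random fold assignment
train_idx <- sample(nrow(Participation), size = floor(0.6 * nrow(Participation)))
train <- Participation[train_idx, ]

# Grow a full tree and let rpart perform five-fold cross-validation
cv_tree <- rpart(lfp ~ ., data = train, method = "class",
                 control = rpart.control(cp = 0, xval = 5))

# Columns: CP, nsplit, rel error, xerror, xstd
printcp(cv_tree)

# Pick the row with the smallest xerror; on ties which.min takes the first row,
# which is the smaller tree because rows are ordered by decreasing CP
cp_table <- cv_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
best_cp
```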

11. Generate a new tree model, with the cp value that you found previously.

12. Use rpart.plot to show your new tree model (the pruned tree). Show this with your preferred “type” and “extra” settings in rpart.plot.
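
Refitting with the chosen cp and plotting the result might look like this. The seed, the lowest-xerror selection rule, and the type = 2, extra = 104 display settings are all illustrative assumptions.

```r
library(Ecdat)
library(rpart)
library(rpart.plot)
data("Participation")

set.seed(699)  # hypothetical placeholder for the assigned seed
train_idx <- sample(nrow(Participation), size = floor(0.6 * nrow(Participation)))
train <- Participation[train_idx, ]

# Five-fold cross-validation on a fully grown tree to find the cp
cv_tree <- rpart(lfp ~ ., data = train, method = "class",
                 control = rpart.control(cp = 0, xval = 5))
best_cp <- cv_tree$cptable[which.min(cv_tree$cptable[, "xerror"]), "CP"]

# New model built with the selected complexity parameter
pruned_tree <- rpart(lfp ~ ., data = train, method = "class",
                     control = rpart.control(cp = best_cp))

rpart.plot(pruned_tree, type = 2, extra = 104)
```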

12a. Create confusion matrices in R to assess the performance of your huge tree against your training and validation sets. How did it perform?

b. Now, create confusion matrices to assess your optimally-sized tree model (the one that you built after cross-validation). How was this optimally-sized model’s performance against the training and validation sets? What happened to the difference between the two accuracy values as you went from the huge tree to the optimal one?

c. Why would it be reasonable to expect that the difference between training set accuracy and validation set accuracy would decrease when using a pruned tree?
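
The comparisons in parts a and b can be sketched with base R's table(). The helper confusion_summary() is a hypothetical name introduced here, the seed is a placeholder, and pruning via prune() at the lowest-xerror cp is one assumed way of obtaining the optimally-sized tree.

```r
library(Ecdat)
library(rpart)
data("Participation")

set.seed(699)  # hypothetical placeholder for the assigned seed
train_idx <- sample(nrow(Participation), size = floor(0.6 * nrow(Participation)))
train <- Participation[train_idx, ]
valid <- Participation[-train_idx, ]

# Huge tree, and a cross-validated tree pruned at the best cp
big_tree <- rpart(lfp ~ ., data = train, method = "class",
                  control = rpart.control(cp = 0, minsplit = 2))
cv_tree  <- rpart(lfp ~ ., data = train, method = "class",
                  control = rpart.control(cp = 0, xval = 5))
best_cp  <- cv_tree$cptable[which.min(cv_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(cv_tree, cp = best_cp)

# Hypothetical helper: confusion matrix plus overall accuracy
confusion_summary <- function(model, data) {
  predicted <- predict(model, newdata = data, type = "class")
  cm <- table(Predicted = predicted, Actual = data$lfp)
  list(matrix = cm, accuracy = sum(diag(cm)) / sum(cm))
}

results <- list(
  big_train    = confusion_summary(big_tree, train),
  big_valid    = confusion_summary(big_tree, valid),
  pruned_train = confusion_summary(pruned_tree, train),
  pruned_valid = confusion_summary(pruned_tree, valid)
)

# Print each confusion matrix, then the four accuracy values side by side
lapply(results, function(r) r$matrix)
sapply(results, function(r) r$accuracy)
```

The gap between training and validation accuracy for each model is the quantity parts b and c ask about.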

