      The in-class examples are linked here
      to make it easy to follow along...
      
  
      In particular, after each comment is a block of examples -- you'll be able
      to copy-and-paste those into R's console in order to run them. You
      may need to install packages as they appear, but this hasn't caused a
      problem so far.
       
       
      
      As usual, for this assignment, please submit a zipped folder named hw5.zip
      that contains a few files: one named pr1.txt that includes
      your interactive session (or history)
      working through Chapter 12 of the Data Science book
      (string processing to analyze Tweets).
       
      Here is a full list of the files, including some extra-credit options:
       
        -  pr1.txt should be your history (or console)
        interactions for the book's Chapter 12. This adds string-processing
        to your modeling of Twitter feeds -- including extracting retweets, 
        URLs and (optionally) hashtags.
        
  In addition, please include a file named pr1.R, which
        should have your definition for an R function named retweeters,
        which should take a data frame of Tweets as input and output
        a list of the unique source names of retweeted messages. This is the "challenge"
        posed at the end of the chapter.
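        One possible shape for retweeters, as a non-authoritative sketch using base R string
        functions -- it assumes the Tweet bodies live in a column named text, which may not
        match the column names in the book's data frame:

```r
# Sketch of retweeters() -- assumes the Tweets are in a column named
# `text`; adjust the column name to match your actual data frame.
retweeters <- function(tweets) {
  texts <- tweets$text
  # Keep only messages that look like retweets ("RT @someone: ...")
  rts <- texts[grepl("^RT @", texts)]
  # Extract the source name between "RT @" and the following colon
  sources <- sub("^RT @([^:]+):.*$", "\\1", rts)
  unique(sources)   # the unique source names of retweeted messages
}
```

        The chapter's own approach may differ in details (e.g., how it matches the "RT @" prefix),
        so treat this as one starting point, not the reference solution.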
        
   [Optional] If you'd like to try the suggested extension
        at the end of the chapter, you're welcome to! It's totally optional
        and is worth up to +5 points of extra credit. It's to write a function, call
        it hashes, that takes in a data frame of Tweets and returns a list
        of all of the unique hashtags found in those Tweets, along with the
        number of times each one was found. The result can be a list of two vectors or,
        perhaps more naturally, a data frame.
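        As a hedged sketch of the data-frame version -- again assuming the Tweets sit in a
        column named text, which is an assumption about your data frame, not a given:

```r
# Sketch of hashes() -- counts every unique #hashtag across all Tweets.
hashes <- function(tweets) {
  # Pull each #hashtag out of each Tweet's text
  tags <- unlist(regmatches(tweets$text,
                            gregexpr("#[[:alnum:]_]+", tweets$text)))
  counts <- table(tags)
  # Return a data frame of hashtags and their counts
  data.frame(hashtag = names(counts),
             count   = as.integer(counts),
             stringsAsFactors = FALSE)
}
```

        The hashtag regular expression here is a simplification (letters, digits, and
        underscores); Twitter's real hashtag rules are looser.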
         
        
  
        -  For problem 2, you will want to return to the Titanic-survivor dataset from
        a few weeks ago. It is linked here as train742.csv.
        For this week's assignment, however, you should use R's tree package
        to create, in pr2.R, a tree-based predictive model for whether or not a passenger
        would have survived the sinking of the Titanic. You should include an R file
        that contains your model, along with a pr2.txt or pr2.doc file
        that describes how you arrived at your model.
        
  
        Your analysis should include some of the important facets we touched on in 
        week 5's class, including
        
          -  a tree that is based on a subset of the Titanic variables (you may use them all,
          cull some away, or convert them to a more suitable form)
          
 
          -  your tree should be checked by cross-validation in order to determine
          how many leaves it could usefully have in order to avoid overfitting the data
          
 
          -  your tree should be pruned to the size you choose (based on the cross-validation
          analysis)
          
 
          -  You should include a table of how many of the 742 test observations are correctly
          (and incorrectly) classified by the tree
          
 -  We will run your tree on both that test data and some additional data
          that's not included in the test set.
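          The workflow those bullets describe might look roughly like the sketch below with the
          tree package. The predictor names (Pclass, Sex, Age, Fare) and the pruned size of 5
          are placeholders -- check train742.csv's actual columns and your own
          cross-validation plot before committing to them:

```r
# Sketch of the week-5 tree workflow; column names are assumptions.
library(tree)

titanic <- read.csv("train742.csv")
titanic$Survived <- as.factor(titanic$Survived)  # classification, not regression

# Fit on a subset of the variables
fit <- tree(Survived ~ Pclass + Sex + Age + Fare, data = titanic)

# Cross-validate to see how many leaves avoid overfitting
cv <- cv.tree(fit, FUN = prune.misclass)
plot(cv$size, cv$dev, type = "b")   # look for the smallest size with low deviance

# Prune to the size the cross-validation suggests (5 here is just an example)
pruned <- prune.misclass(fit, best = 5)

# Confusion table over the 742 observations
table(predicted = predict(pruned, titanic, type = "class"),
      actual    = titanic$Survived)
```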
          
 
         
         
        
  
        -  [Optional]
        The third problem this week is entirely optional (and worth up to +5 points of
        extra credit, similar to the hashes function). It is to build a logistic
        regression model based on the Titanic-survival dataset that results in another
        predictor for Titanic survival.
  
        My hunch is that the tree-based models will work better than the logistic regression, but
        we'll see if that's really true. Plus, there is certainly no requirement that
        your logistic model work better than the tree-based models (but it should work better
        than pure chance!).
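        If you want a starting point, base R's glm() with family = binomial fits a logistic
        regression. The sketch below runs on small synthetic data shaped loosely like the
        Titanic problem -- the column names and the coefficients generating it are invented
        for illustration; swap in train742.csv's real columns for the assignment:

```r
# Toy logistic-regression sketch with glm(); the data here is synthetic.
set.seed(42)
n <- 200
demo <- data.frame(
  Fare = runif(n, 5, 100),
  Sex  = sample(c("male", "female"), n, replace = TRUE)
)
# Survival made (artificially) more likely for higher fares and for women
p <- plogis(-2 + 0.03 * demo$Fare + 1.5 * (demo$Sex == "female"))
demo$Survived <- rbinom(n, 1, p)

lr <- glm(Survived ~ Fare + Sex, data = demo, family = binomial)

# Predicted probabilities, turned into class labels at a 0.5 cutoff
probs <- predict(lr, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)
table(predicted = preds, actual = demo$Survived)
```

        The same pattern -- fit, predict with type = "response", threshold, tabulate -- gives
        you a confusion table comparable to the tree's.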
         
       