Project 3

Project description

Project

R offers many solutions for data frame manipulation. One of the packages providing such solutions is dplyr. However, the syntax of this package is quite different from the syntax of similar solutions in other languages, in particular from the syntax of the pandas package in Python. This project aims to make all constructions offered by the dplyr package available through a syntax similar to that of the pandas package in Python.

Specifically, the main goal of this project is to implement a class pd that supports all typical constructions of the dplyr package with a syntax similar to that of pandas in Python.

First of all, the class pd lets the user define a data set, as in the following example.

### Creating a pd object
fileURL <- "http://michal.ramsza.org/lectures/2_r_programming/data/data_2.csv"
a <- pd$new(data = read.csv(file = fileURL)) 

The object a contains a data set that is itself a data.frame object. The pd class provides a method convert() that toggles the class of the data set between data.frame and tbl_df (tibble). There is also a method value() that returns the data set itself. The following example shows the application of these methods.

### Changing the class of a data set
class(a$value())
a$convert()
class(a$value())
a$convert()
head(a$value())

[1] "data.frame"
[1] "tbl_df"     "tbl"        "data.frame"
  Year_prod    Gas_type Engine_capacity Mileage  Price      Brand
1      2005      Diesel            2500  280000  20990       Audi
2      2009      Diesel            2200  170000  38000      Honda
3      2011      Diesel            1600   96000  22600 Volkswagen
4      2004 Benzyna+LPG            2494  198230  15000        BMW
5      2006      Diesel            2000  256000  16900 Volkswagen
6      2012     Benzyna            4000   67000 140000     Toyota
            Model  Voivodeship             City Negotiation
1           A4 B7 Dolnośląskie        Wałbrzych           0
2     Accord VIII  Mazowieckie            Płock           1
3          Polo V      Śląskie             Żory           0
4     Seria 3 E46  Mazowieckie         Warszawa           1
5       Passat B6      Śląskie            Bytom           1
6 Land Cruiser VI     Lubuskie Kostrzynnad Odrą           1

In the above example, the data set class is changed to tbl_df and then back to data.frame. Finally, the head of the data set is printed.
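The value() and convert() methods could be implemented along the following lines. This is only a minimal sketch under stated assumptions, not the reference solution: the class name pd_convert_sketch, the private field current, and the small inline data frame are hypothetical names chosen for illustration.

```r
library(R6)
library(dplyr)

# Hypothetical sketch: stores a data set and toggles its class
pd_convert_sketch <- R6Class(
    classname = "pd_convert_sketch",
    public = list(
        initialize = function(data) {
            private$current <- as.data.frame(data)
        },
        # Return the stored data set
        value = function() {
            private$current
        },
        # Toggle between data.frame and tbl_df (tibble)
        convert = function() {
            if (inherits(private$current, "tbl_df")) {
                private$current <- as.data.frame(private$current)
            } else {
                private$current <- as_tibble(private$current)
            }
            invisible(self)
        }
    ),
    private = list(current = NULL)
)

# Made-up data, used only to exercise the sketch
b <- pd_convert_sketch$new(data = data.frame(Price = c(500, 400),
                                             Brand = c("Opel", "Fiat")))
class(b$value())
b$convert()
class(b$value())
```

The check with inherits() relies on the fact that a tibble carries the class tbl_df in addition to data.frame, as the output above shows.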

The pd class lets the user apply all constructions of the dplyr package, with a different syntax, through the method op(). The method op() takes a single argument: a dplyr construction written without the data argument. The following example shows a simple application of this method.

### Data operations
a$op(select(Mileage, Price, Brand))
a$op(filter(Mileage < 50000))
a$op(filter(Brand %in% c("Honda", "Fiat")))

### Creating a simple figure
ggplot(data = (a$value()), mapping = aes(x = Price)) +
    geom_histogram(bins = 100, aes(fill = Brand, color = Brand, y = after_stat(density))) +
    geom_density(fill = "blue", alpha = 0.2) + 
    facet_wrap(~ Brand, nrow = 2)

### Writing to file / skip in your solution
dev.copy(device = png, "./fig1.png")
dev.off()

There are three blocks in the above example. The first block of commands manipulates the data set. The second block creates a figure, and the last block (to be skipped in the solution) saves the figure to a file.

Figure 1: Histograms created in the above example
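One possible way to make op() accept a dplyr call without its data argument is to capture the call unevaluated and insert the stored data set as its first argument. The sketch below illustrates this idea under stated assumptions. The class name pd_op_sketch, the field current, and the data frame toy_cars are hypothetical, and this need not be how the described implementation works.

```r
library(R6)
library(dplyr)

# Hypothetical sketch of the op() mechanism
pd_op_sketch <- R6Class(
    classname = "pd_op_sketch",
    public = list(
        initialize = function(data) {
            private$current <- data
        },
        value = function() {
            private$current
        },
        op = function(expr) {
            # Capture the dplyr call without evaluating it
            cl <- substitute(expr)
            # Insert the stored data set as the first argument of the call
            cl <- as.call(append(as.list(cl), list(private$current), after = 1))
            # Evaluate the completed call in the caller's environment
            private$current <- eval(cl, envir = parent.frame())
            invisible(self)
        }
    ),
    private = list(current = NULL)
)

# Made-up data, used only to exercise the sketch
toy_cars <- data.frame(
    Mileage = c(30000, 120000, 45000),
    Price   = c(52000, 18000, 61000),
    Brand   = c("Honda", "Fiat", "Toyota")
)
b <- pd_op_sketch$new(data = toy_cars)
b$op(select(Mileage, Price, Brand))
b$op(filter(Mileage < 50000))
b$op(filter(Brand %in% c("Honda", "Fiat")))
b$value()
```

Evaluating the call in parent.frame() means that dplyr functions and any variables from the user's workspace are looked up where op() was called, not inside the class.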

The class pd lets the user chain dplyr commands and reset the data set to its initial state with the reset() method. The following example shows both of these operations.

### Resetting the data set to the initial state
a$reset()

### Chaining dplyr commands
a$op(select(Mileage, Price, Brand))$op(filter(Mileage < 50000))$op(filter(Brand %in% c("Toyota", "Fiat")))

### Creating histograms
ggplot(data = (a$value()), mapping = aes(x = Price)) +
    geom_histogram(bins = 100, aes(fill = Brand, color = Brand, y = after_stat(density))) +
    geom_density(fill = "blue", alpha = 0.2) + 
    facet_wrap(~ Brand, nrow = 2)

### Writing to file / skip in your solution
dev.copy(device = png, "./fig2.png")
dev.off()

The above example creates the following figure (note the different brands).

Figure 2: Histograms created with operation chaining
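Chaining and reset() can be obtained with two small design choices: every method returns the object itself invisibly, and the constructor keeps an untouched copy of the data. The following is a minimal sketch under stated assumptions. The names pd_reset_sketch, initial, current, and toy_cars are hypothetical and are not taken from the described implementation.

```r
library(R6)
library(dplyr)

# Hypothetical sketch: chaining and resetting
pd_reset_sketch <- R6Class(
    classname = "pd_reset_sketch",
    public = list(
        initialize = function(data) {
            private$initial <- data   # untouched copy used by reset()
            private$current <- data   # working copy modified by op()
        },
        value = function() {
            private$current
        },
        op = function(expr) {
            cl <- substitute(expr)
            cl <- as.call(append(as.list(cl), list(private$current), after = 1))
            private$current <- eval(cl, envir = parent.frame())
            invisible(self)           # returning the object allows $op(...)$op(...)
        },
        reset = function() {
            private$current <- private$initial
            invisible(self)
        }
    ),
    private = list(initial = NULL, current = NULL)
)

# Made-up data, used only to exercise the sketch
toy_cars <- data.frame(
    Mileage = c(30000, 120000, 45000),
    Price   = c(52000, 18000, 61000),
    Brand   = c("Honda", "Fiat", "Toyota")
)
b <- pd_reset_sketch$new(data = toy_cars)
b$op(filter(Brand == "Honda"))
b$reset()
b$op(select(Mileage, Price, Brand))$op(filter(Mileage < 50000))$op(filter(Brand %in% c("Toyota", "Fiat")))
b$value()
```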

The described implementation uses the R6 object model. However, for convenience, I added an S3 print method.

### Resetting the data set
a$reset()

### Computations
a$op(select(Brand, Mileage, Price))$op(filter(Price <= 500))
a$op(arrange(Brand, desc(Price)))

### Print method
a
   Brand Mileage Price
1 Daewoo  309376   500
2   Fiat  145500   400
3   Opel  296000   500
4 Toyota  260000   500
5 Toyota  312415   450
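An S3 print method can be attached to an R6 object because the R6 class name is the first element of the object's class vector. A minimal sketch, assuming a hypothetical class pd_print_sketch with a value() method and a made-up data frame:

```r
library(R6)

# Hypothetical minimal class that only stores a data set
pd_print_sketch <- R6Class(
    classname = "pd_print_sketch",
    public = list(
        initialize = function(data) {
            private$current <- data
        },
        value = function() {
            private$current
        }
    ),
    private = list(current = NULL)
)

# S3 print method dispatched on the R6 class name
print.pd_print_sketch <- function(x, ...) {
    print(x$value(), ...)
    invisible(x)
}

# Made-up data, used only to exercise the sketch
b <- pd_print_sketch$new(data = data.frame(Brand = c("Fiat", "Opel"),
                                           Price = c(400, 500)))
b
```

With this method in place, typing the object's name prints the data set instead of the default R6 listing of fields and methods.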

As you can see, I used the R6 object model. However, you can use any object model you like.
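As an illustration of a different object model, the same interface can be sketched with plain closures and no R6 at all. This is only one possible alternative, shown under stated assumptions. The constructor name new_pd_sketch and the inline data are hypothetical.

```r
library(dplyr)

# Hypothetical constructor: the state lives in the environment
# that encloses the method functions
new_pd_sketch <- function(data) {
    initial <- data
    current <- data
    self <- new.env()
    self$value <- function() current
    self$reset <- function() {
        current <<- initial
        invisible(self)
    }
    self$op <- function(expr) {
        # Capture the call and insert the current data as its first argument
        cl <- substitute(expr)
        cl <- as.call(append(as.list(cl), list(current), after = 1))
        current <<- eval(cl, envir = parent.frame())
        invisible(self)
    }
    self
}

# Made-up data, used only to exercise the sketch
b <- new_pd_sketch(data.frame(Brand = c("Fiat", "Opel", "Toyota"),
                              Price = c(400, 500, 450)))
b$op(filter(Price <= 450))$op(arrange(desc(Price)))
b$value()
```

Here the calls b$op(...) and b$reset() look the same as with R6, but the mutable state is kept through the `<<-` operator in the constructor's environment.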

Technical conditions

The project should be solved in a single R file. The file should contain the solution (implementation) and examples of use (you can use the above examples). Solutions without examples will not be accepted. You are not allowed to use any additional packages except for dplyr, ggplot2 for plots, and R6 if you choose to implement the solution with the R6 object model. Please do not use Polish diacritics (or any other).

Date: 2021-05-04 Tue 00:00

Author: Michał Ramsza

Created: 2021-05-04 Tue 20:34
