========================================================

Overview

This analysis report describes which variables are important for wine quality. wineQualityReds.csv consists of 13 columns (a row index plus the 12 attributes listed below) and 1599 rows, and each row represents one wine.

Understanding the meaning of each column

Before the analysis, we need to understand what each column means.

  1. fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  2. volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
  3. citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  4. residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
  5. chlorides: the amount of salt in the wine
  6. free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  7. total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  8. density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
  10. sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
  11. alcohol: the percent alcohol content of the wine
  12. quality (score between 0 and 10)

Cleaning data

quality column issue

The quality column only ranges from 3 (min) to 8 (max). To make it easier to read, we subtract 2 from every score so that it runs from 1 (min) to 6 (max). The summaries below show the column before and after the shift.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   3.636   4.000   6.000
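The shift described above can be sketched as follows. This is a minimal illustration, not the report's original code: the small `wine` data frame here is made up so that the snippet runs on its own, and only the subtraction of 2 is taken from the text.

```r
# Toy stand-in for the real data: these scores are made up for illustration.
wine <- data.frame(quality = c(3L, 5L, 6L, 5L, 8L, 4L, 7L))

# Shift the observed range 3..8 down to 1..6 by subtracting 2 from every score.
wine$quality <- wine$quality - 2L

# Summary of the shifted column.
summary(wine$quality)
```

Because every score moves by the same amount, the shape of the distribution is unchanged; only the labels differ.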

Quality as a factor

Convert quality from an integer to a factor, which is more convenient for the grouped analysis below. The number of wines at each level is:

##   1   2   3   4   5   6 
##  10  53 681 638 199  18
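A conversion like this could look like the sketch below. The data frame is a made-up stand-in (the real counts are the ones printed above), and the explicit `levels = 1:6` is an assumption that keeps all six levels even if one happens to be empty.

```r
# Made-up shifted quality scores, for illustration only.
wine <- data.frame(quality = c(1L, 3L, 4L, 3L, 6L, 2L, 5L, 3L, 4L))

# Convert the integer scores to a factor with the six levels 1..6.
wine$quality <- factor(wine$quality, levels = 1:6)

# Count how many wines fall into each quality level.
counts <- table(wine$quality)
print(counts)
```

With quality stored as a factor, plotting and grouping functions treat each score as a separate category rather than as a continuous number.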

Univariate Plots Section

Basic distribution of wine/quality

Here is the basic distribution of wines by quality, before we analyze several variables together.

  • Most wines have quality 3 or 4
  • Only a small number of wines have quality 5 or 6
    • Quality 1 : 10
    • Quality 2 : 53
    • Quality 3 : 681
    • Quality 4 : 638
    • Quality 5 : 199
    • Quality 6 : 18

Basic bar chart of wine/pH

Basic distribution of pH, before we analyze acidity further.

  • We will revisit the relationship between quality and each type of acidity

Histogram of chlorides

This histogram of chlorides shows their basic distribution.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram of sulphates

This histogram of sulphates shows their basic distribution.

## Warning: Ignoring unknown parameters: binwidt
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
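The warning above suggests that the bin width argument was misspelled as `binwidt`, so ggplot2 ignored it and fell back to 30 bins. A minimal sketch of a histogram with the argument spelled out is shown here; the data values and the bin width of 0.05 are assumptions chosen for illustration, not values from the report.

```r
library(ggplot2)

# Made-up sulphates values standing in for wine$sulphates.
wine <- data.frame(
  sulphates = c(0.45, 0.52, 0.56, 0.58, 0.60, 0.62,
                0.65, 0.70, 0.74, 0.82, 0.95, 1.20)
)

# Spelling the argument as `binwidth` lets geom_histogram() use it,
# so the "using bins = 30" message is no longer printed.
p <- ggplot(wine, aes(x = sulphates)) +
  geom_histogram(binwidth = 0.05) +
  xlab('Sulphates')
print(p)
```

A smaller `binwidth` shows more detail in the bulk of the distribution, while a larger one makes the long right tail easier to see.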


Univariate Analysis

Structure of dataset

wineQualityReds.csv consists of about 1600 wines with 12 attributes each. Most of the wines have quality 3 or 4.

Analysis / distribution of wine’s quality

  • Most wines have quality 3 or 4
  • Every other quality level has far fewer wines than quality 3 and 4

Boxplot of pH/quality

  • Most wines fall in the pH range 3.2 to 3.4

Histogram of chlorides

  • Most wines have chlorides below 0.1

Boxplot of sulphates

  • Many wines have sulphates below 0.75

Bivariate Plots Section

Some attributes have not been analyzed yet: chlorides, sulphates, alcohol and density.

Summary of chlorides as per quality

According to summary() of chlorides by quality, chlorides appear to be related to quality.

Summary of sulphates as per quality

According to summary() of sulphates by quality, sulphates appear to be related to quality.

Relationship between chlorides and sulphates

Here is the relationship between chlorides and sulphates. There appear to be some outliers.

  • Most wines have sulphates < 1.0
  • Most wines have chlorides < 0.15

Before we remove the outliers, chlorides and sulphates appear to have a weak positive relationship.

cor.test(wine$chlorides, wine$sulphates)
## 
##  Pearson's product-moment correlation
## 
## data:  wine$chlorides and wine$sulphates
## t = 15.978, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3282127 0.4127694
## sample estimates:
##       cor 
## 0.3712605

After we remove the outliers, however, the scatter plot shows no particular relationship between chlorides and sulphates, and it is hard to see any difference by quality. This will be revisited in ‘Final plot’.

Recalculating the correlation without the outliers confirms that chlorides and sulphates have no significant linear relationship.

wine2 = subset(wine, (chlorides < 0.15) & (sulphates < 1))

cor.test(wine2$chlorides, wine2$sulphates)
## 
##  Pearson's product-moment correlation
## 
## data:  wine2$chlorides and wine2$sulphates
## t = -0.47664, df = 1496, p-value = 0.6337
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06293036  0.03834876
## sample estimates:
##        cor 
## -0.0123224

Distribution as per volatile.acidity and quality

As the plot below shows, volatile.acidity has a negative relationship with quality.

Distribution as per fixed.acidity and quality

As the plot below shows, fixed.acidity has no particular relationship with quality.

Distribution as per citric.acid and quality

As the plot below shows, citric.acid has a positive relationship with quality.

Distribution as per volatile.acidity/citric.acid and quality

Since volatile.acidity and citric.acid each have a clear relationship with quality (negative and positive, respectively), we can put them together in one plot.

  • As volatile.acidity has a negative relationship, it is multiplied by -1 so that better wines point to the upper right, which is easier to read

  • A boxplot is added to show how citric.acid moves as quality increases

Relationship between alcohol and density

As alcohol increases, density decreases.

cor.test(wine$alcohol, wine$density)
## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

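To check whether the same negative relationship holds within each quality level, we group the wines by quality.

  • The same negative relationship appears at every quality level, although for quality 1 (only 10 wines) it is not statistically significant (p = 0.15)

To see this more clearly, each quality group is examined separately.

Correlation of alcohol/density, calculated for each quality level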

w6 = subset(wine, quality == 6)
print('correlation alcohol/density of Quality 6')
## [1] "correlation alcohol/density of Quality 6"
cor.test(w6$alcohol, w6$density)
## 
##  Pearson's product-moment correlation
## 
## data:  w6$alcohol and w6$density
## t = -2.5897, df = 16, p-value = 0.01975
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8058649 -0.1026367
## sample estimates:
##       cor 
## -0.543465
w5 = subset(wine, quality == 5)
print('correlation alcohol/density of Quality 5')
## [1] "correlation alcohol/density of Quality 5"
cor.test(w5$alcohol, w5$density)
## 
##  Pearson's product-moment correlation
## 
## data:  w5$alcohol and w5$density
## t = -10.038, df = 197, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6668522 -0.4815944
## sample estimates:
##       cor 
## -0.581718
w4 = subset(wine, quality == 4)
print('correlation alcohol/density of Quality 4')
## [1] "correlation alcohol/density of Quality 4"
cor.test(w4$alcohol, w4$density)
## 
##  Pearson's product-moment correlation
## 
## data:  w4$alcohol and w4$density
## t = -16.484, df = 636, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5992936 -0.4903238
## sample estimates:
##        cor 
## -0.5471226
w3 = subset(wine, quality == 3)
print('correlation alcohol/density of Quality 3')
## [1] "correlation alcohol/density of Quality 3"
cor.test(w3$alcohol, w3$density)
## 
##  Pearson's product-moment correlation
## 
## data:  w3$alcohol and w3$density
## t = -7.8593, df = 679, p-value = 1.518e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3561655 -0.2183697
## sample estimates:
##        cor 
## -0.2887623
w2 = subset(wine, quality == 2)
print('correlation alcohol/density of Quality 2')
## [1] "correlation alcohol/density of Quality 2"
cor.test(w2$alcohol, w2$density)
## 
##  Pearson's product-moment correlation
## 
## data:  w2$alcohol and w2$density
## t = -3.3461, df = 51, p-value = 0.001544
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6231186 -0.1739388
## sample estimates:
##       cor 
## -0.424285
w1 = subset(wine, quality == 1)
print('correlation alcohol/density of Quality 1')
## [1] "correlation alcohol/density of Quality 1"
cor.test(w1$alcohol, w1$density)
## 
##  Pearson's product-moment correlation
## 
## data:  w1$alcohol and w1$density
## t = -1.5924, df = 8, p-value = 0.15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8558534  0.2011769
## sample estimates:
##        cor 
## -0.4905907
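The six near-identical blocks above could also be written more compactly. The sketch below is one possible alternative, not the report's code: it uses a tiny made-up data frame with only two quality levels so that it runs on its own, and it returns just the correlation coefficient rather than the full cor.test() output.

```r
# Made-up values for illustration only; the real analysis uses the full `wine` data.
wine <- data.frame(
  quality = factor(c(3, 3, 3, 3, 4, 4, 4, 4)),
  alcohol = c(9.4, 9.8, 10.5, 11.0, 9.5, 10.0, 11.2, 12.0),
  density = c(0.9978, 0.9970, 0.9962, 0.9955, 0.9980, 0.9968, 0.9958, 0.9949)
)

# Split the rows by quality, then compute Pearson's r within each group.
cor_by_quality <- sapply(
  split(wine, wine$quality),
  function(d) cor(d$alcohol, d$density)
)
print(cor_by_quality)
```

The coefficient alone hides how uncertain the small groups are, so for levels with few wines the confidence intervals from cor.test() remain worth checking.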

This plot shows alcohol against density for each quality level, with a boxplot of alcohol overlaid so that alcohol can be compared across quality levels.

ggplot(wine, aes(x=density, y=alcohol, color=quality)) +
  facet_wrap(~quality) +
  geom_point(alpha=0.5) +
  geom_boxplot(alpha=0.1) +
  ylab('Alcohol [%]') +
  xlab('Density (g / cm^3)')


Bivariate Analysis

Main interest in dataset

We want to analyze how each column affects wine quality, and how some of the columns relate to each other.

  • There are three acidity measures: volatile.acidity, fixed.acidity and citric.acid
  • In the plots above, each acidity measure affects quality differently
  • Relationship between chlorides and sulphates
  • Relationship between alcohol and density

sulphates and quality

Higher sulphates are observed in higher-quality wines.

chlorides and quality

Lower chlorides are observed in higher-quality wines.

Relationship between chlorides and sulphates

  • At first glance, chlorides and sulphates appear to have a weak positive relationship
  • However, after the outliers are removed, there is no particular relationship between them

Hypothesis and initial guess regarding acidity and quality

Before analyzing the ‘wineQualityReds.csv’ data, my initial guess was as follows:

  • As acidity increases, quality increases
  • Once acidity exceeds a certain level, its effect turns negative

However, the effect on quality differs for each acidity measure, and one of them does not appear to affect quality at all.

Analyzed relationship between quality and each acidity

Contrary to the initial guess, each acidity measure relates to quality differently.

  • volatile.acidity has a negative relationship with quality
  • fixed.acidity does not appear to affect quality
  • citric.acid has a positive relationship with quality

Analyzed distribution/relationship as per volatile.acidity/citric.acid and quality

  • At every quality level, volatile.acidity and citric.acid have a negative relationship

Unusual distribution of fixed.acidity/quality

The other acidity measures (volatile.acidity and citric.acid) show a clear relationship with quality. Although fixed.acidity is also an acidity measure, it does not appear to be related to quality.

Relationship between alcohol and density

  • Alcohol and density have a negative relationship
  • Quality 6 wines have higher alcohol than the other quality levels

Multivariate Plots Section

In the previous bivariate analysis, we could not find a particular relationship between chlorides and sulphates. However, each of them still appears to be related to quality, so we examine that here.

How chlorides and sulphates impact quality

If we group chlorides and sulphates by quality, we can observe a relationship between chlorides/sulphates and quality.

Summarizing chlorides by wine quality shows that higher-quality wines tend to have lower chlorides, and that the spread between min and max narrows at the highest quality.

## [1] "chlorides summary of Quality 6"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04400 0.06500 0.07100 0.06888 0.07600 0.08600
## [1] "chlorides summary of Quality 5"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.06200 0.07200 0.07442 0.08700 0.14300
## [1] "chlorides summary of Quality 4"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03400 0.06800 0.07800 0.07808 0.08600 0.13200
## [1] "chlorides summary of Quality 3"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03900 0.07400 0.08000 0.08245 0.09000 0.14600
## [1] "chlorides summary of Quality 2"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04500 0.06700 0.07900 0.07902 0.08800 0.14700
## [1] "chlorides summary of Quality 1"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06100 0.07700 0.08300 0.09475 0.10700 0.14500
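Per-quality summaries like the ones above can be produced in a single call. The following is a sketch under assumptions: the data frame is made up, holds only two quality levels, and `tapply()` is one of several ways to do this, not necessarily what the report used.

```r
# Made-up chlorides values for two quality levels, for illustration only.
wine <- data.frame(
  quality   = factor(c(1, 1, 1, 2, 2, 2)),
  chlorides = c(0.061, 0.083, 0.145, 0.045, 0.079, 0.147)
)

# Apply summary() to the chlorides of each quality level.
chlorides_by_quality <- tapply(wine$chlorides, wine$quality, summary)
print(chlorides_by_quality)

# A single statistic per level is easier to compare across levels.
median_by_quality <- tapply(wine$chlorides, wine$quality, median)
print(median_by_quality)
```

Replacing `chlorides` with `sulphates` would give the matching table for the other variable.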

In contrast, sulphates get higher as quality gets higher.

summary(wine_6$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7482  0.8200  0.9200
summary(wine_5$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7300  0.7241  0.8100  0.9900
summary(wine_4$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5700  0.6400  0.6569  0.7300  0.9900
summary(wine_3$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3700  0.5200  0.5800  0.5919  0.6400  0.9900
summary(wine_2$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5496  0.5900  0.8600
summary(wine_1$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.5850  0.8600

Distribution of chlorides/sulphates as per quality

According to the two plots above, chlorides and sulphates each appear to be related to quality. This plot shows them together. However, the relationship is still hard to read from the plot, so we will revisit it in ‘Final plot’.

  • As chlorides have a negative relationship with quality, we multiply them by -1
  • This places wines with higher sulphates and lower chlorides toward the upper right

  • The relationship between quality and chlorides/sulphates is still hard to see
  • To look closer, the plot is grouped by quality
  • The distribution appears to differ by quality

  • To show the trend of chlorides/sulphates, a trend line is added with geom_smooth()
  • The slope changes with quality
  • It turns positive at higher quality
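The change in slope can also be checked numerically instead of by eye. The sketch below fits a straight line per quality level with lm() and extracts its slope. Several things here are assumptions: the data are made up, sulphates is regressed on chlorides without the -1 flip, and a linear fit is used although the report does not say which smoothing method geom_smooth() applied.

```r
# Made-up values for two quality levels, for illustration only.
wine <- data.frame(
  quality   = factor(rep(c(2, 5), each = 4)),
  chlorides = c(0.06, 0.08, 0.10, 0.12, 0.06, 0.07, 0.08, 0.09),
  sulphates = c(0.60, 0.56, 0.52, 0.50, 0.66, 0.70, 0.75, 0.80)
)

# Fit sulphates ~ chlorides within each quality level and keep only the slope.
slope_by_quality <- sapply(
  split(wine, wine$quality),
  function(d) unname(coef(lm(sulphates ~ chlorides, data = d))[2])
)
print(slope_by_quality)
```

A sign change in these slopes from one quality level to the next corresponds to the trend line tilting the other way in the faceted plot.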


Multivariate Analysis

Analysis/Distribution of chlorides/sulphates as per quality

The trend of chlorides/sulphates changes with quality. As quality increases, the trend moves from negative to positive.


Final Plot and Summary

Wine quality and volatile.acidity/citric.acid

  • As quality increases, both volatile.acidity and citric.acid fall into a narrower range
  • Quality 3 and 4 have a wide range of volatile.acidity/citric.acid

  • As volatile.acidity has a negative relationship with quality, we multiply it by -1 so that better wines are located toward the upper right

Wine quality and alcohol/density

  • Quality 6 has a higher alcohol percentage
  • Quality 3, 4 and 5 have a wide range of density/alcohol

Wine quality and chlorides/sulphates

  • Now we can see that as quality increases, the relationship between chlorides and sulphates turns positive
  • The six plots below suggest that chlorides and sulphates need to be combined in a certain ratio to get better quality

  • To observe this, we cleaned the data as follows
  • Outliers are removed by keeping only the wines with:
    • ‘1st Qu’ < chlorides < ‘3rd Qu’
    • ‘1st Qu’ < sulphates < ‘3rd Qu’
  • As chlorides have a negative relationship with quality, we multiply them by -1
    • This places wines with lower chlorides and higher sulphates toward the upper right
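The cleaning steps listed above could be sketched as follows. This is an illustration under assumptions: the data are made up, the quartiles are taken over all wines at once (the report does not say whether they were computed overall or per quality level), and `wine_mid` and `neg_chlorides` are hypothetical names.

```r
# Made-up values for illustration only.
wine <- data.frame(
  chlorides = c(0.04, 0.06, 0.07, 0.08, 0.09, 0.10, 0.15, 0.40),
  sulphates = c(0.40, 0.55, 0.60, 0.62, 0.66, 0.70, 0.90, 1.60)
)

# First and third quartiles of each variable.
q_chl <- quantile(wine$chlorides, c(0.25, 0.75))
q_sul <- quantile(wine$sulphates, c(0.25, 0.75))

# Keep only the wines strictly between the quartiles on both variables.
wine_mid <- subset(
  wine,
  chlorides > q_chl[1] & chlorides < q_chl[2] &
    sulphates > q_sul[1] & sulphates < q_sul[2]
)

# Flip the sign of chlorides so that lower chlorides plot further right or up.
wine_mid$neg_chlorides <- -1 * wine_mid$chlorides
print(wine_mid)
```

Keeping only the interquartile range on both variables is a strict filter that discards at least half of the wines, which is worth keeping in mind when reading the final plot.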

Reflection

Ratio of chlorides/sulphates for better quality

  • I hoped to understand more about the ratio of chlorides/sulphates that gives better quality
  • However, the number of higher-quality wines is too small to analyze it

Hard to find a rule for quality

  • We found some factors that make a wine’s quality better
  • It is not easy to identify a particular rule for wine quality
  • Quality seems to be defined by a combination of attributes

Hard to draw the plot I wanted

  • In the Final Plot, I hoped to draw a line connecting the mean/median of each quality level
  • After spending a few hours on it, I was not able to make it work