

Posts

Showing posts with the label Python

R and Python: Gradient Descent

One of the problems often dealt with in Statistics is the minimization of an objective function. Unlike linear models, models that are nonlinear in the parameters, such as logistic regression, neural networks, and nonlinear regression models (like the Michaelis-Menten model), have no analytical solution. In this situation, we have to use mathematical programming, or optimization. One popular optimization algorithm is gradient descent, which we're going to illustrate here. To start with, let's consider a simple function with a closed-form solution, given by \begin{equation} f(\beta) \triangleq \beta^4 - 3\beta^3 + 2. \end{equation} We want to minimize this function with respect to $\beta$. The quick solution, as calculus taught us, is to compute the first derivative of the function, that is, \begin{equation} \frac{\text{d}f(\beta)}{\text{d}\beta}=4\beta^3-9\beta^2. \end{equation} Setting this to 0 to obtain the stationary points gives us \begin{align} 4\beta^3-9\beta^2&=0\nonumber\\ \beta^2(4\beta-9)&=0,\nonumber \end{align} so the stationary points are $\beta=0$ and $\beta=9/4$, and the minimizer is $\beta=9/4$.
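To make the excerpt concrete, here is a minimal gradient-descent sketch for this function; the learning rate, starting point, and stopping rule are illustrative choices of mine, not taken from the post.

```python
# Gradient descent on f(beta) = beta^4 - 3*beta^3 + 2.
# Learning rate and starting point are assumed values for illustration.

def grad(beta):
    """First derivative of f: 4*beta^3 - 9*beta^2."""
    return 4 * beta**3 - 9 * beta**2

beta = 4.0      # starting value (assumed)
rate = 0.01     # learning rate (assumed)
for _ in range(10_000):
    step = rate * grad(beta)
    beta -= step
    if abs(step) < 1e-10:  # stop once the updates become negligible
        break

print(beta)  # approaches the minimizer 9/4 = 2.25
```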

R and Python: Theory of Linear Least Squares

In my previous article, we talked about implementations of linear regression models in R, Python, and SAS. On the theoretical side, however, I only briefly mentioned the estimation procedure for the parameter $\boldsymbol{\beta}$. So to help us understand how software does the estimation, we'll look at the mathematics behind it. We will also perform the estimation manually in R and in Python; that is, we're not going to use any special packages, which will help us appreciate the theory.

Linear Least Squares

Consider the linear regression model, \[ y_i=f_i(\mathbf{x}|\boldsymbol{\beta})+\varepsilon_i,\quad\mathbf{x}_i=\left[ \begin{array}{cccc} 1&x_{11}&\cdots&x_{1p} \end{array}\right],\quad\boldsymbol{\beta}=\left[\begin{array}{c}\beta_0\\\beta_1\\\vdots\\\beta_p\end{array}\right], \] where $y_i$ is the response, or the dependent variable, at the $i$th case, $i=1,\cdots, N$. The $f_i(\mathbf{x}|\boldsymbol{\beta})$ is the deterministic part of the model that depends on the parameters $\boldsymbol{\beta}$.
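As a sketch of where this leads (ahead of the excerpt), the least squares estimate turns out to be $\hat{\boldsymbol{\beta}}=(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}$, and it can be computed manually with numpy alone; the toy data below are made up for illustration.

```python
# Manual least squares via the normal equations, no regression packages.
import numpy as np

# Hypothetical toy data: five cases, one predictor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

X = np.column_stack([np.ones_like(x), x])     # design matrix with intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solves (X'X) b = X'y
print(beta_hat)                               # [intercept, slope]
```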

R, Python, and SAS: Getting Started with Linear Regression

Consider the linear regression model, $$ y_i=f_i(\boldsymbol{x}|\boldsymbol{\beta})+\varepsilon_i, $$ where $y_i$ is the response, or the dependent variable, at the $i$th case, $i=1,\cdots, N$, and the predictor, or the independent variable, is the $\boldsymbol{x}$ term defined in the mean function $f_i(\boldsymbol{x}|\boldsymbol{\beta})$. For simplicity, consider the following simple linear regression (SLR) model, $$ y_i=\beta_0+\beta_1x_i+\varepsilon_i. $$ To obtain the (best) estimates of $\beta_0$ and $\beta_1$, we minimize the residual sum of squares (RSS), given by $$ S=\sum_{i=1}^{n}\varepsilon_i^2=\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)^2. $$ Now suppose we want to fit the model to the following data, Average Heights and Weights for American Women, where weight is the response and height is the predictor. The data is available in R by default.
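Before reaching for R, the closed-form SLR estimates can be sketched in Python; the numbers below are the height-weight pairs as distributed in R's built-in women dataset, and the formulas are the standard least-squares solutions.

```python
# Closed-form simple linear regression of weight on height:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1*xbar
import numpy as np

height = np.arange(58, 73)  # heights in inches, from R's `women` dataset
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164])  # weights in lbs

dx = height - height.mean()
b1 = np.sum(dx * (weight - weight.mean())) / np.sum(dx**2)
b0 = weight.mean() - b1 * height.mean()
print(b0, b1)  # approximately -87.52 and 3.45
```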

Parametric Inference: The Power Function of the Test

In Statistics, we model random phenomena and make conclusions about their populations. Consider, for example, an experiment to determine the true average height of the students in a university. Suppose we take a sample from the population of students, and consider testing the null hypothesis that the average height is 5.4 ft against the alternative hypothesis that the average height is greater than 5.4 ft. Mathematically, we can represent this as $H_0:\theta=\theta_0$ vs $H_1:\theta>\theta_0$, where $\theta$ is the true value of the parameter and $\theta_0=5.4$ is the testing value set by the experimenter. Because we only consider a subset (the sample) of the population when testing the hypotheses, we expect to commit errors. To understand these errors, suppose the above test results in rejecting $H_0$ given that $\theta\in\Theta_0$, where $\Theta_0$ is the parameter space of the null hypothesis; in other words, we mistakenly reject $H_0$. In this case we have committed a Type I error.
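Although the excerpt stops at the error types, the power function itself is easy to sketch numerically; the sketch below assumes a one-sided z-test with known $\sigma$, and the values of $\sigma$ and $n$ are hypothetical.

```python
# Power function of the one-sided z-test H0: theta = 5.4 vs H1: theta > 5.4.
from scipy.stats import norm

theta0, sigma, n, alpha = 5.4, 0.5, 30, 0.05  # sigma and n are assumed values
z_crit = norm.ppf(1 - alpha)

def power(theta):
    """P(reject H0 | true mean = theta) for the z-test."""
    shift = (theta - theta0) / (sigma / n**0.5)
    return 1 - norm.cdf(z_crit - shift)

print(power(5.4))  # equals alpha = 0.05 at the boundary of the null
print(power(5.6))  # power grows as theta moves past theta0
```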

Parametric Inference: Likelihood Ratio Test by Example

Hypothesis testing has been used extensively in different disciplines of science. In this post, I will attempt to discuss the basic theory behind one such test, the Likelihood Ratio Test (LRT), defined below following Casella and Berger (2001). Definition. The likelihood ratio test statistic for testing $H_0:\theta\in\Theta_0$ versus $H_1:\theta\in\Theta_0^c$ is \begin{equation} \label{eq:lrt} \lambda(\mathbf{x})=\frac{\displaystyle\sup_{\theta\in\Theta_0}L(\theta|\mathbf{x})}{\displaystyle\sup_{\theta\in\Theta}L(\theta|\mathbf{x})}. \end{equation} A likelihood ratio test (LRT) is any test that has a rejection region of the form $\{\mathbf{x}:\lambda(\mathbf{x})\leq c\}$, where $c$ is any number satisfying $0\leq c \leq 1$. The numerator of equation (\ref{eq:lrt}) is the supremum of the likelihood of the parameter, $\theta$, over the restricted domain (the null hypothesis, $\Theta_0$) of the parameter space $\Theta$, that is, the value that maximizes the joint probability of the sample $\mathbf{x}$ under $H_0$.
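As a worked illustration of the definition (my choice of example, not necessarily the post's), take iid N($\theta$, 1) data with the simple null $H_0:\theta=\theta_0$: the restricted supremum is $L(\theta_0|\mathbf{x})$ and the unrestricted supremum is attained at the MLE $\bar{x}$.

```python
# lambda(x) = L(theta0 | x) / L(xbar | x) for iid N(theta, 1), H0: theta = theta0.
import numpy as np
from scipy.stats import norm

def lrt_statistic(x, theta0):
    """Likelihood ratio for a simple null about a normal mean, unit variance."""
    num = np.prod(norm.pdf(x, loc=theta0, scale=1))    # sup over Theta_0 = {theta0}
    den = np.prod(norm.pdf(x, loc=x.mean(), scale=1))  # sup over Theta, at the MLE
    return num / den

rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1, size=20)
print(lrt_statistic(x, theta0=0.0))  # small values are evidence against H0
```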

Python and R: Basic Sampling Problem

In this post, I would like to share a simple problem about sampling analysis, and I will demonstrate how to solve it using Python and R. The first two problems are originally from the book Sampling: Design and Analysis by Sharon Lohr.

Problems

Let $N=6$ and $n=3$. For purposes of studying sampling distributions, assume that all population values are known: $y_1 = 98$, $y_2 = 102$, $y_3=154$, $y_4 = 133$, $y_5 = 190$, $y_6=175$. We are interested in $\bar{y}_U$, the population mean. Consider the eight possible samples below.

Sample No.   Sample, $\mathcal{S}$   $P(\mathcal{S})$
1            $\{1,3,5\}$             $1/8$
2            $\{1,3,6\}$             $1/8$
3            $\{1,4,5\}$             $1/8$
4            $\{1,4,6\}$             $1/8$
5            $\{2,3,5\}$             $1/8$
6            $\{2,3,6\}$             $1/8$
7            $\{2,4,5\}$             $1/8$
8            $\{2,4,6\}$             $1/8$
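The excerpt's code isn't shown, but tabulating the sampling distribution of $\bar{y}$ for these eight samples takes only a few lines; the sketch below also checks that the design is unbiased for $\bar{y}_U$.

```python
# Enumerate the eight samples, compute each sample mean, and check E[ybar].
y = {1: 98, 2: 102, 3: 154, 4: 133, 5: 190, 6: 175}  # population values
samples = [(1, 3, 5), (1, 3, 6), (1, 4, 5), (1, 4, 6),
           (2, 3, 5), (2, 3, 6), (2, 4, 5), (2, 4, 6)]
p = 1 / 8                                            # P(S) = 1/8 for each sample

means = [sum(y[i] for i in s) / 3 for s in samples]
for s, m in zip(samples, means):
    print(s, round(m, 2))

print("E[ybar] =", sum(p * m for m in means))  # matches the population mean
print("ybar_U  =", sum(y.values()) / 6)        # (98+...+175)/6 = 142
```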

Probability Theory: Convergence in Distribution Problem

Let's solve a theoretical problem in probability, specifically on convergence. The problem below is originally Exercise 5.42 of Casella and Berger (2001), and I just want to share my solution. If there is an incorrect argument below, I would be happy if you could point it out to me.

Problem

Let $X_1, X_2,\cdots$ be iid (independent and identically distributed) and $X_{(n)}=\max_{1\leq i\leq n}X_i$. If $X_i\sim$ beta(1,$\beta$), find a value of $\nu$ so that $n^{\nu}(1-X_{(n)})$ converges in distribution; if $X_i\sim$ exponential(1), find a sequence $a_n$ so that $X_{(n)}-a_n$ converges in distribution.

Solution

Let $Y_n=n^{\nu}(1-X_{(n)})$; we say that $Y_n\rightarrow Y$ in distribution if $$\lim_{n\rightarrow \infty}F_{Y_n}(y)=F_Y(y).$$ Then, $$ \begin{aligned} \lim_{n\rightarrow\infty}F_{Y_n}(y)&=\lim_{n\rightarrow\infty}P(Y_n\leq y)=\lim_{n\rightarrow\infty}P(n^{\nu}(1-X_{(n)})\leq y)\\ &=\lim_{n\rightarrow\infty}P\left(1-X_{(n)}\leq \frac{y}{n^{\nu}}\right). \end{aligned} $$
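Carrying the computation one step further than the excerpt shows: with the beta(1, $\beta$) cdf $F(x)=1-(1-x)^{\beta}$, the limit works out to $1-e^{-y^{\beta}}$ when $\nu=1/\beta$, and this can be checked by simulation (the sample sizes below are arbitrary choices).

```python
# Monte Carlo check that n^(1/beta) * (1 - X_(n)) converges in distribution
# to the cdf 1 - exp(-y^beta) when X_i ~ beta(1, beta).
import numpy as np

beta, n, reps = 2.0, 5_000, 2_000
rng = np.random.default_rng(42)

x_max = rng.beta(1, beta, size=(reps, n)).max(axis=1)  # X_(n) per replication
y_n = n ** (1 / beta) * (1 - x_max)

y = 1.0
print(np.mean(y_n <= y))      # empirical P(Y_n <= 1)
print(1 - np.exp(-y**beta))   # limiting value, 1 - e^{-1}, about 0.632
```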

Python: Getting Started with Data Analysis

Analysis with Programming has recently been syndicated to Planet Python. As a first post as a contributing blog on the said site, I would like to share how to get started with data analysis in Python. Specifically, I would like to do the following:

1. Importing the data
   - Importing a CSV file both locally and from the web;
2. Data transformation;
3. Descriptive statistics of the data;
4. Hypothesis testing
   - One-sample t test;
5. Visualization; and
6. Creating a custom function.

Importing the data

This is the crucial step; we need to import the data in order to proceed with the succeeding analysis. Oftentimes data are in CSV format and, if not, can at least be converted to CSV format. In Python, we can do this using the following code:
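The code itself isn't included in this preview; a sketch of the import step with pandas could look like the following, with the file name and URL as hypothetical placeholders.

```python
# Importing a CSV file both locally and from the web with pandas.
import pandas as pd

df_local = pd.read_csv("data.csv")                    # hypothetical local file
df_web = pd.read_csv("https://example.com/data.csv")  # hypothetical URL
print(df_local.head())                                # inspect the first rows
```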

Probability Theory Problems

Let's have fun with probability theory; here is my first problem set on the said subject.

Problems

1. It was noted that statisticians who follow the deFinetti school do not accept the Axiom of Countable Additivity, instead adhering to the Axiom of Finite Additivity. Show that the Axiom of Countable Additivity implies Finite Additivity.
2. Although, by itself, the Axiom of Finite Additivity does not imply Countable Additivity, suppose we supplement it with the following. Let $A_1\supset A_2\supset\cdots\supset A_n\supset \cdots$ be an infinite sequence of nested sets whose limit is the empty set, which we denote by $A_n\downarrow\emptyset$. Consider the following:

   Axiom of Continuity: If $A_n\downarrow\emptyset$, then $P(A_n)\rightarrow 0$.

   Prove that the Axiom of Continuity and the Axiom of Finite Additivity imply Countable Additivity.
3. Prove each of the following statements. (Assume that any conditioning event has positive probability.) If $P(B)=1$, then $P(A|B)=P(A)$ for any event $A$.
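A quick sketch of the first problem, using the standard padding argument: given pairwise disjoint sets $A_1,\dots,A_n$, set $A_i=\emptyset$ for all $i>n$; then, since $P(\emptyset)=0$, $$P\left(\bigcup_{i=1}^{n}A_i\right)=P\left(\bigcup_{i=1}^{\infty}A_i\right)=\sum_{i=1}^{\infty}P(A_i)=\sum_{i=1}^{n}P(A_i),$$ which is exactly Finite Additivity.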

R and Python Meetups, Philippines

There will be upcoming meetups for the R User Group Philippines and the Python Philippines (PythonPH) Community. Below are the details:

R Meetup
topic: R for SAS users, and planning of RUG activities
venue: 9/F Sun Life Centre, 5th Avenue corner Rizal Drive, Bonifacio Global City, 1634, Taguig
date: Thursday, June 19, 2014, 7:00 pm
outline:
- Introducing R to SAS users; common SAS functions used at PPD - c/o Mark Javellosa;
- group discussion on equivalent packages in R; and,
- sharing of experiences of actual SAS converts.

Questions? Ask here.

Python: Numerical Descriptions of the Data

We are going to explore the basics of Statistics using Python, and we'll go through the following:

- Importing the data;
- Applying summary statistics;
- Other measures of variability (variance and coefficient of variation);
- Other measures of position (percentile and decile);
- Estimating the skewness and kurtosis; and
- As a bonus, visualizing the histogram.

Data -- volume of palay (rice) production from five regions (Abra, Apayao, Benguet, Ifugao, and Kalinga) of central Luzon, Philippines. To import this, execute the code below. To check the first and last five entries of the data, use the head() and tail() methods, respectively; and to apply the summary statistics, use the describe() method.
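The code blocks referenced here aren't shown in the preview; a sketch, assuming the data sit in a CSV file (name hypothetical) with one column per province, might be:

```python
# Import the production data and apply basic summary statistics with pandas.
import pandas as pd

df = pd.read_csv("palay_production.csv")  # hypothetical file name

print(df.head())      # first five entries
print(df.tail())      # last five entries
print(df.describe())  # count, mean, std, min, quartiles, max

print(df.var())                      # variance
print(df.std() / df.mean())          # coefficient of variation
print(df.quantile([0.1, 0.5, 0.9]))  # selected percentiles/deciles
print(df.skew())                     # skewness
print(df.kurt())                     # (excess) kurtosis
```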

Python and R: Is Python really faster than R?

A friend of mine asked me to code the following in R:

1. Generate samples of size 10 from a Normal distribution with $\mu$ = 3 and $\sigma^2$ = 5;
2. Compute $\bar{x}$ and $\bar{x}\mp z_{\alpha/2}\displaystyle\frac{\sigma}{\sqrt{n}}$ using the 95% confidence level;
3. Repeat the process 100 times; then
4. Compute the percentage of the confidence intervals containing the true mean.

So here is what I got. Staying with the default values, one would obtain an output consisting of a list of Matrix and Decision, wherein the first column of the first list (Matrix) is the computed $\bar{x}$; the second and third columns are the lower and upper limits of the confidence interval, respectively; and the fourth column is an array of ones (if the true mean is contained in the interval) and zeros (if not). Now how fast would it be if I were to code this in Python?
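The R code and its output aren't shown in this preview; a Python version of the same exercise might look like the sketch below, using numpy and scipy.

```python
# Coverage of 95% z-intervals for the mean of N(mu=3, sigma^2=5), n=10, 100 reps.
import numpy as np
from scipy.stats import norm

mu, sigma, n, reps, alpha = 3.0, np.sqrt(5.0), 10, 100, 0.05
z = norm.ppf(1 - alpha / 2)
rng = np.random.default_rng(123)

covered = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)
    half_width = z * sigma / np.sqrt(n)  # sigma treated as known, as in the task
    if x.mean() - half_width <= mu <= x.mean() + half_width:
        covered += 1

print(f"{100 * covered / reps}% of the intervals contain the true mean")
```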

Book Review: Learning Geospatial Analysis with Python by Joel Lawhead

I decided to read this book since I've been doing maps using R, hence it is better to learn the literature and science behind mapping and how to do a proper analysis with it. In addition, I would like to see what Python can offer in this discipline. The book has 10 chapters in 364 pages. The first three chapters were a long read, not much on coding, but rather a discussion introducing Geospatial Analysis. Impression: I like the idea that the author spent three chapters talking about the overall story (I would say) of Geospatial Analysis. Just a preview: the first chapter is, of course, the introduction; the second covers the data types, which surprisingly come in a variety of formats; and the third is all about the libraries and packages used in the field. I am familiar with ArcGIS and QGIS, but this book makes you aware of other tools as well. The simple illustrations that complement the discussion are very helpful in telling the overall story of the subject.

Book Review: Practical Data Analysis by Hector Cuesta

I have been reading this book since last week, and now I want to share my thoughts about it. I was excited to review this because I had never heard of most of the tools it features, like OpenRefine, MongoDB, and MapReduce. The book has 360 pages and surprisingly covers a lot of topics. Along with that is the GitHub repository for all the code. Practical Data Analysis is all about applications of statistical methodologies in computer science. I find it very useful since this was not taught in my statistics class. In college, we only practiced statistics in fields like sociology, psychology, agriculture, economics, chemistry, biology, industrial engineering, and many others, but never in computer science itself; we only dealt with it when coding in R or SAS. Hal Varian once said in this video that "...we've got at least hundred statisticians on Google..." And I was curious about that; I mean, what are they doing at Google? What are the statistical tools they use?

Python: Venn Diagram

A Venn diagram is very useful for visualizing operations between events/sets, so in this post we will learn how to visualize one in Python. First, we need to install the module matplotlib-venn from the terminal or command prompt (the install command is noted in the sketch below). Now that we have it, here are the three set operations I visualized: $A\cup B$, $A\cap B$, and $A^c\cap B$.
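A minimal sketch with matplotlib-venn, shading the regions for $A\cup B$ (the other two operations just shade different region ids):

```python
# Requires: pip install matplotlib-venn
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

v = venn2(subsets=(1, 1, 1), set_labels=('A', 'B'))
for region in ('10', '01', '11'):  # '10' = A only, '01' = B only, '11' = A and B
    v.get_patch_by_id(region).set_color('skyblue')
plt.title(r'$A \cup B$')
plt.show()
```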

Python: Download and Install Spyder in Ubuntu

Python offers modules such as scipy, numpy, and pandas for data analysis, and I am going to use these as an alternative to R. To get started, I recommend installing the Python IDE Spyder. If you haven't yet installed Python on your computer, don't worry, as it will automatically be installed as well.

1. Open Ubuntu Software Center
2. Search for Spyder
3. Click Install

Once successfully installed, open it and try running some arithmetic on the console, or try the script window and press F5 to execute.

Spyder in Ubuntu 12.10