R Programming Interview
R Commander can be used to import data into R. To start the R Commander GUI, the user loads the
package by typing library(Rcmdr) into the console. There are 3 different ways in which data can be
imported with it:
• Users can select the data set in the dialog box or enter the name of the data set (if they
know).
• Data can also be entered directly using the editor of R Commander via Data->New Data Set.
However, this works well when the data set is not too large.
• Data can also be imported from a URL or from a plain text file (ASCII), from any other
statistical package or from the clipboard.
3) Two vectors X and Y are defined as follows: X <- c(3, 2, 4) and Y <- c(1, 2).
What will be the output of vector Z that is defined as Z <- X*Y?
When vectors have different lengths, R multiplies them element by element and recycles the elements
of the shorter vector until every element of the longer vector has been used. Here Y is treated as
(1, 2, 1), and R warns that the longer object length is not a multiple of the shorter object length.
The output of the above code will be:
Z = (3, 4, 4)
NaN (Not a Number) is used to represent impossible values whereas NA (Not Available) is used to
represent missing values. The best way to answer this question is to mention that simply deleting
missing values is not a good idea, because the probable cause of a missing value could be some
problem with data collection, programming or the query. It is better to find the root cause of the
missing values and then take the necessary steps to handle them.
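As a minimal sketch of the distinction, the snippet below builds a made-up vector (the values are illustrative, not from the original) and shows how is.na() and is.nan() treat NA and NaN, and how a summary can skip them without deleting data.

```r
# Hypothetical vector containing one missing value (NA) and one impossible value (NaN)
x <- c(4, NA, 0 / 0, 10)

na_flags  <- is.na(x)    # TRUE for both NA and NaN
nan_flags <- is.nan(x)   # TRUE only for NaN

# Summaries can skip missing values instead of dropping them from the data set
mean_x <- mean(x, na.rm = TRUE)
```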
5) R language has several packages for solving a particular problem. How do you make a decision on
which one is the best to use?
CRAN package ecosystem has more than 6000 packages. The best way for beginners to answer this
question is to mention that they would look for a package that follows good software development
principles. The next thing would be to look for user reviews and find out if other data scientists or
analysts have been able to solve a similar problem.
6) Which function in R language is used to find out whether the means of 2 groups are equal to each
other or not?
t.test()
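A short sketch of such a comparison follows. The two groups are made-up numbers chosen only for illustration; t.test() runs Welch's two-sample test by default.

```r
# Hypothetical measurements for two groups
group_a <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3)
group_b <- c(6.2, 6.8, 7.1, 6.5, 7.0, 6.6)

# Welch two-sample t-test (the default, var.equal = FALSE)
result <- t.test(group_a, group_b)

# A small p-value suggests the two group means differ
p_value <- result$p.value
```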
7) What is the best way to communicate the results of data analysis using R language?
The best possible way to do this is to combine the data, code and analysis results in a single document
using knitr for reproducible research. This helps others to verify the findings, add to them and
engage in discussions. Reproducible research makes it easy to redo the experiments by inserting
new data and applying it to a different problem.
R language has homogeneous and heterogeneous data structures. Homogeneous data structures
hold objects of the same type: vector, matrix and array. Heterogeneous data structures can hold
objects of different types: data frames and lists.
9) What is the value of f (2) for the following R code?
b <- 4
f <- function(a) {
  b <- 3
  b^3 + g(a)
}
g <- function(a) {
  a * b
}
The answer to the above code snippet is 35. The value of "a" passed to the function is 2, and the
value of "b" defined inside the function f(a) is 3, so f evaluates 3^3 + g(2). The function g is
defined in the global environment, so due to lexical scoping in R it uses the global value of b, 4,
not 3, and returns 2*4 = 8 to the function f. The result is 3^3 + 8 = 35.
10) What is the process to create a table in R language without using external files?
MyTable <- data.frame()
MyTable <- edit(MyTable)
The above code opens a spreadsheet-like data editor for entering data into MyTable. edit() returns
the edited copy, so the result is assigned back to MyTable.
Transpose t () is the easiest method for reshaping the data before analysis.
12) What are the with() and by() functions used for?
The with() function evaluates an expression using the variables of a given dataset, and the by()
function applies a function to each level of a factor.
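The following sketch illustrates both functions on the built-in mtcars data set. The choice of data set and columns is an assumption made purely for the example.

```r
# with(): evaluate an expression using the columns of a data set
avg_mpg <- with(mtcars, mean(mpg))

# by(): apply a function to each level of a grouping variable (here, cylinder count)
mpg_by_cyl <- by(mtcars$mpg, mtcars$cyl, mean)
```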
13) dplyr package is used to speed up data frame management code. Which package can be
integrated with dplyr for large fast tables?
data.table
14) In base graphics system, which function is used to add elements to a plot?
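text() adds to an existing plot; boxplot() normally starts a new plot but can also draw onto an existing one when called with add = TRUE.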
15) What are the different type of sorting algorithms available in R language?
Bucket Sort
Selection Sort
Quick Sort
Bubble Sort
Merge Sort
HDFS can be used for storing the data for long-term. MapReduce jobs submitted from either Oozie,
Pig or Hive can be used to encode, improve and sample the data sets from HDFS into R. This helps to
leverage complex analysis tasks on the subset of data prepared in R.
17) What will be the output of log (-5.8) when executed on R console?
Executing the above on the R console returns NaN (Not a Number) together with a warning that NaNs
were produced, because the logarithm of a negative number is not defined.
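printmessage <- function(a) {
  if (is.na(a)) print("a is a missing value.")
  else if (a < 0) print("a is negative")   # message assumed; original text lost
  else print("a is zero or positive")      # message assumed; original text lost
}
printmessage(NA)
The output for the above R programming code will be "a is a missing value." The function is.na() is
used to check if the input passed is a missing value.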
adegenet
Data frame can contain heterogeneous inputs while a matrix cannot. In matrix only similar data
types can be stored whereas in a data frame there can be different data types like characters,
integers or other data frames.
The rbind() function can be used to combine datasets by rows in R language, provided the datasets
have the same columns.
8TB is the memory limit for 64-bit system memory and 3GB is the limit for 32-bit system memory.
26) What are the data types in R on which binary operators can be applied?
28) What will be the class of the resulting vector if you concatenate a number
and NA?
numeric
K-Nearest Neighbour is one of the simplest machine learning classification algorithms that is a subset
of supervised learning based on lazy learning. In this algorithm the function is approximated locally
and any computations are deferred until classification.
30) What will be the class of the resulting vector if you concatenate a number
and a character?
character
31) Write code to build an R function powered by C?
32) Suppose you want to know all the values in c(1, 3, 5, 7, 10) that are not in c(1, 5,
10, 12, 14). Which in-built function in R can be used to do this? Also, how can this
be achieved without using the in-built function?
Using in-built function - setdiff(c(1, 3, 5, 7, 10), c(1, 5, 10, 12, 14))
Without using in-built function - c(1, 3, 5, 7, 10)[!c(1, 3, 5, 7, 10) %in% c(1, 5, 10, 12, 14)]
Both return 3 7.
34) What will be the class of the resulting vector if you concatenate a number
and a logical?
numeric
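The concatenation questions above can be checked directly at the console. This is a small sketch of R's coercion rules, in which c() promotes its arguments to the most general type present.

```r
cls_na   <- class(c(1, NA))     # "numeric": NA takes on the type of its neighbours
cls_chr  <- class(c(1, "a"))    # "character": the number is converted to a string
cls_lgl  <- class(c(1, TRUE))   # "numeric": TRUE is coerced to 1
combined <- c(1, TRUE)          # 1 1
```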
35) Write a function in R language to replace the missing value in a vector with
the mean of that vector.
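One possible answer is sketched below. The function name replace_na_with_mean is hypothetical, and the sketch assumes a numeric vector with at least one non-missing value.

```r
# replace_na_with_mean is a hypothetical helper name used for illustration
replace_na_with_mean <- function(x) {
  # Compute the mean over the non-missing values and write it into the NA slots
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

filled <- replace_na_with_mean(c(1, NA, 3))   # 1 2 3
```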
36) What happens if the application object is not able to handle an event?
If the programmer wants the output to be a vector or a matrix, then the sapply function is used,
whereas if the programmer wants the output to be a list, then lapply is used. There is one more
function, vapply, which is preferred over sapply because it allows the programmer to specify the
output type. The disadvantage of vapply is that it is more verbose and slightly harder to use.
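A minimal comparison of the three functions on a made-up list (the list contents are an assumption for illustration):

```r
nums <- list(a = 1:3, b = 4:6)

as_list   <- lapply(nums, mean)   # always returns a list
as_vector <- sapply(nums, mean)   # simplified to a named numeric vector

# vapply states the expected result type up front and fails if FUN returns anything else
checked <- vapply(nums, mean, FUN.VALUE = numeric(1))
```

The extra FUN.VALUE argument is the verbosity mentioned above; in exchange, a function that unexpectedly returns the wrong type fails immediately instead of silently changing the shape of the result.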
seq_along(6) will produce a vector of length 1, containing just 1, because its argument has length 1,
whereas seq(6) will produce the sequential vector from 1 to 6, c(1, 2, 3, 4, 5, 6).
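A quick check of what the two calls return. seq_along() generates one index per element of its argument, so it only yields several values when given a longer vector.

```r
one_index   <- seq_along(6)                  # 1: the input has a single element
one_to_six  <- seq(6)                        # 1 2 3 4 5 6
three_index <- seq_along(c("a", "b", "c"))   # 1 2 3
```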
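The read.csv() function is used to read a .csv file in R language. Below is a simple example, where "input.csv" is a placeholder for the path to your own file:
filecontent <- read.csv("input.csv")
print(filecontent)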
A comment line in R language begins with a hash symbol (#).
41) How can you verify if a given object "X" is a matrix data object?
If the function call is.matrix(X) returns TRUE then X can be termed as a matrix data object.
42) What do you understand by element recycling in R?
If two vectors with different lengths perform an operation –the elements of the shorter vector will
be re-used to complete the operation. This is referred to as element recycling.
Example: for vectors A <- c(1,2,0,4) and B <- c(3,6), the result of A*B will be (3, 12, 0, 24). Here 3
and 6 of vector B are repeated when computing the result.
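The example can be reproduced at the console. The second computation spells out the recycling explicitly with rep(), which is only an illustration of what R does implicitly.

```r
A <- c(1, 2, 0, 4)
B <- c(3, 6)

recycled <- A * B                               # B is reused as 3, 6, 3, 6
explicit <- A * rep(B, length.out = length(A))  # the same computation written out
```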
43) How can you verify if a given object "X" is a matrix data object?
If the function call is.matrix(X) returns TRUE then X can be considered a matrix data object,
otherwise not.
44) How will you measure the probability of a binary response variable in R
language?
Logistic regression can be used for this and the function glm () in R language provides this
functionality.
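A hedged sketch of such a model, using the built-in mtcars data set. Modelling the transmission type am from the weight wt is an arbitrary choice made for the example.

```r
# Logistic regression: am is 0 (automatic) or 1 (manual), wt is weight in 1000 lbs
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability of a manual transmission for a hypothetical car with wt = 3.0
p <- predict(fit, newdata = data.frame(wt = 3.0), type = "response")
```

Setting type = "response" returns the probability itself rather than the log-odds.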
The sample() function can be used to select a random sample of size ‘n’ from a huge dataset.
The subset() function is used to select variables and observations from a given dataset.
The coin package in R provides various options for re-randomization and permutation-based
statistical tests. When test assumptions cannot be met, this package serves as the best
alternative to classical methods, as it does not assume random sampling from well-defined
populations.
If a developer wants to skip the current iteration of a loop in the code without terminating it then
they can use the next statement. Whenever the R parser comes across the next statement in the
code, it skips evaluation of the loop further and jumps to the next iteration of the loop.
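A small loop illustrating next. The task of collecting odd numbers is made up for the example.

```r
odds <- c()
for (i in 1:6) {
  if (i %% 2 == 0) next   # skip even numbers and move on to the next i
  odds <- c(odds, i)      # reached only for odd i
}
odds   # 1 3 5
```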
A matrix of scatterplots can be produced using pairs(). The pairs() function takes various parameters
such as formula, data, subset, labels, etc.
formula- A formula basically like ~a+b+c . Each term gives a separate variable in the pairs plots
where the terms should be numerical vectors. It basically represents the series of variables used in
pairs.
data- It basically represents the dataset from which the variables have to be taken for building a
scatterplot.
It can be done using the match() function, which returns the position of the first appearance of a
particular element.
The other is to use %in%, which returns a Boolean value, either TRUE or FALSE.
The is.element() function also returns TRUE or FALSE based on whether the element is
present in a vector or not.
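The three approaches side by side, on a made-up vector:

```r
v <- c(10, 20, 30, 20)

first_pos <- match(20, v)        # 2: position of the first match
present   <- 20 %in% v           # TRUE
absent    <- is.element(40, v)   # FALSE
```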
51) What is the difference between library() and require() functions in R
language?
There is no real difference between the two when packages are not being loaded inside a function.
require() is usually used inside functions: it gives a warning and returns FALSE when a particular
package is not found. On the flip side, library() gives an error if the desired
package cannot be loaded.
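The difference in failure behaviour can be observed with a package name that does not exist. "notARealPackage" is a hypothetical name chosen for this sketch.

```r
# require() warns and returns FALSE when the package is missing
ok <- suppressWarnings(require("notARealPackage", character.only = TRUE))

# library() raises an error instead, which tryCatch() turns into a message here
res <- tryCatch(
  library("notARealPackage", character.only = TRUE),
  error = function(e) "library() raised an error"
)
```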
52) What are the rules to define a variable name in R programming language?
A variable name in R programming language can contain letters and digits along with the special
characters dot (.) and underscore (_). Variable names in R language can begin with a letter or
the dot symbol. However, if the variable name begins with a dot symbol, it should not be followed
by a digit.
The current R working environment of a user that has user defined objects like lists, vectors, etc. is
referred to as Workspace in R language.
order()
55) How will you list all the data sets available in all R packages?
58) How will you drop variables using indices in a data frame?
df
##   v1 v2 v3 v4
## 1  1  2  3  4
## 2  2  3  4  5
## 3  3  4  5  6
## 4  4  5  6  7
## 5  5  6  7  8
Suppose we want to drop variables v2 and v3. They can be dropped using negative
indices as follows -
df1<-df[-c(2,3)]
df1
##   v1 v4
## 1  1  4
## 2  2  5
## 3  3  6
## 4  4  7
## 5  5  8
rnorm function generates "n" normal random numbers based on the mean and standard deviation
arguments passed to the function.
rnorm(n, mean = , sd = )
runif function generates "n" uniform random numbers in the interval between the minimum and maximum
values passed to the function.
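A brief usage sketch for both generators. The seed and parameter values are arbitrary choices for illustration.

```r
set.seed(42)   # fix the seed so the draws are reproducible

normal_draws  <- rnorm(5, mean = 10, sd = 2)   # five draws from a normal with mean 10, sd 2
uniform_draws <- runif(5, min = 0, max = 1)    # five draws from the interval between 0 and 1
```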
mat<-matrix(rep(c(TRUE,FALSE),8),nrow=4)
sum(mat)
62) How will you combine multiple different strings like “Data”, “Science”, “in”,
“R”, “Programming” as a single string “Data_Science_in_R_Programming”?
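One way to do this, sketched with paste() and its collapse argument:

```r
words <- c("Data", "Science", "in", "R", "Programming")

# collapse joins the elements of a single vector using the given separator
combined <- paste(words, collapse = "_")
```

When the pieces are passed as separate arguments instead of one vector, paste("Data", "Science", sep = "_") does the same job with sep.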
63) Write a function to extract the first name from the string “Mr. Tom White”.
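A possible answer, assuming the first name is always the second space-separated token. The function name extract_first_name is hypothetical.

```r
# extract_first_name is a hypothetical helper; it assumes a "Title First Last" layout
extract_first_name <- function(full_name) {
  parts <- strsplit(full_name, split = " ", fixed = TRUE)[[1]]
  parts[2]
}

first_name <- extract_first_name("Mr. Tom White")   # "Tom"
```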
64) Can you tell if the equation given below is linear or not ?
Emp_sal = 2000 + 2.5(emp_age)^2
var2<- c("I","Love,"DeZyre")
var2
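x<-5
if(x%%2==0)
print("x is even")   # message assumed; original text lost
else
print("x is odd")    # message assumed; original text lost
## 4: else
## ^
R reports an unexpected 'else' here. The first if() is a complete command on its own, so when the
parser reaches else on a new line at the top level it cannot tell that it relates to the first 'if'.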
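This can be accomplished using the strsplit function, which splits a string at the separator
given in the function call. The output of the strsplit() function is a list. The split argument is
treated as a regular expression by default, so a literal dot needs fixed = TRUE.
strsplit("contact@dezyre.com", split = ".", fixed = TRUE)
## [[1]]
## [1] "contact@dezyre" "com"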
The R base package is the package that is loaded by default whenever the R programming environment is
loaded. The base package provides basic functionalities in the R environment like arithmetic calculations and
input/output.
The merge() function is used to combine two data frames; it identifies the common columns or row
names between the 2 data frames and joins on them. By default, merge() finds the intersection
between two different sets of data.
all.x - It is a logical value that specifies the type of merge. all.x should be set to TRUE if we want all
the observations from data frame X. This results in a left join.
all.y - It is a logical value that specifies the type of merge. all.y should be set to TRUE if we want all
the observations from data frame Y. This results in a right join.
all - The default value for this is FALSE, which means that only matching rows are returned,
resulting in an inner join. This should be set to TRUE if you want all the observations from data frames X
and Y, resulting in an outer join.
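The four join types can be sketched on two small made-up data frames. The column names id, name and score are assumptions for illustration.

```r
# Hypothetical data frames sharing the key column "id"
x <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))
y <- data.frame(id = c(2, 3, 4), score = c(90, 80, 70))

inner <- merge(x, y, by = "id")                 # ids 2, 3
left  <- merge(x, y, by = "id", all.x = TRUE)   # ids 1, 2, 3
right <- merge(x, y, by = "id", all.y = TRUE)   # ids 2, 3, 4
outer <- merge(x, y, by = "id", all = TRUE)     # ids 1, 2, 3, 4
```

Rows without a partner in the other data frame are filled with NA in the columns that come from it.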
70) Write the R programming code for an array of words so that the output is
displayed in decreasing frequency order.
tt <- sort(table(c("a", "b", "a", "a", "b", "c", "a1", "a1", "a1")), decreasing = TRUE)
depth <- 3
tt[1:depth]
Output -
 a a1  b
 3  3  2
The frequency distribution of a categorical variable can be checked using the table() function in R
language. The table() function calculates the count of each category of a categorical variable.
gender = factor(c("M","F","M","F","F","F"))
table(gender)
gender
F M
4 2
Programmers can also calculate the % of values for each categorical group by storing the output in a
data frame and adding a percentage column as shown below -
t = data.frame(table(gender))
t$percent = round(100 * t$Freq / sum(t$Freq), 2)   # reconstructed step; the original line was lost
t
  gender Freq percent
1      F    4   66.67
2      M    2   33.33
The cumulative frequency distribution of a categorical variable can be checked using the cumsum ()
function in R language.
Example –
gender = factor(c("f","m","m","f","m","f"))
y = table(gender)
cumsum(y)
f m
3 6
73) What will be the result of multiplying two vectors in R having different
lengths?
The multiplication of the two vectors will be performed and the output will be displayed with a
warning message like "longer object length is not a multiple of shorter object length". Suppose
there is a vector a <- c(1, 2, 3) and a vector b <- c(2, 3); then the multiplication a*b will
give 2 6 6 with the warning message. The multiplication is performed element by element, but since
the lengths are not the same, the shorter vector b is recycled: its first element is multiplied
with the last element of the longer vector a.
74) R programming language has several packages for data science which are
meant to solve a specific problem, how do you decide which one to use?
The CRAN package repository has more than 6000 packages, so a data scientist needs to follow a
well-defined process and criteria to select the right one for a specific task. When looking for a
package in the CRAN repository, a data scientist should first list all the requirements and issues so
that the chosen R package can address them.
The best way to answer this question is to say that you would look for an R package that follows good
software development principles and practices. For example, you might look at the quality of its
documentation and unit tests. The next step is to check how the R package is used and
read the reviews posted by other users. It is important to know if other data
scientists or data analysts have been able to solve a problem similar to yours. When in
doubt about choosing a particular R package, ask for feedback from R community members
or colleagues to ensure that you are making the right choice.
Data frames in R language can be merged manually using cbind () functions or by using the merge ()
function on common rows or columns.
mydata=data.frame(v1 = c(2,4,12,3,6))
which(mydata$v1==max(mydata$v1))
It returns 3, because 12 is the maximum value and it is in the 3rd row of the variable v1.
A factor variable can be converted to numeric using the as.numeric() function in R language.
However, the variable first needs to be converted to character before being converted to numeric,
because calling as.numeric() directly on a factor does not return the original values; it returns
the internal integer codes of the factor levels.
X1 = as.numeric(as.character(X))
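The pitfall is easy to see on a small made-up factor whose labels are numbers:

```r
X <- factor(c("10", "20", "10", "30"))

codes  <- as.numeric(X)                 # 1 2 1 3: the internal level codes
values <- as.numeric(as.character(X))   # 10 20 10 30: the original values
```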
i) Most calculations can be done on whole vectors, so data scientists can apply
functions to a vector without having to put them in a loop.
ii) It is a Turing-complete language that can be used for any kind of data science task, whether it is in the
field of genetics, statistics or biology.
iii) Being an interpreted language, it does not require a separate compilation step, which makes development
of code easier.
2) List some of your favorite functions in R programming language along with their usage.
12) How can you develop a package in R language and do version control?
This list of 100 data science interview questions is not an exhaustive one and we know that we have
not gotten all the answers here. We request the data science community to help us out with the
questions that we did not get the answers to. Please do chime in with any data science interview
questions related to R programming that you think ought to be here. We will add it in.
Answer : 2 , 6 , 4
R language does vectorized operations. ‘a’ and ‘b’ are two vectors with different lengths. R
multiplies the first element of a with the first element of b, then the second element of a with
that of b, and so on. But in this case, after the second multiplication R hits the end of vector
“b”. In such cases, R starts again with the first element of the shorter vector until each element
of the longer vector is exhausted. The vectorized operation always leads to a vector of length equal
to that of the longer vector.
Question 2 : Scoping Rules
You need to understand the following code and answer a question based on this understanding.
> y <- 3
> f <- function(x) {
+ y <- 2
+ y ^ 2 + g(x)
+ }
> g <- function(x) {
+ x * y
+ }
Answer : 22
If you answered anything other than 22, you probably need to brush up on lexical scoping in
R. The function f(x) returns the value y^2 + g(x). Inside f, y has been defined as 2,
and g(x) is called from inside this function. The value of x is passed to the function g as 6. Now
comes the catch: what is the value of the free variable y in g? Unlike dynamic scoping, where the
value is taken from the calling environment, lexical scoping takes the value of a variable from
the environment where the function is defined. The function g(x) is defined in the global
environment here, and hence the value of y is taken to be 3. Therefore a value of 18 is
returned from the function g(x), and f(6) finally returns 4 + 18 = 22.
You have been assigned to check two race tracks. To complete this task you are expected to
find the means of the total time taken by cars to cross the track. In the following data
assignment, “b” is the vector of total time taken by different cars and “a” is the vector of track
on which this time is taken. The first element of the vector “b” corresponds to the first element
of vector “a” (and so on).
How do you find the mean time of each track using split function?
How do you modify the code, to treat the missing value in the second track record?
Answer : The modified code is as follows :
> lapply(s,mean,na.rm=TRUE)
$`1`
[1] 12.25

$`2`
[1] 37.5