How to Create Dummy Data in R
What Is Dummy Data?
Dummy data are used to simulate real data. When real data are not yet available, dummy data can stand in for them, so the expected results of a study can be simulated from a set of initial assumptions. Software developers use dummy data to build and test products before deploying them to production, and at the early stages of development they may have to rely on dummy data alone. In research, dummy data are used for test runs of a data analysis before the real data become available. These test runs make it possible to quality-check the analysis and resolve as many issues as possible upfront, so the final run can be completed sooner once the real data arrive.
Another important use of dummy data is to illustrate or educate. The dummy data generated in this post are for educational purposes. We will use them (in other posts) to show how certain graphs are created in R.
Description of the Data
Let’s call our dummy dataset the rash dummy data. The dataset consists of 400 patients suffering from some kind of rash. Each patient is treated with one of two products (Product A or Product B). We will generate the size of the rash (surface area in square centimeters) before treatment and use it to derive how much of the rash is left 2 weeks after treatment. We will also generate the gender (male, female) and location (USA, UK, Canada, Mexico) of each patient.
Generating the Dummy Data in R
There are different methods of generating dummy data in R. You can either manually enter each value (not feasible for large datasets), or use a random number generator to generate the values.
Option 1: Manually enter the values if the dataset is small
This works only for a small dataset of about 5 to 10 rows. For dummy data with hundreds or thousands of rows, typing in the values by hand is not practical.
#not preferred for large datasets
rashdummy1 <- data.frame(patientn=c(1,2,3,4,5),
  gender=c("M","F","F","M","F"),
  productid=c("A","B","B","A","A"),
  baselarea=c(230,365,390,201,298),
  rashcleared=c(200,310,90,100,91)
)
Option 2: Use a random number generator (Preferred Option)
First, we create vectors that will become the variables (columns) in our dummy dataset. Then we will merge all the variables to get the final dataset.
#400 patients
patientn <- 1:400
#Assign the gender of each patient using the rbinom() function; the 0/1 codes are labeled later (0 = Male, 1 = Female)
#400 rows/patients, one trial each, with probability 0.5 of drawing a 1
gender <- rbinom(400, 1, 0.5)
#Alternate the products so that 200 patients receive Product A and the other 200 receive Product B
productid <- rep(c("Product A","Product B"), times=200)
#Assign the countries in blocks of 50 patients: USA, UK, Canada, Mexico, then the same pattern again for patients 201 to 400
country <- rep(rep(c("USA","UK","Canada","Mexico"), each=50), times=2)
#Baseline rash area before treatment (normal distribution, mean 400) and a second variable for further derivation below
baselarea <- rnorm(400,400,60) #baseline area
baselarea2 <- rnorm(400,400,65) #baseline area2
#Factor by which to decrease the baseline rash area, per product and country
#Within each country block of 50, the values alternate between Product A and Product B
sfactor <- rep(c(
  rep(c(0.3,0.9), times = 25), #USA
  rep(c(0.4,0.5), times = 25), #UK
  rep(c(0.2,0.4), times = 25), #Canada
  rep(c(0.3,0.7), times = 25)  #Mexico
), times = 2)
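Because rbinom() and rnorm() draw random numbers, the generated values change every time the code is run. If the same dummy dataset is needed on each run (for example, so that graphs in later posts match), one option is to fix the random seed first. The sketch below assumes an arbitrary seed value of 2024, which is not part of the original code; any integer would do.

```r
# Assumption: 2024 is an arbitrary seed chosen for illustration
set.seed(2024)
gender_run1 <- rbinom(400, 1, 0.5)
baselarea_run1 <- rnorm(400, 400, 60)

# Resetting the same seed reproduces exactly the same draws
set.seed(2024)
gender_run2 <- rbinom(400, 1, 0.5)
baselarea_run2 <- rnorm(400, 400, 60)

identical(gender_run1, gender_run2)        # TRUE
identical(baselarea_run1, baselarea_run2)  # TRUE
```

The call to set.seed() has to come before the first random draw and should be made only once per script run.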
Now we put all the variables/columns together using the data.frame() function. Then we derive the variable that holds the percentage of the rash area left after 2 weeks of treatment.
#Put all variables together
rashdummy2a <- data.frame(patientn,gender,productid,country,baselarea,baselarea2,sfactor)
#Install the dplyr package (on its own or as part of the tidyverse), then load it to use %>%, mutate() and select()
library(dplyr)
#Apply treatment effect per product and country - percentage of rash area left
rashdummy2b <- rashdummy2a %>% mutate(rashleft = baselarea2*sfactor/baselarea*100)
#Format gender: 0 = Male, 1 = Female
rashdummy2c <- mutate(rashdummy2b, genderc = ifelse(gender == 0, "Male", "Female"))
#Keep only the variables needed in the final dataset
rashDummyFinal <- select(rashdummy2c, patientn,productid,genderc,country,rashleft)
The final dummy dataset has five variables: patient identifier, product received, gender, patient location, and a variable that indicates the percentage of the baseline rash area left after treatment.
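A quick way to confirm that the assignment vectors line up as described is to cross-tabulate them. The sketch below rebuilds only the product and country vectors in base R, written out to the full 400 rows, and counts the patients in each country-by-product cell. The names design and cell_counts are introduced here for illustration only.

```r
# Rebuild the two assignment vectors (no random numbers involved)
productid <- rep(c("Product A", "Product B"), times = 200)
country <- rep(rep(c("USA", "UK", "Canada", "Mexico"), each = 50), times = 2)

# Hypothetical helper objects for the check
design <- data.frame(patientn = 1:400, productid, country)
cell_counts <- table(design$country, design$productid)

# Every country-by-product cell should contain 50 patients
print(cell_counts)
```

A balanced table like this means each country contributes equally to both products, which keeps the per-country comparisons later in the post on equal footing.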
As mentioned earlier, this dummy data will be used in other posts to illustrate how to create various graphs in R.
Generating Results from the Dummy Data
We will now generate some descriptive statistics from the dummy data. The following R code generates the count, mean, standard deviation, standard error, and 95% confidence interval for each product group within each country.
1. Descriptive Statistics
#compute descriptive statistics per country per product group
statsrash <- rashDummyFinal %>%
  group_by(country,productid) %>%
  summarise(
    count = n(),
    mean_r = mean(rashleft, na.rm=TRUE),
    sd_r = sd(rashleft, na.rm=TRUE),
    se_r = sd_r/sqrt(count),
    ci95lower = mean_r - se_r*1.96,
    ci95upper = mean_r + se_r*1.96
  )
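The multiplier 1.96 in the code above is the normal-distribution quantile for a 95% interval. With 50 patients in each country-by-product group, a t quantile gives a slightly wider interval. The sketch below compares the two multipliers in base R on a single simulated group; the vector x, its parameters, and the seed are arbitrary values chosen for illustration and are not taken from the dummy dataset.

```r
set.seed(1)  # arbitrary seed, for illustration only
x <- rnorm(50, mean = 50, sd = 15)  # hypothetical rashleft values for one group

n <- length(x)
mean_x <- mean(x)
se_x <- sd(x) / sqrt(n)

# Normal-based multiplier, as in the summarise() call above
ci_normal <- mean_x + c(-1, 1) * 1.96 * se_x

# t-based multiplier with n - 1 degrees of freedom
t_mult <- qt(0.975, df = n - 1)
ci_t <- mean_x + c(-1, 1) * t_mult * se_x

t_mult  # about 2.01 for n = 50
```

For dummy data the difference is negligible, but the t-based version matches what t.test(x)$conf.int reports.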
2. Difference in Means of the Two Groups
Let us now generate some more dummy results from the above dummy data using ANOVA. We will calculate the difference in means between the two products and the corresponding 95% confidence interval. Please pay little or no attention to the choice of statistical method used below. The aim here is to generate some dummy result data that will be used in another post to illustrate how certain graphs are created in R. The results below are generated for each subgroup of country and gender. We are intentionally not applying a multiplicity correction.
#Create function to be used to generate the dummy results: mean difference and 95% CI
subgr <- function(sbg){
  rashdummy_0 <- rashDummyFinal
  #sbg is passed unquoted; deparse(substitute()) turns it into the column name
  rashdummy_0$sbgroup <- rashdummy_0[[deparse(substitute(sbg))]]
  sbgroups <- unique(rashdummy_0$sbgroup)
  #create container for values
  diffci_1 <- matrix(NA, length(sbgroups), 4)
  #loop generates results for each subgroup
  for (x in seq_along(sbgroups)) {
    #ANOVA: diff and CI
    rashdummy_1 <- filter(rashdummy_0, sbgroup==sbgroups[x])
    ANOVA1 <- aov(rashleft ~ productid, data=rashdummy_1)
    ANOVA2 <- TukeyHSD(ANOVA1,'productid')
    #Put result in container: elements 1 to 3 of the TukeyHSD table are diff, lwr and upr
    diffci_1[x,1] <- sbgroups[x]
    diffci_1[x,2] <- ANOVA2$productid[1]
    diffci_1[x,3] <- ANOVA2$productid[2]
    diffci_1[x,4] <- ANOVA2$productid[3]
  }
  #convert result matrix to data frame and add column names before returning
  diffci_2 <- data.frame(diffci_1)
  names(diffci_2) <- c("Subgroup","diff","lowerCL","upperCL")
  return(diffci_2)
}
#run function to get results for country and gender as subgroups
res_cntry <- subgr(country)
res_gendr <- subgr(genderc)
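With only two product groups, the interval from TukeyHSD() coincides with the confidence interval from a pooled-variance two-sample t-test, which makes a handy cross-check. The sketch below uses a small simulated data frame (toy, with an arbitrary seed and parameters) rather than the dummy dataset itself. Note the sign: TukeyHSD() reports Product B minus Product A, so the groups are passed to t.test() in that order.

```r
set.seed(42)  # arbitrary seed, for illustration only
toy <- data.frame(
  productid = factor(rep(c("Product A", "Product B"), times = 25)),
  rashleft  = rnorm(50, mean = 50, sd = 15)
)

# ANOVA followed by Tukey's HSD, as in subgr()
tukey <- TukeyHSD(aov(rashleft ~ productid, data = toy), "productid")$productid

# Pooled-variance t-test, Product B minus Product A
tt <- t.test(toy$rashleft[toy$productid == "Product B"],
             toy$rashleft[toy$productid == "Product A"],
             var.equal = TRUE)

# The two sets of numbers should agree up to numerical precision
tukey[1, c("diff", "lwr", "upr")]
c(diff = unname(tt$estimate[1] - tt$estimate[2]),
  lwr = tt$conf.int[1], upr = tt$conf.int[2])
```

The agreement holds only for two groups; with three or more, Tukey's intervals are wider than unadjusted pairwise t intervals.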
The variables diff, lowerCL and upperCL come out as character variables even though they hold numbers. This happens because a matrix can store only one data type, so the subgroup label in the first column turns the whole result container into character. The following R code (based on the dplyr package) converts them to numeric.
#converting character variables to numeric using the function mutate_at() from the dplyr package
res_cntry <- res_cntry %>%
  mutate_at(c('diff', 'lowerCL','upperCL'), as.numeric)
res_gendr <- res_gendr %>%
  mutate_at(c('diff', 'lowerCL','upperCL'), as.numeric)
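The type change can be reproduced in a few lines of base R. The sketch below first shows a matrix switching to character once a text label is stored in it, then shows one possible alternative that collects each subgroup's result as a one-row data frame so the numeric columns stay numeric. The names m, rows, and res are hypothetical, and the numbers are made up for illustration.

```r
# A matrix of NA values becomes character as soon as one cell receives a string
m <- matrix(NA, 2, 2)
m[1, 1] <- "USA"
m[1, 2] <- 12.5
typeof(m)  # "character"

# Alternative: one data frame row per subgroup, combined at the end
rows <- list(
  data.frame(Subgroup = "USA", diff = 12.5),
  data.frame(Subgroup = "UK",  diff = 3.1)
)
res <- do.call(rbind, rows)

# diff stays numeric, so no conversion step is needed afterwards
is.numeric(res$diff)  # TRUE
```

Inside a loop, the same idea amounts to appending a one-row data frame to a list on each iteration and calling do.call(rbind, ...) once after the loop.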