In machine learning one often needs to split a dataset into two parts, a training set and a testing set. R's “sample” command does not split a data frame directly, but there is a simple workaround. Essentially, use the “sample” command to randomly select a set of row index numbers, and then use the selected index numbers to divide the dataset into training and testing datasets. Below is sample code for doing this. In the code below I use 20% of the data for testing and the remaining 80% for training.

# R ships with a few built-in datasets; mtcars is one of them.
data = mtcars
dim(data)  # 32 11

# Sample row indexes for the test set (0.2 * 32 = 6.4, which sample truncates to 6)
indexes = sample(1:nrow(data), size=0.2*nrow(data))

# Split data
test = data[indexes,]
dim(test)  # 6 11
train = data[-indexes,]
dim(train) # 26 11
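
Because sample draws at random, every run of the code above produces a different split. Below is a minimal sketch of a reproducible variant. It assumes a fixed seed (the value 42 is an arbitrary choice and not part of the original code) and uses floor to make the test-set size explicit. It also adds two sanity checks to confirm the split is clean.

```r
# Fix the random seed so the same rows are selected on every run
# (42 is an arbitrary value chosen for illustration)
set.seed(42)

data = mtcars
n_test = floor(0.2 * nrow(data))   # 6 of the 32 rows

# Sample row indexes without replacement (the default for sample)
indexes = sample(nrow(data), size = n_test)

test = data[indexes, ]
train = data[-indexes, ]

# Sanity checks: the sizes add up and no row appears in both sets
stopifnot(nrow(test) + nrow(train) == nrow(data))
stopifnot(length(intersect(rownames(test), rownames(train))) == 0)
```

If either check fails, stopifnot raises an error; if both pass, the script finishes silently.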

About Ritesh Agrawal

I am an applied researcher who enjoys anything related to statistics, large data analysis, data mining, machine learning and data visualization.


  1. Doug Hill says:

    Thanks for posting this technique. It is exactly what I was looking for.

  2. bilal says:

    Thanks a lot pal !!!

  3. Ram says:

    Good one. Thank you.
    One point though: I had to add a comma after the index like this:
    test = data[indexes,]
    train = data[-indexes,]
    When I omitted the comma, I got the message: undefined columns selected.
    (I am using R version 3.0.)

    • AB says:

      I agree with Ram. I am getting the same issue without the comma:
      Error in `[.data.frame`(data, indexes) : undefined columns selected
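
The error reported above can be reproduced with a short sketch. With a single index and no comma, R treats a data frame as a list of columns, so the indexes are read as column numbers rather than row numbers. The row numbers below are hypothetical values fixed for illustration, not output from the post's code.

```r
data = mtcars
indexes = c(3, 15, 27)   # hypothetical row numbers, fixed for illustration

# With the comma: indexes select rows, and all columns are kept
rows = data[indexes, ]
dim(rows)   # 3 11

# Without the comma: indexes select columns. mtcars has only 11 columns,
# so asking for column 15 or 27 fails. tryCatch captures the error message
# instead of stopping the script.
result = tryCatch(data[indexes], error = function(e) conditionMessage(e))
print(result)
```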

  4. Reshma says:

    I wanted to split data (a .csv file) into a development sample and a validation sample with the ratio 70:30. I used these commands:
    id train test<-data[id,]
    Now I am not sure whether my data is divided or not.
    So can you help me?
    And please tell me how to split data as a CSV file.

  5. Ram A says:

    Before trying to split, you have to read the csv file into an R dataframe.
    An example might help.
    Here I am reading the “tickets.csv” file and splitting it 70:30. After splitting, I use the nrow statement to check the size of the dataframes.
    (Note: You might not have a “tickets.csv” file; you should use your own csv file.)

    indexes = sample(1:nrow(tickets), size=0.3*nrow(tickets))
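
The steps described in this comment can be sketched end to end as follows. This is an illustration under stated assumptions, not the commenter's original code: the CSV contents, the temporary file path, and the output file names are all hypothetical, and round is used here so the sample size is a whole number.

```r
# Create a small stand-in CSV so the example is self-contained;
# in practice, point csv_path at your own file instead
csv_path = tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10) * 2), csv_path, row.names = FALSE)

# Read the CSV file into a data frame
tickets = read.csv(csv_path)

# Randomly pick 30% of the rows for testing; the rest go to training
indexes = sample(1:nrow(tickets), size = round(0.3 * nrow(tickets)))
test = tickets[indexes, ]
train = tickets[-indexes, ]

# Check the size of the resulting data frames
nrow(test)    # 3
nrow(train)   # 7

# Write each part back out as its own CSV file (hypothetical file names)
test_path = file.path(tempdir(), "tickets_test.csv")
train_path = file.path(tempdir(), "tickets_train.csv")
write.csv(test, test_path, row.names = FALSE)
write.csv(train, train_path, row.names = FALSE)
```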

  6. Ish says:

    Thanks for this post. I tried it and it worked for my test data, but my train data seems to have the same number of rows as the original dataframe. I was supposed to split a dataset of 499 records (2/3 for training and 1/3 for testing). As said, nrow for the test data is 165, but for the training data it is still 499. This was the code:

  7. Don't we need the set.seed command here?

  8. Victor Ordu says:

    Thanks for this post. I had a hunch that this was relatively simple to do in base R…

  9. vijay says:

    This could be simplest way to do that, looks easy also (not much syntax):

    indexes = sample(nrow(data), 0.2*nrow(data))
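
A quick way to see that this shorter call behaves like the one in the post: when sample is given a single number n, it samples from 1:n. The sketch below assumes an arbitrary seed of 1, used only so the two draws can be compared.

```r
data = mtcars

set.seed(1)   # arbitrary seed, used only to make the two draws comparable
a = sample(1:nrow(data), 0.2 * nrow(data))

set.seed(1)   # reset to the same seed before the second draw
b = sample(nrow(data), 0.2 * nrow(data))

# Both calls should select the same row indexes
identical(a, b)
```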

  10. kishorekumar says:

    Hi all,

    Please help me check whether the data is divided correctly.


