Tuesday, June 4, 2019
Using RStudio to Prepare and Clean Data
There is now more information available than ever before, and its depth and scope are increasing daily. The explosion of the internet and connected devices has accelerated this, and big data is now big business. With the increase in data available to us has come a growing need to analyse it. Many companies use this data to predict future trends. What has also changed is the tools we use to analyse and present this data in a meaningful way.

In the past, statistical software was very expensive and often had no graphical capabilities. Enter the R programming language, a tool that supports both statistics and graphics. It was first released in 1995, with the first stable build in 2000, and is now on version 3, which was released in 2013. R is a free open source project with over 7000 add-on packages available. Many companies such as Google and Facebook use R for their data analysis.

In this lab book we will look at cleaning and preparing data so it can be analysed. We will use RStudio, which is an IDE (integrated development environment) for the R programming language. RStudio is available in open source and commercial versions. It has two editions, RStudio Desktop and RStudio Server, and runs on Windows, macOS and Linux operating systems.

The data set we use is from the UK government and is based on MOT outlets in England, Scotland and Wales. It contains data such as names, addresses, post codes, telephone numbers and the categories of vehicles tested. A quick look at the dataset shows a lot of blank fields, extra white space, typos in the telephone column, and second telephone numbers separated by the / symbol.

Using RStudio we will attempt to tidy and clean the dataset.
In this lab book we will explain the various commands and techniques used to prepare the data for analysis.

Make a copy of the data to work with
Method: Here we make a copy of the original dataset x2016motsitelist and call it MotList. This is good practice, as you will not contaminate the original dataset.
Test: [screenshot]
Result: From the above screen shot you can see we have renamed our dataset to MotList. Typing the name of the dataset in RStudio prints the dataset to the console.

Get the Structure of Our Data Frame
Method: By using the str() command in the console we get the structure of our data.
Test: [screenshot]
Result: By using the structure command str() we can see that our dataset has 22,980 observations and 14 different variables. The next lines, which contain $, indicate column headings and display some of the values included in these columns. This command simply provides a list of the components and their names.

View the data
Method: We use the head() command to view the data.
Test: [screenshot]
Result: Using this command, the first 6 records are displayed in the console window.

Identify names of columns
Method: We use the names() command to display column names.
Test: [screenshot]
Result: This displays the names of our columns in the console window.

Summary of what is contained in the columns
Method: We use the summary() command to get an overview of the data in our columns.
Test: [screenshot]
Result: The summary() command gives us an overview of every vector in the data frame. In our case it tells us that the length is 22980 rows and that all vectors are of class character.

Missing values
Method: We will use the is.na() command on its own, then is.na() combined with the any() command, and lastly the sum() command to check for missing values in the data.
Test: [screenshot]
Result: The is.na() command returns a Boolean TRUE or FALSE for each value in the data set to tell us whether a missing value is present or not.
Test: [screenshot]
Result: With the use of the any() command we find that there is indeed missing data in the dataset.
Test: [screenshot]
Result: With the use of the sum() command we get the number of missing values, which is 149097 in this case.

Rename columns in our data set
Method: We use the colnames() command to rename the columns in our data set that are named 1, 2, 3, 4, 5 and 7.
Test: [screenshot]
Result: With the use of the above commands we change the names of the columns, using the existing name to identify which column to apply the name change to. We use names(MotList) to verify the result.
Test: [screenshot]

Remove NA from the different categories of vehicle that are MOT tested
Method: We create another copy of our dataset and call it MotListMod. On this dataset we will change the NA values in the columns that we renamed earlier, so that the different categories of vehicles tested will have complete values and no missing data. We do this by giving the dataset name followed by $ and the column name; we then use the which() command with is.na() to select the NA values and change them to the desired value.
Test: [screenshot]
Result: As can be seen from the screen shot above, we have changed the NA values in the six columns of our dataset. Our dataset now tells us whether an MOT test centre carries out tests on the different vehicle categories with Y or N, whereas before it only told us if the centre did (Y), with a blank field for N. Again, we run sum(is.na()) on both datasets. The MotListMod dataset now has far fewer NAs.

Remove and tidy up VTS Telephone column
Method: Firstly, using the gsub() command we remove instances of Tel. and TEL. from our column. Secondly, we separate the column into two sections, number 1 and number 2, with the separate() command, as some of the test centres have two telephone numbers separated by / in the dataset. Thirdly, we tidy up the white space.
Test: [screenshots] Our first attempt used gsub() incorrectly and did not produce the desired outcome; the two screens after it show the desired outcome. The next screens show where the VTS Telephone column is split into different sections, the white space being trimmed from the front of the telephone numbers, and the NAs being removed from VTS Telephone number2.
Result: By using gsub() and identifying the column we wanted to target, we replaced the instances of Tel. and TEL. in our dataset with white space. We then proceeded to split the column into two different sections. This created a lot of NAs in the second column, because not every test centre has two telephone numbers, so to counteract this we replaced the NAs with the value 0. We then tidied up the white space at the start of the two columns.

Write to CSV file in RStudio
Method: We will write the MotListMod3 dataset to a CSV file with the write.csv() command.
Test: [screenshot]
Result: The above command writes the dataset to a CSV file, which can be viewed or shared with others; see the above screen shot of the file in Excel.

Outliers and plot function
Method: Using the hist() command we produce a histogram of the cars column. The column's class had to be changed to a factor to make the function work. We also used the table() command to count the number of Y and N values in this column.
Test: [screenshot] In the screen shot above you can see a histogram of the cars column.
Result: No outliers are present, as the columns for the different types of vehicles tested contain only Y or N. Also, our data was of class character, and this had to be converted to a factor so that we could use the histogram function on the cars column. We used the table() command on the column to display the counts: N = 1054 and Y = 21926.
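The screen shots from the lab book are not reproduced here, so the steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact commands from the screen shots: the four rows of data are invented, the new column names (Motorcycles, Cars, Telephone1, Telephone2) are hypothetical, and the split uses base R strsplit() in place of the separate() command so that no extra package is needed.

```r
# Toy stand-in for the MOT site list. Every value here is invented.
raw <- data.frame(
  `VTS Telephone` = c("Tel. 01632 960001",
                      "01632 960002 / 01632 960003",
                      "TEL. 01632 960004",
                      NA),
  `1` = c("Y", NA, "Y", NA),
  `4` = c("Y", "Y", NA, "Y"),
  check.names = FALSE,        # keep "1", "4" and the space in the names
  stringsAsFactors = FALSE
)

# Work on a copy so the original stays untouched.
MotList <- raw

# Inspect the data frame.
str(MotList)
head(MotList)
names(MotList)
summary(MotList)

# Missing values: is any value missing, and how many are there?
any(is.na(MotList))
na_before <- sum(is.na(MotList))

# Rename columns, using the old name to pick the column (new names are hypothetical).
colnames(MotList)[colnames(MotList) == "1"] <- "Motorcycles"
colnames(MotList)[colnames(MotList) == "4"] <- "Cars"

# Second copy: replace NA with "N" in the vehicle category columns.
MotListMod <- MotList
MotListMod$Motorcycles[which(is.na(MotListMod$Motorcycles))] <- "N"
MotListMod$Cars[which(is.na(MotListMod$Cars))] <- "N"
na_after <- sum(is.na(MotListMod))

# Telephone column: drop "Tel." / "TEL.", split on "/", trim white space.
tel <- gsub("tel\\.", "", MotListMod$`VTS Telephone`, ignore.case = TRUE)
parts <- strsplit(tel, "/", fixed = TRUE)
MotListMod$Telephone1 <- trimws(vapply(parts, function(p) p[1], character(1)))
MotListMod$Telephone2 <- trimws(vapply(parts, function(p) p[2], character(1)))

# Sites with a single number get NA in the second column; replace it with "0".
MotListMod$Telephone2[is.na(MotListMod$Telephone2)] <- "0"

# Write the cleaned data to a CSV file (a temporary file in this sketch).
out_file <- tempfile(fileext = ".csv")
write.csv(MotListMod, out_file, row.names = FALSE)

# Convert to a factor and count the Y and N values.
MotListMod$Cars <- factor(MotListMod$Cars, levels = c("N", "Y"))
car_counts <- table(MotListMod$Cars)
car_counts
```

One point to be aware of: hist() expects numeric input, so for a Y/N factor a bar chart of the counts, for example barplot(car_counts) or plot(MotListMod$Cars), is the closest equivalent.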