A random forest combines multiple decision trees to obtain an output, and it is flexible enough to handle both classification and regression.
Methods that combine multiple models to produce one result are known as ensemble learning, and random forests are one such ensemble method.
Since random forests are built from decision trees, let's start with a short discussion of decision trees.
A decision with two branches works like this: should you go abroad for education? The branches are yes or no.
If yes, which university should you approach? Stanford: yes/no.
Which bank should you approach for a loan: Federal Bank or HDFC Bank? Decision trees help us make a chain of such decisions over a dataset. Think of a decision tree node as conditional logic, like an if condition.
This is how decisions are made in real life, and the algorithm mimics that decision-making process.
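To make that concrete, here is a minimal sketch of a tree's nodes as plain conditional logic; the variables and values are made up for illustration:

# a decision tree node is just conditional logic, like an if condition
go_abroad = True
university = 'Stanford'

if go_abroad:
    if university == 'Stanford':
        decision = 'apply to Stanford'
    else:
        decision = 'consider other universities'
else:
    decision = 'study locally'

print(decision)  # apply to Stanford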
Now let’s consider the same concept with another example.
Example: suppose a space shuttle launch is scheduled for tomorrow morning. The weather forecast says it may be slightly cloudy, but there is a small window, within a margin of error, in which the shuttle can still be launched. A decision tree would check each condition in turn to reach a launch/no-launch decision.
Once you've understood decision trees, random forests follow naturally: the algorithm trains multiple trees in parallel, each running independently, and combines their individual predictions into one result.
Let's summarize the decision tree above: we use multiple layers of factors that play a considerable part in decision-making, starting from a root node with the first condition, which branches out to two more decision nodes, and so on. With this as a base, a random forest classifier builds multiple decision trees, each trained on a randomly selected subset of the data. Every tree votes with its own prediction, and the majority vote becomes the final output of the model.
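To make the voting step concrete, here is a minimal sketch of majority voting; the five tree predictions are made-up values, not output from a real model:

from collections import Counter

# hypothetical predictions from five trees for a single sample
tree_votes = ['yes', 'no', 'yes', 'yes', 'no']

# the forest's output is the majority vote across the trees
prediction = Counter(tree_votes).most_common(1)[0][0]
print(prediction)  # 'yes': three of the five trees voted yes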
Let’s create and analyse our own implementation of the algorithm.
import pandas as pd

# load the social media usage dataset
df = pd.read_csv('social_data.csv')
df
| | users | age | gender | city | tier | pincode | photos | average likes | average comments | followers | daily user | daily_use_norm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 665061268 | 30 | 0 | 18 | 1 | 695676 | 720 | 540 | 22 | 558 | 9861381 | 0 |
| 1 | 670375745 | 17 | 0 | 4 | 2 | 454465 | 237 | 136 | 9 | 144 | 515712 | 1 |
| 2 | 663125396 | 22 | 0 | 19 | 3 | 290159 | 140 | 1254 | 197 | 1311 | 1168320 | 1 |
| 3 | 679258284 | 26 | 0 | 2 | 2 | 347634 | 516 | 104 | 5 | 108 | 1159142 | 1 |
| 4 | 676211941 | 25 | 1 | 10 | 2 | 656937 | 399 | 502 | 32 | 522 | 3267361 | 1 |
| … | … | … | … | … | … | … | … | … | … | … | … | … |
| 1124 | 673315234 | 98 | 0 | 8 | 3 | 535460 | 34 | 47 | 144 | 71 | 787 | 1 |
| 1125 | 663589437 | 42 | 1 | 17 | 2 | 354073 | 35 | 73 | 92 | 107 | 2971 | 1 |
| 1126 | 676728047 | 99 | 1 | 19 | 3 | 411071 | 27 | 18 | 77 | 75 | 473 | 1 |
| 1127 | 661133538 | 63 | 0 | 16 | 2 | 249356 | 34 | 84 | 159 | 116 | 2083 | 1 |
| 1128 | 684290241 | 87 | 1 | 19 | 2 | 381137 | 25 | 39 | 144 | 77 | 521 | 1 |
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay

# features: every column except the user id and the last two usage columns
X = df.iloc[:, 1:-2]
# target: daily_use_norm, the last column
Y = df.iloc[:, -1]
In the following code, we split X and Y into four subsets: training and test sets for both the features and the target, so we can fit the random forest classification model on the training portion.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, Y_train)
RandomForestClassifier()
Now we use the trained model to predict the test data we held out before training and measure its accuracy.
Y_pred = model.predict(X_test)
accuracy = accuracy_score(Y_test, Y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.9705014749262537
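Accuracy alone can be misleading when classes are imbalanced. Since we already imported the precision, recall, and confusion matrix utilities above, it takes only a few extra lines to inspect them; this is an optional addition to the walkthrough:

# precision and recall for the positive class (daily_use_norm = 1)
print("Precision:", precision_score(Y_test, Y_pred))
print("Recall:", recall_score(Y_test, Y_pred))

# visualize where the classifier confuses the two classes
ConfusionMatrixDisplay(confusion_matrix(Y_test, Y_pred)).plot()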
So we have an accuracy of about 97%, yet we have no idea what tree structures formed during training. The graphviz library can visualize a tree. The following is the code for rendering a decision tree from the trained model.
from sklearn.tree import export_graphviz
import graphviz

# visualize the first tree in the forest
tree = model.estimators_[0]
dot_data = export_graphviz(tree, feature_names=X_train.columns,
                           filled=True, impurity=False, proportion=True)
graph = graphviz.Source(dot_data)
display(graph)  # show the tree inline in the notebook
graph.format = 'png'
graph.render('dtree_render', view=True)  # also save it as dtree_render.png
Random forest regression has a similar build to the classifier: it uses multiple decision trees running independently, each trained on a different sample of the data and a different subset of columns. In the case of regression, the prediction is the aggregate of all trees, that is, the average of their outputs.
In the following example of random forest regression, we will use data collected for Moore's law. Moore's law observes that the number of transistors on a microchip doubles roughly every two years, while the cost of computing halves. Forget the cost part: we will have the year and the number of transistors on a microchip for each year since 1971.
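As a quick sanity check on the doubling claim, we can compute what the law predicts for 2017. This is a back-of-the-envelope sketch, not part of the dataset analysis:

# Moore's law as a formula: counts double roughly every two years
start_year, start_count = 1971, 2300
year = 2017
estimate = start_count * 2 ** ((year - start_year) / 2)
print(f"{estimate:,.0f}")  # roughly 19 billion, in line with the 2017 rows below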
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# year and the highest number of transistors fit inside a microprocessor that year
df = pd.read_csv('moore.csv', names=['year', 'transistors'])
df
| year | transistors |
|---|---|
| 1971 | 2300 |
| 1972 | 3500 |
| 1973 | 2500 |
| 1973 | 2500 |
| 1974 | 4100 |
| … | … |
| 2017 | 18000000000 |
| 2017 | 19200000000 |
| 2018 | 8876000000 |
| 2018 | 23600000000 |
| 2018 | 9000000000 |

162 rows × 2 columns
First, we will separate the X and Y data, and then split it into train and test sets. We will predict the number of transistors from the year.
# separate features and target
X = df['year']
Y = df['transistors']

# split train and test data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
In the following code, we will train the random forest regressor. Be mindful that the feature data must be reshaped to (-1, 1), since scikit-learn expects a 2-D array of shape (n_samples, n_features).
X = df['year']
Y = df['transistors']

# import libraries
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=10, random_state=0)

# training model: X is reshaped to a 2-D column; Y stays 1-D
regressor.fit(X.values.reshape(-1, 1), Y.values)
RandomForestRegressor(n_estimators=10, random_state=0)
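Since a regression forest's prediction is just the average of its trees, we can verify that directly. This is a small check using the fitted regressor above, with a made-up query year:

# each tree in the forest makes its own prediction for the year 2016
query = np.array([[2016]])
tree_preds = np.array([est.predict(query) for est in regressor.estimators_])

print(tree_preds.mean())         # average of the 10 individual trees
print(regressor.predict(query))  # the forest's output matches that average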
Since this is a regression algorithm, we will visualize it in the more traditional regression representation: a scatter plot of the data with a line plot of the model's predictions.
# create an array spanning the minimum and maximum years, in steps of 0.01
grx = np.arange(min(X), max(X), 0.01)
grx = grx.reshape((len(grx), 1))

plt.scatter(X, Y, color='cyan')
plt.plot(grx, regressor.predict(grx), color='red')
plt.title('Random Forest Regression')
plt.xlabel('Year')
plt.ylabel('Transistors')
# plt.yscale('log')  # optional: transistor counts span many orders of magnitude
plt.show()
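One caveat worth noting: a random forest predicts by averaging leaf values it saw during training, so it cannot extrapolate beyond the range of the training years, and predictions for future years flatten out at the edge of the data. A quick check with the fitted regressor:

# both queries fall in each tree's rightmost leaf, so the outputs are identical
print(regressor.predict([[2018]]))  # near the latest observed counts
print(regressor.predict([[2030]]))  # same value: the forest cannot extrapolate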