How To Load Machine Learning Data From Files In Python
The most common data format in machine learning is the CSV (comma-separated values) file. In this tutorial I show four ways to load data from such files and then prepare it. I also cover some best practices for getting the correct data type, handling missing values, and dealing with an optional header row. The four approaches are:
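- with the csv module from the standard library
- with np.loadtxt()
- with np.genfromtxt()
- with pd.read_csv()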
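To make the three concerns above (data type, missing values, header) concrete, here is a minimal sketch on a tiny in-memory CSV. The sample data, the column names, and the "---" marker for missing values are assumptions made up for illustration; they are not part of the spambase dataset.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample: one header row, "---" marks a missing value.
SAMPLE = """f1,f2,label
0.5,1.5,1
---,2.5,0
3.0,---,1
"""

# NumPy: skip the header row and replace the missing marker with 0.0.
data_np = np.genfromtxt(
    io.StringIO(SAMPLE),
    delimiter=",",
    dtype=np.float32,
    skip_header=1,
    missing_values="---",
    filling_values=0.0,
)

# pandas: the first row is taken as the header by default,
# "---" is read as NaN and then filled with 0.0.
df = pd.read_csv(io.StringIO(SAMPLE), na_values=["---"], dtype=np.float32)
data_pd = df.fillna(0.0).to_numpy()

print(data_np.shape, data_np.dtype)
print(np.array_equal(data_np, data_pd))
```

Both loaders should end up with the same float32 array here. Filling with 0.0 is only one possible choice; depending on the dataset, dropping the rows or filling with a column mean may be more appropriate.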
The code and all Machine Learning tutorials can be found on GitHub.
```python
import csv

import numpy as np
import pandas as pd

# download data from https://archive.ics.uci.edu/ml/datasets/spambase
FILE_NAME = "spambase.data"

# 1) load with the csv module
with open(FILE_NAME, 'r') as f:
    data = list(csv.reader(f, delimiter=","))

data = np.array(data, dtype=np.float32)
print(data.shape)

# 2) load with np.loadtxt()
# optional: skiprows=1
data = np.loadtxt(FILE_NAME, delimiter=",", dtype=np.float32)
print(data.shape, data.dtype)

# 3) load with np.genfromtxt()
# optional: skip_header=0, missing_values="---", filling_values=0.0
data = np.genfromtxt(FILE_NAME, delimiter=",", dtype=np.float32)
print(data.shape)

# split into X and y (label in the last column)
n_samples, n_features = data.shape
n_features -= 1
X = data[:, 0:n_features]
y = data[:, n_features]
print(X.shape, y.shape)
print(X[0, 0:5])

# or if y is the first column
# X = data[:, 1:n_features+1]
# y = data[:, 0]

# 4) load with pandas: read_csv()
# optional: na_values=['---']
df = pd.read_csv(FILE_NAME, header=None, skiprows=0, dtype=np.float32)
df = df.fillna(0.0)

# dataframe to numpy
data = df.to_numpy()
print(data[4, 0:5])

# convert datatypes in numpy
# data = np.asarray(data, dtype=np.float32)
# print(data.dtype)
```
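Unlike the NumPy and pandas loaders, csv.reader has no options for skipping a header or replacing missing values, so with the first approach both have to be handled by hand. The following is a rough sketch of one way to do that. The helper name load_with_csv, the sample data, and the "---" marker are hypothetical and chosen only for illustration.

```python
import csv
import io

import numpy as np

# Hypothetical sample with a header row and "---" as the missing-value marker.
SAMPLE = "f1,f2,label\n0.5,1.5,1\n---,2.5,0\n3.0,---,1\n"

MISSING = "---"  # assumed marker for missing entries
FILL = 0.0       # value used in place of missing entries


def load_with_csv(f, has_header=True):
    """Read numeric CSV data from an open file object into a float32 array."""
    reader = csv.reader(f, delimiter=",")
    # Consume the first row if the file has a header.
    header = next(reader) if has_header else None
    rows = [
        [FILL if cell.strip() == MISSING else float(cell) for cell in row]
        for row in reader
        if row  # skip blank lines
    ]
    return header, np.array(rows, dtype=np.float32)


header, data = load_with_csv(io.StringIO(SAMPLE))

# Split into features and labels (label in the last column).
X, y = data[:, :-1], data[:, -1]
print(header, X.shape, y.shape)
```

For a real file, the same function can be called inside a with open(FILE_NAME, 'r') as f: block, passing has_header=False for files such as spambase.data that have no header row.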