[AI] 1. Deep Learning with MNIST - Loading the Dataset


Easyho.log 2024. 7. 1. 14:11

The MNIST dataset can be loaded in the following ways.

1) Loading the dataset built into Keras

To load the dataset from Keras, enter the following.

import tensorflow as tf
from tensorflow import keras

import numpy as np
import matplotlib.pyplot as plt

mnist = keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

As a test, I pulled up the third digit in the dataset.

plt.imshow(x_train[2], cmap=plt.cm.binary)
plt.show()

The digit is displayed correctly.

2) Downloading from Kaggle and loading the data from a file path

First, download the dataset from Kaggle.

[Download link]: https://www.kaggle.com/datasets/hojjatk/mnist-dataset/data

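The downloaded folder contains the following files.

There are two image files and two label files, and each one has to be opened with file I/O.

Judging by the file extensions, these are not ordinary text files, so I will open them in binary mode.

Before that, since they do not look like ordinary files, let's take a closer look at them.

The image files are in idx3-ubyte format, which stores the MNIST handwritten digits as raw image data.
Each pixel is one unsigned byte (0 to 255), and each image is a 28 x 28 pixel handwritten digit from 0 to 9.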
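The file extension mirrors the magic number stored at the start of the file. In the IDX format, the first two bytes of the magic number are zero, the third byte is a type code (0x08 means unsigned byte), and the fourth byte is the number of dimensions. Below is a minimal sketch of decoding it. The helper name `describe_magic` is hypothetical, and the sample bytes are written by hand rather than read from the downloaded files.

```python
import struct

# IDX type codes, as listed in the IDX format description
IDX_DTYPES = {
    0x08: "unsigned byte",
    0x09: "signed byte",
    0x0B: "short (2 bytes)",
    0x0C: "int (4 bytes)",
    0x0D: "float (4 bytes)",
    0x0E: "double (8 bytes)",
}

def describe_magic(magic_bytes):
    """Hypothetical helper: split a 4-byte IDX magic number into its parts."""
    zero1, zero2, type_code, num_dims = struct.unpack(">BBBB", magic_bytes)
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX file: first two bytes must be zero")
    return IDX_DTYPES[type_code], num_dims

# Hand-written sample: magic number of an idx3-ubyte file (2051 = 0x00000803)
print(describe_magic(bytes([0x00, 0x00, 0x08, 0x03])))  # ('unsigned byte', 3)
# Hand-written sample: magic number of an idx1-ubyte file (2049 = 0x00000801)
print(describe_magic(bytes([0x00, 0x00, 0x08, 0x01])))  # ('unsigned byte', 1)
```

So "idx3-ubyte" reads as "3-dimensional array of unsigned bytes" (images x rows x columns), and "idx1-ubyte" as a 1-dimensional array of unsigned bytes (the labels).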

The pixel data is preceded by a 16-byte header, so the loading code skips the header and reads the rest of the file. (~~Oh really? I'd like to see it for myself~~)

Shall we see what's in there?

import struct
import numpy as np

def read_idx3_file(file_path):
    with open(file_path, 'rb') as f:
        # The 16-byte header holds four big-endian 32-bit unsigned integers:
        # magic number, image count, row count, column count
        header = struct.unpack('>IIII', f.read(16))
        magic_number, num_images, num_rows, num_cols = header

        print(f"Magic Number: {magic_number}")
        print(f"Number of Images: {num_images}")
        print(f"Number of Rows: {num_rows}")
        print(f"Number of Columns: {num_cols}")

        # The rest of the file is raw pixel data, one unsigned byte per pixel
        image_data = np.frombuffer(f.read(), dtype=np.uint8)
        image_data = image_data.reshape((num_images, num_rows, num_cols))

    return image_data, header

file_path = 'file_path'  # placeholder: path to the training image file
read_idx3_file(file_path)

There is no real need to look, but I was curious. Note that each header field is a 4-byte big-endian integer, so the fields have to be unpacked four bytes at a time. For the training image file, the header reads as follows.

Magic Number: 2051
Number of Images: 60000
Number of Rows: 28
Number of Columns: 28

The header holds only this metadata, not pixel data. To read just the images, the code skips these 16 bytes and reads the rest of the file. The code is as follows.

The label files are read the same way, skipping their 8-byte header.

import numpy as np

def load_images(file_path):
    with open(file_path, 'rb') as f:
        f.read(16)  # skip the 16-byte header
        data = np.frombuffer(f.read(), dtype=np.uint8)  # read the remaining bytes as unsigned 8-bit integers
    num_images = data.size // (28 * 28)
    return data.reshape(num_images, 28, 28)

def load_labels(file_path):
    with open(file_path, 'rb') as f:
        f.read(8)  # skip the 8-byte header
        data = np.frombuffer(f.read(), dtype=np.uint8)  # read the remaining bytes as unsigned 8-bit integers
    return data

# Replace each placeholder with the path to the corresponding downloaded file
train_images = load_images(r"file path")
train_labels = load_labels(r"file path")
test_images = load_images(r"file path")
test_labels = load_labels(r"file path")
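The loaders above hard-code the header sizes (16 and 8 bytes) and the 28 x 28 image shape. As a variation, the shape can be read from the header itself, so that one function handles both the image and label files. This is a minimal sketch that assumes unsigned-byte IDX files. The name `load_idx` is hypothetical, and the demo writes a tiny made-up file instead of using the real dataset.

```python
import os
import struct
import tempfile

import numpy as np

def load_idx(file_path):
    """Hypothetical helper: load a ubyte IDX file, taking the shape from its header."""
    with open(file_path, "rb") as f:
        # Magic number: two zero bytes, a type code, and the number of dimensions
        _, _, type_code, num_dims = struct.unpack(">BBBB", f.read(4))
        if type_code != 0x08:
            raise ValueError("this sketch only handles unsigned-byte data")
        # One big-endian 32-bit size per dimension follows the magic number
        shape = struct.unpack(">" + "I" * num_dims, f.read(4 * num_dims))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)

# Demo with a made-up file: 2 "images" of 3 x 3 pixels (not real MNIST data)
pixels = np.arange(18, dtype=np.uint8)
with tempfile.NamedTemporaryFile(delete=False, suffix=".idx3-ubyte") as tmp:
    tmp.write(struct.pack(">IIII", 2051, 2, 3, 3))  # idx3 header
    tmp.write(pixels.tobytes())
    demo_path = tmp.name

demo_images = load_idx(demo_path)
os.remove(demo_path)
print(demo_images.shape)  # (2, 3, 3)
```

With the real MNIST files, the same call would be expected to return an array of shape (60000, 28, 28) for the training images and (60000,) for the training labels.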
