Text Classification
This section builds a Chinese news classifier in two ways: classic scikit-learn models on TF-IDF features, and Keras neural networks using pretrained word vectors from the GitHub project Chinese Word Vectors (Sogou News, 300d).

The corpus is the reduced Sogou news dataset. Each category directory (e.g. `C000008`) holds plain-text news articles. After extracting the downloaded `.20061127.zip` archive into a `CN_Corpus` directory, the layout is:
    CN_Corpus
    └── SogouC.reduced
        └── Reduced
            ├── C000008
            ├── C000010
            ├── C000013
            ├── C000014
            ├── C000016
            ├── C000020
            ├── C000022
            ├── C000023
            └── C000024
The nine category codes map to readable labels:

category_labels = {
'C000008': '_08_Finance',
'C000010': '_10_IT',
'C000013': '_13_Health',
'C000014': '_14_Sports',
'C000016': '_16_Travel',
'C000020': '_20_Education',
'C000022': '_22_Recruit',
'C000023': '_23_Culture',
'C000024': '_24_Military'
}
Each category's documents are split 80%/20% into training and test sets, giving the following directory structure:
    data
    ├── test
    │   ├── _08_Finance
    │   ├── _10_IT
    │   ├── _13_Health
    │   ├── _14_Sports
    │   ├── _16_Travel
    │   ├── _20_Education
    │   ├── _22_Recruit
    │   ├── _23_Culture
    │   └── _24_Military
    └── train
        ├── _08_Finance
        ├── _10_IT
        ├── _13_Health
        ├── _14_Sports
        ├── _16_Travel
        ├── _20_Education
        ├── _22_Recruit
        ├── _23_Culture
        └── _24_Military
`load_datasets()` returns the training texts (`X_train_data`), training labels (`y_train`), test texts (`X_test_data`), and test labels (`y_test`):
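`load_datasets()` itself is not listed in the notebook; the following is a minimal sketch of what it presumably does, walking `data/train` and `data/test` and using the `_08_Finance`-style directory names as labels. The paths, encoding, and print format are assumptions based on the output shown below.

```python
import os

def load_datasets(base_dir='data'):
    """Hypothetical loader: returns (X_train_data, y_train, X_test_data, y_test)."""
    def load_split(split):
        texts, labels = [], []
        split_dir = os.path.join(base_dir, split)
        for label in sorted(os.listdir(split_dir)):        # e.g. '_08_Finance'
            label_dir = os.path.join(split_dir, label)
            fnames = sorted(os.listdir(label_dir))
            for fname in fnames:
                # files under data/ are assumed to already hold segmented, space-separated text
                with open(os.path.join(label_dir, fname), encoding='utf-8') as f:
                    texts.append(f.read())
                labels.append(label)
            print('label: %s, len: %d' % (label, len(fnames)))
        print('%s corpus len: %d' % (split, len(texts)))
        return texts, labels

    X_train_data, y_train = load_split('train')
    X_test_data, y_test = load_split('test')
    return X_train_data, y_train, X_test_data, y_test
```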
X_train_data, y_train, X_test_data, y_test = load_datasets()
label: _08_Finance, len: 1500
label: _10_IT, len: 1500
label: _13_Health, len: 1500
label: _14_Sports, len: 1500
label: _16_Travel, len: 1500
label: _20_Education, len: 1500
label: _22_Recruit, len: 1500
label: _23_Culture, len: 1500
label: _24_Military, len: 1500
train corpus len: 13500
label: _08_Finance, len: 490
label: _10_IT, len: 490
label: _13_Health, len: 490
label: _14_Sports, len: 490
label: _16_Travel, len: 490
label: _20_Education, len: 490
label: _22_Recruit, len: 490
label: _23_Culture, len: 490
label: _24_Military, len: 490
test corpus len: 4410
X_train_data[1000]
' '
y_train[1000]
'_08_Finance'
Features are extracted with TF-IDF (term frequency–inverse document frequency): the term frequency counts how often a word occurs in a document, the inverse document frequency down-weights words that occur in many documents, and the TF-IDF weight of a word is the product of the two, so words frequent in one document but rare across the corpus get the highest weights. scikit-learn's `TfidfVectorizer` computes the whole TF-IDF matrix in one step.

The documents are tokenized into space-separated words beforehand, and stopwords are removed via the `stop_words` argument:
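The documents in `X_train_data` are already space-joined word tokens (see `X_train_data[1000]` above), but the segmentation step is not shown. Below is a minimal sketch of a `preprocess` helper, assuming jieba is used for Chinese word segmentation; the function name matches its use further down, everything else is an assumption.

```python
import jieba

def preprocess(text):
    """Hypothetical preprocessing: segment Chinese text with jieba and
    join the tokens with spaces so TfidfVectorizer can split on whitespace."""
    tokens = jieba.cut(text.strip())
    return ' '.join(t for t in tokens if t.strip())
```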
stopwords = open('dict/stop_words.txt', encoding='utf-8').read().split()
# TF-IDF feature extraction
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_data)
words = tfidf_vectorizer.get_feature_names()
X_train_tfidf.shape
(13500, 223094)
len(words)
223094
With the TF-IDF features ready, we train a multinomial Naive Bayes classifier, scikit-learn's `MultinomialNB`:
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
For new documents, reuse the fitted `tfidf_vectorizer` and call `transform` (not `fit_transform`, which would refit the vocabulary) to get their TF-IDF features, then pass them to the classifier's `predict`:
news_lastest = ["360360360360360360360",
"50",
"519"]
X_new_data = [preprocess(doc) for doc in news_lastest]
X_new_data
[' ',
' ',
' ']
X_new_tfidf = tfidf_vectorizer.transform(X_new_data)
predicted = classifier.predict(X_new_tfidf)
predicted
array(['_08_Finance', '_20_Education', '_24_Military'], dtype='<U13')
On the held-out test set this Naive Bayes model reaches 84.35% accuracy, which serves as the benchmark for the other models. `classification_report` breaks the result down per class:
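The evaluation code is not shown above; presumably the test documents go through the same `transform` + `predict` path. A sketch, reusing the names already defined:

```python
from sklearn.metrics import classification_report, confusion_matrix

X_test_tfidf = tfidf_vectorizer.transform(X_test_data)   # transform, not fit_transform
predicted = classifier.predict(X_test_tfidf)
```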
print(classification_report(y_test, predicted))
precision recall f1-score support
_08_Finance 0.88 0.88 0.88 488
_10_IT 0.72 0.87 0.79 403
_13_Health 0.82 0.84 0.83 478
_14_Sports 0.95 1.00 0.97 466
_16_Travel 0.86 0.92 0.89 455
_20_Education 0.71 0.87 0.79 401
_22_Recruit 0.91 0.65 0.76 690
_23_Culture 0.80 0.77 0.79 513
_24_Military 0.94 0.89 0.92 516
accuracy 0.84 4410
macro avg 0.84 0.86 0.84 4410
weighted avg 0.85 0.84 0.84 4410
confusion_matrix(y_test, predicted)
array([[429, 12, 15, 10, 5, 3, 6, 4, 4],
[ 20, 352, 6, 3, 5, 2, 10, 4, 1],
[ 3, 38, 403, 0, 6, 16, 5, 7, 0],
[ 0, 1, 0, 464, 0, 0, 0, 1, 0],
[ 5, 11, 0, 0, 419, 5, 2, 12, 1],
[ 4, 11, 5, 1, 2, 350, 13, 14, 1],
[ 22, 21, 57, 8, 14, 87, 448, 32, 1],
[ 3, 25, 4, 4, 34, 23, 5, 394, 21],
[ 4, 19, 0, 0, 5, 4, 1, 22, 461]])
The second model is logistic regression, wrapped in a scikit-learn `Pipeline` that chains TF-IDF vectorization and the classifier:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

text_clf_lr = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])
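The fitting and evaluation step is omitted; presumably it mirrors the Naive Bayes flow. A sketch:

```python
text_clf_lr.fit(X_train_data, y_train)
predicted_lr = text_clf_lr.predict(X_test_data)
print(classification_report(y_test, predicted_lr))
```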
precision recall f1-score support
_08_Finance 0.87 0.91 0.89 465
_10_IT 0.77 0.86 0.81 440
_13_Health 0.91 0.82 0.86 546
_14_Sports 0.98 0.99 0.98 483
_16_Travel 0.90 0.90 0.90 488
_20_Education 0.79 0.91 0.85 429
_22_Recruit 0.86 0.85 0.85 495
_23_Culture 0.86 0.75 0.80 556
_24_Military 0.95 0.92 0.93 508
accuracy 0.88 4410
macro avg 0.88 0.88 0.88 4410
weighted avg 0.88 0.88 0.88 4410
Logistic regression reaches about 88% accuracy, clearly above the Naive Bayes benchmark. Its confusion matrix:
array([[425, 11, 7, 1, 2, 4, 9, 5, 1],
[ 23, 377, 4, 3, 9, 3, 12, 7, 2],
[ 9, 41, 447, 0, 6, 23, 10, 10, 0],
[ 0, 2, 0, 478, 0, 1, 0, 1, 1],
[ 8, 12, 0, 0, 440, 8, 3, 16, 1],
[ 1, 6, 2, 0, 1, 389, 20, 10, 0],
[ 8, 7, 22, 1, 5, 26, 420, 6, 0],
[ 11, 23, 8, 6, 24, 30, 15, 419, 20],
[ 5, 11, 0, 1, 3, 6, 1, 16, 465]])
The third model is a linear SVM, trained with `SGDClassifier` and hinge loss; it edges out logistic regression slightly:
from sklearn.linear_model import SGDClassifier

text_clf_svm = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2')),
])
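Again the fit/evaluate step is implied rather than shown; a sketch:

```python
text_clf_svm.fit(X_train_data, y_train)
predicted_svm = text_clf_svm.predict(X_test_data)
print(classification_report(y_test, predicted_svm))
```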
precision recall f1-score support
_08_Finance 0.87 0.92 0.90 463
_10_IT 0.78 0.85 0.81 446
_13_Health 0.93 0.82 0.87 558
_14_Sports 0.99 0.99 0.99 488
_16_Travel 0.91 0.91 0.91 489
_20_Education 0.82 0.92 0.86 437
_22_Recruit 0.89 0.85 0.87 513
_23_Culture 0.85 0.83 0.84 503
_24_Military 0.96 0.91 0.93 513
accuracy 0.89 4410
macro avg 0.89 0.89 0.89 4410
weighted avg 0.89 0.89 0.89 4410
array([[428, 11, 5, 1, 5, 3, 5, 8, 0],
[ 23, 382, 4, 0, 6, 8, 11, 7, 2],
[ 9, 39, 453, 0, 7, 19, 10, 9, 0],
[ 0, 1, 0, 485, 0, 2, 0, 1, 0],
[ 7, 14, 0, 0, 449, 7, 3, 14, 1],
[ 3, 7, 1, 1, 1, 403, 14, 15, 0],
[ 12, 8, 22, 0, 4, 25, 437, 8, 0],
[ 4, 15, 5, 3, 15, 17, 9, 411, 19],
[ 4, 13, 0, 0, 3, 6, 1, 17, 468]])
For the Keras models we reload the raw corpus, getting every document's text together with an integer class label:

texts, labels = load_raw_datasets()

The label indices correspond to the categories (and their one-hot encodings) as follows:
Label index | Category code | Category | One-hot label
---|---|---|---
0 | C000008 | Finance | [1, 0, 0, 0, 0, 0, 0, 0, 0]
1 | C000010 | IT | [0, 1, 0, 0, 0, 0, 0, 0, 0]
2 | C000013 | Health | [0, 0, 1, 0, 0, 0, 0, 0, 0]
3 | C000014 | Sports | [0, 0, 0, 1, 0, 0, 0, 0, 0]
4 | C000016 | Travel | [0, 0, 0, 0, 1, 0, 0, 0, 0]
5 | C000020 | Education | [0, 0, 0, 0, 0, 1, 0, 0, 0]
6 | C000022 | Recruit | [0, 0, 0, 0, 0, 0, 1, 0, 0]
7 | C000023 | Culture | [0, 0, 0, 0, 0, 0, 0, 1, 0]
8 | C000024 | Military | [0, 0, 0, 0, 0, 0, 0, 0, 1]
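`load_raw_datasets()` is not listed either; a minimal sketch, assuming it walks the `Reduced` category directories, runs the same `preprocess` segmentation as above, and assigns the integer label indices from the table. The base path, file encoding, and the way `labels_index` is built are assumptions.

```python
import os

# order matches the label indices in the table above
categories = ['C000008', 'C000010', 'C000013', 'C000014', 'C000016',
              'C000020', 'C000022', 'C000023', 'C000024']
labels_index = {cat: idx for idx, cat in enumerate(categories)}

def load_raw_datasets(base_dir='CN_Corpus/SogouC.reduced/Reduced'):
    """Hypothetical loader: returns (texts, labels) with integer labels 0..8."""
    texts, labels = [], []
    for cat in categories:
        cat_dir = os.path.join(base_dir, cat)
        for fname in sorted(os.listdir(cat_dir)):
            with open(os.path.join(cat_dir, fname), encoding='gb18030', errors='ignore') as f:
                texts.append(preprocess(f.read()))   # same segmentation helper as above
            labels.append(labels_index[cat])
    return texts, labels
```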
The pretrained word-vector file is plain text: line 1 gives the vocabulary size and vector dimension, and each following line is a (Chinese) word followed by its 300 vector components:

    364180 300
    0.003146 0.582671 0.049029 -0.312803 0.522986 0.026432 -0.097115 0.194231 -0.362708 ...
embeddings_index = load_pre_trained()
Found 364180 word vectors, dimension 300
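`load_pre_trained()` is not listed; below is a sketch of a loader matching the file format and the printed message above. The file path (`sgns.sogou.word`, the Sogou News vectors from the Chinese Word Vectors project) is an assumption.

```python
import numpy as np

def load_pre_trained(path='sgns.sogou.word'):
    """Hypothetical loader: returns a dict mapping each word to its 300-dim vector."""
    embeddings_index = {}
    with open(path, encoding='utf-8') as f:
        vocab_size, dim = map(int, f.readline().split())   # header line: '364180 300'
        for line in f:
            values = line.rstrip().split(' ')
            word, coefs = values[0], np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print('Found %s word vectors, dimension %s' % (len(embeddings_index), dim))
    return embeddings_index
```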
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np
MAX_SEQUENCE_LEN = 1000  # max number of tokens kept per document
MAX_WORDS_NUM = 20000    # size of the vocabulary kept by the tokenizer
VAL_SPLIT_RATIO = 0.2    # fraction of the data held out for validation
tokenizer = Tokenizer(num_words=MAX_WORDS_NUM)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print(len(word_index)) # number of distinct tokens found
# print(word_index.get('')) # get word index
dict_swaped = lambda _dict: {val:key for (key, val) in _dict.items()}
word_dict = dict_swaped(word_index) # swap key-value
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LEN)
labels_categorical = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels_categorical.shape)
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels_categorical = labels_categorical[indices]
# split data by ratio
val_samples_num = int(VAL_SPLIT_RATIO * data.shape[0])
x_train = data[:-val_samples_num]
y_train = labels_categorical[:-val_samples_num]
x_val = data[-val_samples_num:]
y_val = labels_categorical[-val_samples_num:]
`word_index` contains every token seen in the corpus, not just the most frequent ones, but `texts_to_sequences` only emits indices for the top `MAX_WORDS_NUM` (20,000) words, so no index of 20000 or above appears in `data`:
len(data[data>=20000])
0
# convert a row of indices back to the original document (index 0 is padding)
for w_index in data[0]:
    if w_index != 0:
        print(word_dict[w_index], end=' ')
category_labels[dict_swaped(labels_index)[np.argmax(labels_categorical[0])]]
'_20_Education'
Next we build the embedding matrix: it has `MAX_WORDS_NUM + 1` rows and `EMBEDDING_DIM` columns, row `i` holds the pretrained vector of the word whose `word_index` value is `i`, and row 0 stays all-zero because index 0 is the padding value produced by `pad_sequences`. About 92.35% of the 20,000 kept words have a pretrained vector; the rest remain all-zero rows:
EMBEDDING_DIM = 300 # embedding dimension
embedding_matrix = np.zeros((MAX_WORDS_NUM+1, EMBEDDING_DIM)) # row 0 is reserved for the padding index 0
for word, i in word_index.items():
embedding_vector = embeddings_index.get(word)
if i < MAX_WORDS_NUM:
if embedding_vector is not None:
# Words not found in embedding index will be all-zeros.
embedding_matrix[i] = embedding_vector
nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
nonzero_elements / MAX_WORDS_NUM
0.9235
Each document is now a fixed-length sequence of word indices produced by Keras' `Tokenizer` API, and the first layer of every model is an `Embedding` layer that maps those indices to dense vectors. Its three main arguments are:

- `input_dim`: vocabulary size, here `MAX_WORDS_NUM + 1` (one extra row for the padding index);
- `output_dim`: dimension of the embedding vectors, here `EMBEDDING_DIM`;
- `input_length`: length of each input sequence, here `MAX_SEQUENCE_LEN`.

Passing `weights=[embedding_matrix]` initializes the `Embedding` layer with the pretrained vectors, and `trainable=False` freezes them during training.

Model 1 is a plain Keras `Sequential` model that learns its embedding from scratch: the `Embedding` output is flattened with `Flatten` and fed through fully connected (`Dense`) layers:
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding
input_dim = x_train.shape[1]
model1 = Sequential()
model1.add(Embedding(input_dim=MAX_WORDS_NUM+1,
output_dim=EMBEDDING_DIM,
input_length=MAX_SEQUENCE_LEN))
model1.add(Flatten())
model1.add(Dense(64, activation='relu', input_shape=(input_dim,)))
model1.add(Dense(64, activation='relu'))
model1.add(Dense(len(labels_index), activation='softmax'))
model1.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])
history1 = model1.fit(x_train,
y_train,
epochs=30,
batch_size=128,
validation_data=(x_val, y_val))
Train on 14328 samples, validate on 3582 samples
Epoch 1/30
14328/14328 [==============================] - 59s 4ms/step - loss: 3.1273 - acc: 0.2057 - val_loss: 1.9355 - val_acc: 0.2510
Epoch 2/30
14328/14328 [==============================] - 56s 4ms/step - loss: 2.0853 - acc: 0.3349 - val_loss: 1.8037 - val_acc: 0.3473
Epoch 3/30
14328/14328 [==============================] - 56s 4ms/step - loss: 1.7210 - acc: 0.4135 - val_loss: 1.2498 - val_acc: 0.5731
......
Epoch 29/30
14328/14328 [==============================] - 56s 4ms/step - loss: 0.5843 - acc: 0.8566 - val_loss: 1.3564 - val_acc: 0.6516
Epoch 30/30
14328/14328 [==============================] - 56s 4ms/step - loss: 0.5864 - acc: 0.8575 - val_loss: 0.5970 - val_acc: 0.8501
Keras layers expose their parameters directly: `layer.get_weights()` returns the layer's weights as a list of numpy arrays, and `layer.set_weights(weights)` sets them from a list of numpy arrays with the same shapes as those returned by `layer.get_weights()`. Here we pull the learned embedding matrix out of model 1:
embedding_custom = model1.layers[0].get_weights()[0]
embedding_custom
array([[ 0.39893672, -0.9062594 , 0.35500282, ..., -0.73564297,
0.50492775, -0.39815223],
[ 0.10640696, 0.18888871, 0.05909824, ..., -0.1642032 ,
-0.02778293, -0.15340094],
[ 0.06566656, -0.04023357, 0.1276007 , ..., 0.04459211,
0.08887506, 0.05389333],
...,
[-0.12710813, -0.08472785, -0.2296919 , ..., 0.0468552 ,
0.12868881, 0.18596107],
[-0.03790742, 0.09758633, 0.02123675, ..., -0.08180046,
0.10254312, 0.01284804],
[-0.0100647 , 0.01180602, 0.00446023, ..., 0.04730382,
-0.03696882, 0.00119566]], dtype=float32)
Besides `get_weights`, `get_config()` returns the layer's configuration:
model1.layers[0].get_config()
{'activity_regularizer': None,
'batch_input_shape': (None, 1000),
'dtype': 'float32',
'embeddings_constraint': None,
'embeddings_initializer': {'class_name': 'RandomUniform',
'config': {'maxval': 0.05, 'minval': -0.05, 'seed': None}},
'embeddings_regularizer': None,
'input_dim': 20001,
'input_length': 1000,
'mask_zero': False,
'name': 'embedding_13',
'output_dim': 300,
'trainable': True}
plot_history(history1)
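`plot_history` is a small helper that is not listed above; the following sketch shows what it presumably plots, using matplotlib and the `acc`/`val_acc` keys that appear in the training logs (newer Keras versions use `accuracy`/`val_accuracy` instead).

```python
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot training vs. validation accuracy and loss over epochs."""
    acc = history.history['acc']           # 'accuracy' in newer Keras
    val_acc = history.history['val_acc']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs = range(1, len(acc) + 1)

    plt.figure(figsize=(10, 4))
    plt.subplot(1, 2, 1)
    plt.plot(epochs, acc, 'b-', label='train acc')
    plt.plot(epochs, val_acc, 'r-', label='val acc')
    plt.legend(); plt.title('Accuracy')

    plt.subplot(1, 2, 2)
    plt.plot(epochs, loss, 'b-', label='train loss')
    plt.plot(epochs, val_loss, 'r-', label='val loss')
    plt.legend(); plt.title('Loss')
    plt.show()
```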
Over 30 epochs the training accuracy climbs to about 86% while the validation accuracy swings around well below it, a clear sign of overfitting with an embedding learned from scratch. Model 2 keeps the same architecture but loads the pretrained `embedding_matrix` and freezes it with `trainable=False`:
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding
input_dim = x_train.shape[1]
model2 = Sequential()
model2.add(Embedding(input_dim=MAX_WORDS_NUM+1,
output_dim=EMBEDDING_DIM,
weights=[embedding_matrix],
input_length=MAX_SEQUENCE_LEN,
trainable=False))
model2.add(Flatten())
model2.add(Dense(64, activation='relu', input_shape=(input_dim,)))
model2.add(Dense(64, activation='relu'))
model2.add(Dense(len(labels_index), activation='softmax'))
model2.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])
history2 = model2.fit(x_train,
y_train,
epochs=10,
batch_size=128,
validation_data=(x_val, y_val))
Train on 14328 samples, validate on 3582 samples
Epoch 1/10
14328/14328 [==============================] - 37s 3ms/step - loss: 1.3124 - acc: 0.6989 - val_loss: 0.7446 - val_acc: 0.8088
Epoch 2/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.2831 - acc: 0.9243 - val_loss: 0.5712 - val_acc: 0.8551
Epoch 3/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.1183 - acc: 0.9704 - val_loss: 0.6261 - val_acc: 0.8624
Epoch 4/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.0664 - acc: 0.9801 - val_loss: 0.6897 - val_acc: 0.8607
Epoch 5/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.0549 - acc: 0.9824 - val_loss: 0.7199 - val_acc: 0.8660
Epoch 6/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.0508 - acc: 0.9849 - val_loss: 0.7261 - val_acc: 0.8582
Epoch 7/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.0513 - acc: 0.9865 - val_loss: 0.8251 - val_acc: 0.8585
Epoch 8/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.0452 - acc: 0.9858 - val_loss: 0.7891 - val_acc: 0.8707
Epoch 9/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.0469 - acc: 0.9865 - val_loss: 0.8663 - val_acc: 0.8680
Epoch 10/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.0418 - acc: 0.9867 - val_loss: 0.9048 - val_acc: 0.8640
plot_history(history2)

With the frozen pretrained embeddings, validation accuracy reaches roughly 86–87% within a few epochs, but the gap to the near-perfect training accuracy shows the dense model still overfits. Model 3 switches to a 1D convolutional network, built with the Keras functional API on top of the same frozen embedding layer:
from keras.layers import Dense, Input, Embedding
from keras.layers import Conv1D, MaxPooling1D, Flatten
from keras.models import Model
embedding_layer = Embedding(input_dim=MAX_WORDS_NUM+1,
output_dim=EMBEDDING_DIM,
weights=[embedding_matrix],
input_length=MAX_SEQUENCE_LEN,
trainable=False)
sequence_input = Input(shape=(MAX_SEQUENCE_LEN,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x) # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)
model3 = Model(sequence_input, preds)
model3.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['acc'])
history3 = model3.fit(x_train,
y_train,
epochs=6,
batch_size=128,
validation_data=(x_val, y_val))
Train on 14328 samples, validate on 3582 samples
Epoch 1/6
14328/14328 [==============================] - 77s 5ms/step - loss: 0.9943 - acc: 0.6719 - val_loss: 0.5129 - val_acc: 0.8582
Epoch 2/6
14328/14328 [==============================] - 76s 5ms/step - loss: 0.4841 - acc: 0.8571 - val_loss: 0.3929 - val_acc: 0.8841
Epoch 3/6
14328/14328 [==============================] - 77s 5ms/step - loss: 0.3483 - acc: 0.8917 - val_loss: 0.4022 - val_acc: 0.8724
Epoch 4/6
14328/14328 [==============================] - 77s 5ms/step - loss: 0.2763 - acc: 0.9100 - val_loss: 0.3441 - val_acc: 0.8942
Epoch 5/6
14328/14328 [==============================] - 76s 5ms/step - loss: 0.2194 - acc: 0.9259 - val_loss: 0.3014 - val_acc: 0.9107
Epoch 6/6
14328/14328 [==============================] - 77s 5ms/step - loss: 0.1749 - acc: 0.9387 - val_loss: 0.3895 - val_acc: 0.8788
plot_history(history3)
Because the third `MaxPooling1D(35)` spans the entire remaining sequence length, it effectively performs global max pooling before the final dense layers. With pretrained embeddings plus convolutions, validation accuracy peaks at about 91% after five epochs, the best of the three Keras models.
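The same effect can be written more explicitly with Keras' `GlobalMaxPooling1D` layer, which removes the need to hard-code the pooling size and the `Flatten`. A sketch of that variant (not from the original notebook), reusing the `embedding_layer` defined above:

```python
from keras.layers import Dense, Input, Conv1D, MaxPooling1D, GlobalMaxPooling1D
from keras.models import Model

sequence_input = Input(shape=(MAX_SEQUENCE_LEN,), dtype='int32')
x = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = GlobalMaxPooling1D()(x)          # replaces MaxPooling1D(35) + Flatten
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model3_gmp = Model(sequence_input, preds)
model3_gmp.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
```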