Predicting Bitcoin’s Value Using Convolutional Neural Networks & Long Short-Term Memory Cells


What inspired Convolutional Networks?

CNNs are biologically inspired models, rooted in research by D. H. Hubel and T. N. Wiesel. They proposed an explanation for how mammals visually perceive the world using a layered architecture of neurons in the brain, and this in turn inspired engineers to develop similar pattern-recognition mechanisms in computer vision.

In their hypothesis, within the visual cortex, complex functional responses generated by “complex cells” are constructed from more simplistic responses from “simple cells”.

For instance, simple cells would respond to oriented edges, while complex cells would also respond to oriented edges, but with a degree of spatial invariance.

Receptive fields exist for cells, where a cell responds to a summation of inputs from other local cells.

The architecture of deep convolutional neural networks was inspired by the ideas mentioned above:

  • local connections
  • layering
  • spatial invariance (shifting the input signal results in an equally shifted output signal). For example, most of us can recognize specific faces under a variety of conditions because we learn abstractions; these abstractions are invariant to size, contrast, rotation, and orientation.

However, it remains to be seen whether the computational mechanisms of convolutional neural networks are similar to those occurring in the primate visual system. The key mechanisms they introduce are:

  • convolution operation
  • shared weights
  • pooling/subsampling


How does it work?




Step 1 – Prepare a dataset of images



  • Every image is a matrix of pixel values.
  • The range of values that can be encoded in each pixel depends upon its bit size.
  • Most commonly, we have 8 bit or 1 Byte-sized pixels. Thus the possible range of values a single pixel can represent is [0, 255].
  • However, with coloured images, particularly RGB (Red, Green, Blue)-based images, the presence of separate colour channels (3 in the case of RGB images) introduces an additional ‘depth’ field to the data, making the input 3-dimensional.
  • Hence, for a given RGB image of size, say, 255×255 (Width x Height) pixels, we’ll have 3 matrices associated with each image, one for each colour channel.
  • Thus the image in its entirety constitutes a 3-dimensional structure called the Input Volume (255×255×3).
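As a quick sketch in NumPy (the array library used later in this post), a hypothetical 255×255 RGB image is exactly such a volume:

```python
import numpy as np

# A hypothetical 255x255 RGB image: height x width x colour channels.
image = np.zeros((255, 255, 3), dtype=np.uint8)

# Each pixel stores one byte per channel, so values fall in [0, 255].
image[0, 0] = [255, 0, 0]  # set the top-left pixel to pure red

print(image.shape)  # (255, 255, 3) -- the Input Volume
print(image.dtype)  # uint8
```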


Step 2 – Convolution



A convolution is an orderly procedure where two sources of information are intertwined.

  • A kernel (also called a filter) is a matrix that is smaller than the input dimensions of the image and consists of real-valued entries.
  • Kernels are then convolved with the input volume to obtain so-called ‘activation maps’ (also called feature maps).
  • Activation maps indicate ‘activated’ regions, i.e. regions where features specific to the kernel have been detected in the input.
  • The real values of the kernel matrix change with each learning iteration over the training set, indicating that the network is learning to identify which regions are of significance for extracting features from the data.
  • We compute the dot product between the kernel and the input matrix; the convolved value obtained by summing the resultant terms forms a single entry in the activation matrix.
  • The patch selection is then slid (towards the right, or downwards when the boundary of the matrix is reached) by a certain amount called the ‘stride’ value, and the process is repeated until the entire input image has been processed. The process is carried out for all colour channels.
  • Instead of connecting each neuron to all possible pixels, we specify a 2-dimensional region called the ‘receptive field[14]’ (say of size 5×5 units) extending through the entire depth of the input (5×5×3 for a 3-colour-channel input), within which the encompassed pixels are fully connected to the neural network’s input layer. It’s over these small regions that the network layer cross-sections (each consisting of several neurons, called ‘depth columns’) operate and produce the activation map. This reduces computational complexity.
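The sliding dot product described above can be sketched in plain NumPy. This is a minimal single-channel illustration (stride 1, no padding, and a hand-picked edge-detecting kernel as assumptions), not how a framework would actually implement it:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image`, taking a dot product at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    activation = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y*stride:y*stride+kh, x*stride:x*stride+kw]
            activation[y, x] = np.sum(patch * kernel)  # element-wise product, then sum
    return activation

# A vertical-edge kernel responds strongly where intensity changes left-to-right.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
kernel = np.array([[1, -1],
                   [1, -1]], dtype=float)
print(convolve2d(image, kernel))  # non-zero only at the column where the edge sits
```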





Step 3 – Pooling


  • Pooling reduces the spatial dimensions (Width x Height) of the Input Volume for the next Convolutional Layer. It does not affect the depth dimension of the Volume.
  • The transformation is performed either by taking the maximum value observable in the window (called ‘max pooling’) or by taking the average of the values. Max pooling has been favoured over the alternatives due to its better performance characteristics.
  • Pooling is also called downsampling.
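A minimal max-pooling sketch in NumPy, assuming a 2×2 window with a stride equal to the window size (the common setup):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Downsample by taking the maximum of each non-overlapping size x size window."""
    h, w = feature_map.shape
    pooled = np.zeros((h // size, w // size))
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            pooled[y // size, x // size] = feature_map[y:y+size, x:x+size].max()
    return pooled

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 0, 8]], dtype=float)
print(max_pool(fm))  # width and height are halved; depth would be untouched
```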

Step 4 – Normalization (ReLU in our case)


Normalization (ReLU) keeps the math from breaking by turning all negative numbers into 0: a stack of images becomes a stack of images with no negative values.
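In code, ReLU is a one-line element-wise operation:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: negative values become 0, positives pass through."""
    return np.maximum(x, 0)

stack = np.array([[-2.0, 0.5],
                  [ 3.0, -0.1]])
print(relu(stack))  # negatives are zeroed, positives are unchanged
```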

Repeat Steps 2-4 several times: more, smaller images (feature maps) are created at every layer.

Step 5 – Regularization

  • Dropout forces an artificial neural network to learn multiple independent representations of the same data by randomly disabling neurons during the learning phase.
  • Dropout is a vital feature in almost every state-of-the-art neural network implementation.
  • To perform dropout on a layer, you randomly set some of the layer’s values to 0 during forward propagation.
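A sketch of that forward-propagation step, using the common ‘inverted dropout’ variant (the keep probability of 0.8 here is an arbitrary choice):

```python
import numpy as np

def dropout_forward(activations, keep_prob=0.8, rng=None):
    """Randomly zero a fraction of activations and rescale the survivors."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(activations.shape) < keep_prob  # True = keep this neuron
    # Dividing by keep_prob keeps the expected activation the same,
    # so nothing needs rescaling at test time ('inverted' dropout).
    return activations * mask / keep_prob

a = np.ones((2, 4))
out = dropout_forward(a)
print(out)  # entries are either 0 (dropped) or 1/0.8 = 1.25 (kept and rescaled)
```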


Step 6 – Probability Conversion

At the very end of our network (the tail), we’ll apply a softmax function to convert the outputs to probability values for each class.
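A numerically stable softmax can be sketched as:

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # highest score -> highest probability
print(probs.sum())  # 1.0
```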


Step 7 – Choose most likely label (max probability value)


These 7 steps are one forward pass through the network.

So how do we learn the magic numbers?

  • We can learn features and weight values through backpropagation



The other hyperparameters are set by humans, and finding the optimal ones is an active field of research.

e.g. the number of neurons, number of features, size of features, pooling window size, and window stride.

When is a good time to use it?

  • To classify images
  • To generate images (more on that later..)


But CNNs can also be applied to any spatial 2D or 3D data: images, and even sound and text. A rule of thumb is that if your data is just as useful after swapping the rows and columns, like customer data, then you can’t use a CNN.

Recurrent Neural Networks

What is a Recurrent Network?

Recurrent nets are cool; they’re useful for learning sequences of data. The ingredients: an input, a hidden state, and an output.

It has a weight matrix that connects the input to the hidden state, but also a weight matrix that connects the hidden state to the hidden state at the previous time step.

So we could even think of it as the same feedforward network connecting to itself over time (unrolled), since at each training step we pass in not just the input but the input plus the previous hidden state.
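That recurrence can be sketched in a few lines of NumPy (the sizes and random weights here are hypothetical; tanh is a typical choice of activation):

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4  # hypothetical sizes

W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input  -> hidden
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden -> hidden
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: the new hidden state mixes the current input and the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# 'Unrolling': the same weight matrices are reused at every time step.
h = np.zeros(hidden_size)
for x_t in [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]:
    h = rnn_step(x_t, h)
print(h.shape)  # (4,)
```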

The Problem with Recurrent Networks

If we want to predict the last word in the sentence “The grass is green”, that’s totally doable.

But if we want to predict the last word in the sentence “I am French … (2000 words later) … I speak fluent French”, we need to be able to remember long-range dependencies. RNNs are bad at this; they easily forget the long-term past.

This is called the “vanishing gradient problem”: the gradient decays exponentially as it is backpropagated through time.

There are two factors that affect the magnitude of gradients: the weights and the activation functions (or more precisely, their derivatives) that the gradient passes through. If either of these factors is smaller than 1, the gradients may vanish in time; if larger than 1, they may explode.
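A toy illustration of this, treating the backprop chain as a product of identical per-step factors (a deliberate simplification of a real network):

```python
def gradient_magnitude(factor, steps):
    """Product of identical per-step factors, as in a long backprop chain."""
    return factor ** steps

for steps in (10, 50, 100):
    print(steps, gradient_magnitude(0.9, steps), gradient_magnitude(1.1, steps))
# With factors below 1 the gradient vanishes (0.9**100 is about 2.7e-5);
# with factors above 1 it explodes (1.1**100 is about 1.4e4).
```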

But there exists a solution! Enter the LSTM Cell.

The LSTM Cell (Long Short-Term Memory Cell)

We’ve placed no constraints on how our model updates, so its knowledge can change pretty chaotically: at one frame it thinks the characters are in the US, at the next frame it sees the characters eating sushi and thinks they’re in Japan, and at the next frame it sees polar bears and thinks they’re on Hydra Island.

This chaos means information quickly transforms and vanishes, and it’s difficult for the model to keep a long-term memory. So what you’d like is for the network to learn how to update its beliefs (scenes without Bob shouldn’t change Bob-related information, scenes with Alice should focus on gathering details about her), in a way that its knowledge of the world evolves more gently.

It replaces the normal RNN cell and uses an input gate, a forget gate, and an output gate, as well as a cell state.

These gates each have their own set of weight values. The whole thing is differentiable (meaning we can compute gradients and update the weights with them), so we can backprop through it.

We want our model to know what to forget and what to remember. So when a new input comes in, the model first forgets any long-term information it decides it no longer needs. Then it learns which parts of the new input are worth using, and saves them into its long-term memory.

And instead of using the full long-term memory all the time, it learns which parts to focus on instead.

Basically, we need mechanisms for forgetting, remembering, and attention. That’s what the LSTM cell provides us.

Whereas a vanilla RNN uses a single equation to update its hidden state/memory, h_t = tanh(W_x · x_t + W_h · h_(t-1)), the LSTM splits the update across its gates.

Which pieces of long-term memory should we remember and which should we forget? We use the new input and the working memory to learn a remember (forget) gate. Which parts of the new data should we use and save? We update the working memory using an attention (focus) vector.

  • The long-term memory is usually called the cell state.
  • The working memory is usually called the hidden state. This is analogous to the hidden state in vanilla RNNs.
  • The remember vector is usually called the forget gate (despite the fact that a 1 in the forget gate still means to keep the memory and a 0 still means to forget it).
  • The save vector is usually called the input gate (as it determines how much of the input to let into the cell state).
  • The focus vector is usually called the output gate.
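Putting the gates together, a single LSTM step can be sketched as follows (hypothetical sizes and random weights; real implementations such as Keras’s LSTM layer fuse these operations for speed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4  # hypothetical sizes

# One weight matrix and bias per gate, acting on [h_prev, x_t] concatenated.
W = {g: rng.standard_normal((n_hid, n_hid + n_in)) * 0.1
     for g in ("forget", "input", "output", "candidate")}
b = {g: np.zeros(n_hid) for g in W}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["forget"] @ z + b["forget"])    # what to erase from long-term memory
    i = sigmoid(W["input"] @ z + b["input"])      # what new information to save
    o = sigmoid(W["output"] @ z + b["output"])    # what to focus on / expose
    c_tilde = np.tanh(W["candidate"] @ z + b["candidate"])
    c = f * c_prev + i * c_tilde                  # new cell state (long-term memory)
    h = o * np.tanh(c)                            # new hidden state (working memory)
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(np.array([1.0, 0.0, 0.0]), h, c)
print(h.shape, c.shape)  # (4,) (4,)
```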

Use cases




The most popular application right now is actually in natural language processing, which involves sequential data such as words, sentences, sound spectrograms, etc. So: applications in translation, sentiment analysis, text generation, and the like.

In other, less obvious areas there are also applications of LSTMs, such as image classification (feeding in each picture’s pixels row by row), and even DeepMind’s deep Q-learning agents.



The typical implementation outline:

  1. Build an RNN class
  2. Build an LSTM cell class
  3. Write data-loading functions
  4. Training time!



Predicting a rise or fall in Bitcoin’s value using CNN and LSTM

Prerequisites

A function to split the data into train and test sets:

import numpy as np


def shuffle_in_unison(a, b):
    # courtesy: shuffle two arrays with the same random permutation
    assert len(a) == len(b)
    shuffled_a = np.empty(a.shape, dtype=a.dtype)
    shuffled_b = np.empty(b.shape, dtype=b.dtype)
    permutation = np.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b


def create_Xt_Yt(X, y, percentage=0.9):
    # The first `percentage` of the series becomes the (shuffled) training set;
    # the remainder becomes the test set.
    p = int(len(X) * percentage)
    X_train = X[0:p]
    Y_train = y[0:p]

    X_train, Y_train = shuffle_in_unison(X_train, Y_train)

    X_test = X[p:]
    Y_test = y[p:]

    return X_train, X_test, Y_train, Y_test


def remove_nan_examples(data):
    # Keep only the examples that contain no NaN values.
    newX = []
    for i in range(len(data)):
        if not np.isnan(data[i]).any():
            newX.append(data[i])
    return newX

Importing the necessary functions:

import pandas as pd
import matplotlib.pylab as plt


from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.recurrent import LSTM, GRU
from keras.layers import Convolution1D, MaxPooling1D, AtrousConvolution1D, RepeatVector
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, CSVLogger
from keras.layers.wrappers import Bidirectional
from keras import regularizers
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import *
from keras.optimizers import RMSprop, Adam, SGD, Nadam
from keras.initializers import *


import seaborn as sns


Reading the dataset using pandas:

data_original = pd.read_csv(r'C:\Users\indva\Downloads\crypto-analysis\bitcoin\bitcoin.csv')[::-1]  # [::-1] reverses the row order

Output:


Converting attributes of the dataset into lists:

openp = data_original.loc[:, 'Open'].tolist()
highp = data_original.loc[:, 'High'].tolist()
lowp = data_original.loc[:, 'Low'].tolist()
closep = data_original.loc[:, 'Close'].tolist()

Output:

We need to split the input data into window-sized chunks to feed into the CNN and LSTM model. We also normalize the data and define the output labels based on the values of the closing price:

WINDOW = 30     # length of each input chunk (value assumed; not shown in the original)
FORECAST = 1    # how many steps ahead to predict (value assumed)
EMB_SIZE = 4    # number of channels: open, high, low, close
STEP = 1


X, Y = [], []
for i in range(0, len(data_original), STEP):
    try:
        o = openp[i:i+WINDOW]
        h = highp[i:i+WINDOW]
        l = lowp[i:i+WINDOW]
        c = closep[i:i+WINDOW]

        # Normalize each window to zero mean and unit variance.
        o = (np.array(o) - np.mean(o)) / np.std(o)
        h = (np.array(h) - np.mean(h)) / np.std(h)
        l = (np.array(l) - np.mean(l)) / np.std(l)
        c = (np.array(c) - np.mean(c)) / np.std(c)

        x_i = closep[i:i+WINDOW]
        y_i = closep[i+WINDOW+FORECAST]

        last_close = x_i[-1]
        next_close = y_i

        if last_close < next_close:
            y_i = [1, 0]  # label when the closing price increases
        else:
            y_i = [0, 1]  # label when the closing price decreases

        x_i = np.column_stack((o, h, l, c))

        X.append(x_i)
        Y.append(y_i)
    except Exception as e:
        break  # stop once the window runs past the end of the series





Output:

Using the data-splitting function defined before and creating chunks of the input dataset:

X, Y = np.array(X), np.array(Y)
X_train, X_test, Y_train, Y_test = create_Xt_Yt(X, Y)  #Splitting Train and test dataset

# Reshaping the Dataset.

X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], EMB_SIZE))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], EMB_SIZE))

Output:

Defining our one-dimensional convolutional model with LSTM cells:

model = Sequential()
model.add(Convolution1D(input_shape=(WINDOW, EMB_SIZE),
                        filters=16,        # filter count and kernel size are assumed;
                        kernel_size=4,     # the original listing was truncated here
                        padding='same'))
model.add(MaxPooling1D(2))
model.add(LSTM(32, return_sequences=False))  # LSTM cells on top of the convolutional features
model.add(Dropout(0.5))
model.add(Dense(2))
model.add(Activation('softmax'))             # two classes: price up / price down


opt = Nadam(lr=0.002)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])


reduce_lr = ReduceLROnPlateau(monitor='val_acc', factor=0.9, patience=30, min_lr=0.000001, verbose=1)
checkpointer = ModelCheckpoint(filepath=r'C:\Users\indva\Downloads\Algo trading\algo.hdf5', verbose=1, save_best_only=True)


history = model.fit(X_train, Y_train,
                    epochs=400,
                    batch_size=50,
                    validation_data=(X_test, Y_test),
                    callbacks=[reduce_lr, checkpointer])


pred = model.predict(np.array(X_test))



Visualizing our predictions:

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

C = confusion_matrix([np.argmax(y) for y in Y_test], [np.argmax(y) for y in pred])

# Normalize each row so the entries are proportions of the true class.
print(C / C.sum(axis=1, keepdims=True).astype(float))


# Classification
# [[ 0.75510204  0.24489796]
#  [ 0.46938776  0.53061224]]


# for i in range(len(pred)):
#     print(Y_test[i], pred[i])


# The plt.plot calls were missing from the original listing; the history keys
# below assume an older Keras version ('acc'/'val_acc').
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.legend(['train', 'test'], loc='best')
plt.show()

plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.legend(['train', 'test'], loc='best')
plt.show()



