Human Detection in Video Surveillance

Recognition of human activities in video is in growing demand across computer vision applications such as ambient assisted living, intelligent surveillance, and human-computer interaction. Deep learning is one of the leading techniques for human detection in video surveillance, and this project focuses on approaches based on it. This paper proposes a way to use video surveillance more effectively by detecting any humans present and notifying the people concerned. A Convolutional Neural Network (CNN), preferred for its fast computation, is used, with 3 blocks of layers stacked on fully connected layers. The model identifies humans and provides a naïve approach to excluding inanimate human-like objects such as mannequins.


International Journal of Applied Sciences and Smart Technologies, Volume 3, Issue 1, pages 27-34, p-ISSN 2655-8564, e-ISSN 2685-9432

Introduction
Human activity detection is a major problem in smart video surveillance. It is a fundamental problem in computer vision: recognizing human activity in surveillance videos. These applications need real-time detection performance, but detecting the actual activity is generally very time consuming. Since the adoption of CCTV, cases of forced entry and robbery have decreased drastically. However, a delayed response to such cases can still cause problems. If the owner is notified of such events as they happen, the culprit can be caught red-handed. It is therefore important to alert the user by detecting what activity is being performed by the subjects [1][2][3].

Research Methodology
This implementation was carried out using simple programming tools and cloud resources. The Convolutional Neural Network (CNN) is among the most promising networks for working with images and videos, so a CNN-based architecture was a natural and efficient choice.
Implementation Design. The system is implemented with a modified AlexNet design trained on video frames.
Dataset size. Eight videos were used as the dataset.
Sample size calculation. The samples were drawn from multiple videos that satisfy the requirements of the dataset. Each chosen video has an average of 8000 frames, of which about 10% are used. This reduces redundancy in the data.
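The frame selection described above can be sketched as follows. The paper does not say how the 10% of frames are chosen; this minimal example assumes evenly spaced sampling, and the function name sample_frame_indices and its parameters are hypothetical.

```python
from typing import List


def sample_frame_indices(total_frames: int, keep_fraction: float = 0.10) -> List[int]:
    """Return evenly spaced frame indices covering roughly keep_fraction of a video.

    Assumption: frames are sampled at a fixed stride. The source only states
    that about 10% of the frames are used.
    """
    if total_frames <= 0 or not 0 < keep_fraction <= 1:
        return []
    # A 10% fraction corresponds to keeping every 10th frame.
    step = max(1, round(1 / keep_fraction))
    return list(range(0, total_frames, step))


# An 8000-frame video sampled at 10% keeps 800 frames.
indices = sample_frame_indices(8000)
print(len(indices), indices[:5])
```

Neighbouring frames in CCTV footage are nearly identical, so a fixed stride is one simple way to cut redundancy while still covering the whole video.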
Subjects and selection method. The dataset consists of videos taken with CCTV cameras. All of these videos show people trying to break into shops and houses. Some videos also include mannequins, and most were taken at night. The data are labeled according to whether a human is visible.
Grayscale images. This method compresses the three RGB channels into a single channel containing luminance values. Luminance can also be described as brightness or intensity, measured on a scale from black (zero intensity) to white (full intensity). The output is therefore a monochrome image ranging from black to white.
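As an illustration of the grayscale step, the sketch below collapses RGB pixels to a single luminance value. The paper does not state which channel weights were used; the ITU-R BT.601 weights (0.299, 0.587, 0.114) are assumed here, and the function name to_grayscale is hypothetical.

```python
def to_grayscale(rgb_image):
    """Collapse an RGB image to one luminance channel.

    rgb_image is a list of rows, each row a list of (r, g, b) tuples in 0-255.
    Assumption: BT.601 luma weights; the source does not specify the weights.
    """
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_image
    ]


# Black and white map to the ends of the scale; pure red and pure blue
# map to different gray levels because the channels are weighted unequally.
image = [[(0, 0, 0), (255, 255, 255)], [(255, 0, 0), (0, 0, 255)]]
gray = to_grayscale(image)
print(gray)
```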

Most thefts and break-ins occur at night, so the images are dark and unclear. Techniques such as histogram equalization and alpha and beta transformation can be used to brighten an image. We chose histogram equalization. Histogram equalization improves the contrast of the image by spreading out the most frequent intensity values.
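To make the effect of histogram equalization concrete, here is a minimal sketch of the textbook cumulative-distribution mapping applied to a flat list of gray values. It is not necessarily the routine used in the study, and the function name equalize_histogram is hypothetical.

```python
def equalize_histogram(pixels, levels=256):
    """Equalize a flat list of integer gray values in the range [0, levels - 1]."""
    if not pixels:
        return []
    # Count how often each intensity occurs.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Build the cumulative distribution of intensities.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:
        # A constant image has no contrast to stretch.
        return list(pixels)
    # Map each intensity so the cumulative distribution spans the full range.
    scale = (levels - 1) / (total - cdf_min)
    return [round((cdf[p] - cdf_min) * scale) for p in pixels]


# Four dark values crowded into 10..40 are spread across 0..255.
print(equalize_histogram([10, 20, 30, 40]))
```

In a night-time frame most pixels sit in a narrow dark band, so this mapping stretches that band over the full intensity range and makes shapes easier to separate from the background.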
Blurs are used to remove noise from the images. Blurring reduces the sharpness of the image and smooths it. Gaussian blur, the most popular blur, is used for processing. Blurring also helps with edge detection and thresholding.
Thresholding.
The CNN architecture used is a modified AlexNet. The input is a series of 3 consecutive frames, which helps determine whether the entity is a human or a human-like mannequin.

Because of this, each frame in the input stack has its own CNN layers. The features extracted by these CNN layers are concatenated and passed to the fully connected network. The images are classified using a softmax activation layer.
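The data flow described above (three consecutive frames, one branch per frame, concatenation, softmax) can be sketched without a deep learning framework. This is only an illustration of the wiring: the convolutional branches and the fully connected head are replaced by stand-in functions, and all names (frame_triplets, classify_triplet, mean_intensity, toy_head) are hypothetical.

```python
import math


def frame_triplets(frames):
    """Return every run of 3 consecutive frames from a sequence."""
    return [frames[i:i + 3] for i in range(len(frames) - 2)]


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_triplet(triplet, branches, head):
    """Run each frame through its own branch, concatenate, then classify."""
    features = []
    for frame, branch in zip(triplet, branches):
        features.extend(branch(frame))  # concatenation of per-frame features
    return softmax(head(features))


# Stand-ins for the real layers (assumptions, not the paper's model):
def mean_intensity(frame):
    return [sum(frame) / len(frame)]


def toy_head(features):
    score = sum(features) / 255.0
    return [score, -score]  # two logits: [human, no human]


frames = [[10, 20], [30, 40], [50, 60], [70, 80]]
for triplet in frame_triplets(frames):
    print(classify_triplet(triplet, [mean_intensity] * 3, toy_head))
```

Feeding three frames at once lets the classifier see change over time, which is what the text relies on to separate a moving person from a static mannequin.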

Results and Discussion
The model classifies the data with an accuracy of 87%. This accuracy is measured by feeding the model test data containing both positively and negatively labelled images.
The accuracy is the number of correctly predicted labels, positive and negative together, divided by the total number of test samples. The model is also trained so that it does not detect mannequins. The implementation uses an Android GUI to alert the user of the CCTV system. This helps limit the damage caused by a robbery or catch the intruder.
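The accuracy computation described above amounts to the following. The labels in the example are invented for illustration and are not the study's test data; the function name accuracy is hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    # Correct positives and correct negatives are counted together.
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Made-up labels: 1 = human present, 0 = no human.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 6 of 8 correct
```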
The model is fast and efficient, but delays from the cloud and pre-processing hamper performance slightly. This can be mitigated with a faster network connection and faster hardware. Example. Figure 4 shows a correct prediction on the GUI of the system, depicting the notification and alert used.
Environment sensing is the process of detecting a change in the position of an object relative to its surroundings or a change in the surroundings relative to an object. The performance of the system can be enhanced by detecting changes in its surroundings and adapting to them at the same time. For example, if there is any movement in the shop after closing, the system alerts the user to suspicious activity by sending an alert message so that the user can act on it. Our main goal was to detect humans in low visibility at night.

The implementation breaks down as follows:
1. The input video is taken from the video surveillance system.
2. The video is processed frame by frame by the video-processing stage, which is used to detect human activity in the video.
3. The processed video is passed to the network, which detects humans using the CNN model.
4. The output of the model is sent to the user through the application, alerting them to human activity so that they can take action.
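The four steps above can be sketched as a simple loop. Everything here is an assumption made for illustration: the function run_pipeline and its callable parameters are hypothetical, the detector is a stand-in threshold rather than the CNN, and frames are classified one at a time for brevity instead of in stacks of three.

```python
def run_pipeline(frames, preprocess, detect_human, notify):
    """Process frames in order and send an alert for each detection.

    frames       -- iterable of frames from the surveillance video (step 1)
    preprocess   -- callable applied to every frame (step 2)
    detect_human -- callable returning True when a human is found (step 3)
    notify       -- callable that delivers the alert to the user (step 4)
    """
    alerts = 0
    for index, frame in enumerate(frames):
        processed = preprocess(frame)
        if detect_human(processed):
            notify(f"Human detected in frame {index}")
            alerts += 1
    return alerts


# Stand-in components so the sketch runs on its own.
messages = []
frames = [[0, 0], [200, 220], [5, 5]]
count = run_pipeline(
    frames,
    preprocess=lambda frame: frame,              # no-op preprocessing
    detect_human=lambda frame: max(frame) > 128,  # placeholder for the CNN
    notify=messages.append,                       # placeholder for the app alert
)
print(count, messages)
```

Passing the stages in as callables keeps the loop independent of any particular preprocessing routine, model, or notification channel.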

Conclusion
The accuracy of actually catching a robbery was not calculated in this study, but the system should reduce the success rate of such attempts. The purpose of the project is to find techniques for