
=**Introduction**=

Real-time and embedded systems are in widespread use in the modern world.

From the microprocessor controller in a camera, through "smart" traffic lights and production control systems, to large defense systems, and from analog telephony to complex digital communication systems, computer technology is increasingly part of systems that control and respond to their environments in real time.

As the technology has improved, we have come to rely on these systems more and more - we have even put our lives in their hands.
 * Airplanes, biomedical accelerators, nuclear power plants, and the like all depend on real-time control to operate safely.
 * A failure in a control system, such as not responding correctly to faults in the environment, could endanger many lives.

Unfortunately, we have seen a tendency for developers to focus too heavily on the intricacies of the engineering and computer technology, to the detriment of understanding the real-world problem at hand.
 * At best, this wastes time and resources,
 * But at worst it is dangerous in light of the life-critical nature of today's systems.
 * We believe that this misplaced focus results at least partly from the lack of a comprehensive set of modeling tools and techniques fitted to the real-time development environment.

In [|computer science], **real-time computing** (**RTC**), or **reactive computing**, is the study of [|hardware] and [|software] systems that are subject to a "real-time constraint"—i.e., operational deadlines from event to system response. By contrast, a //non-real-time system// is one for which there is no deadline, even if fast response or high performance is desired or preferred. The needs of real-time software are often addressed in the context of [|real-time operating systems], and [|synchronous programming languages], which provide frameworks on which to build real-time application software.

A real time system may be one where its application can be considered (within context) to be [|mission critical]. The [|anti-lock brakes] on a car are a simple example of a real-time computing system — the real-time constraint in this system is the short time in which the brakes must be released to prevent the wheel from locking. Real-time computations can be said to have //failed// if they are not completed before their deadline, where their deadline is relative to an event. A real-time deadline must be met, regardless of [|system load].

==**Hard and soft real-time systems**==
A system is said to be //real-time// if the total correctness of an operation depends not only upon its logical correctness, but also upon the time in which it is performed. The classical conception is that in a **hard real-time** or **immediate real-time system**, the completion of an operation after its deadline is considered useless - ultimately, this may cause a critical failure of the complete system. A **soft real-time system** on the other hand will tolerate such lateness, and may respond with decreased service quality (e.g., omitting frames while displaying a video).
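The distinction can be made concrete in code. The following is a minimal sketch, not taken from any real scheduler, of how the two semantics treat a completion time (all names and numbers are illustrative):

```cpp
#include <cassert>
#include <string>

// Illustrative sketch: how a runtime might treat a missed deadline under
// hard vs. soft real-time semantics. Times are in milliseconds.
enum class Kind { Hard, Soft };

std::string handle_completion(Kind kind, int deadline_ms, int finished_ms) {
    if (finished_ms <= deadline_ms)
        return "ok";                 // deadline met: result is fully usable
    if (kind == Kind::Hard)
        return "system failure";     // a late result is useless by definition
    return "degraded";               // e.g. drop a video frame, keep running
}
```

The point of the sketch is only that lateness maps to total failure in the hard case but to reduced service quality in the soft case.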

Hard real-time systems are used when it is imperative that an event is reacted to within a strict deadline. Such strong guarantees are required of systems for which not reacting in a certain interval of time would cause great loss in some manner, especially damaging the surroundings physically or threatening human lives (although the strict definition is simply that missing the deadline constitutes failure of the system). For example, a [|car] [|engine] control system is a hard real-time system because a delayed signal may cause engine failure or damage. Other examples of hard real-time embedded systems include medical systems such as heart [|pacemakers] and industrial process controllers. Hard real-time systems are typically found interacting at a low level with physical hardware, in [|embedded systems]. Early video game systems such as the [|Atari 2600] and [|Cinematronics] vector graphics had hard real-time requirements because of the nature of the graphics and timing hardware.

Soft real-time systems are typically used where there is some issue of concurrent access and the need to keep a number of connected systems up to date with changing situations; for example software that maintains and updates the flight plans for commercial [|airliners]. The flight plans must be kept reasonably current but can operate to a latency of seconds. Live audio-video systems are also usually soft real-time; violation of constraints results in degraded quality, but the system can continue to operate.

=**Parallelism**=

There are several approaches to accelerating execution speed, and most of them are based on the idea of [|parallelism]. Several techniques are used for the parallel execution of applications: parallelism obtained through software threads, and parallelism obtained by increasing hardware resources, more precisely by adding execution cores. The hardware approach offers better results, but a combination of software and hardware methods can also greatly reduce execution time.
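As a minimal illustration of the software-thread approach mentioned above, the following sketch (standard C++ threads; all names are illustrative) splits a summation across several worker threads:

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Sum a vector in parallel using software threads: each worker computes a
// partial sum over its own chunk, then the partial sums are combined.
long parallel_sum(const std::vector<int>& data, unsigned n_threads) {
    std::vector<long> partial(n_threads, 0);
    std::vector<std::thread> workers;
    size_t chunk = data.size() / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        size_t begin = t * chunk;
        size_t end = (t + 1 == n_threads) ? data.size() : begin + chunk;
        workers.emplace_back([&, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0L);
        });
    }
    for (auto& w : workers) w.join();   // wait for every worker to finish
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```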

Serial processors are no longer improving at the rate they used to. The development of general-purpose processors has turned to multi-core systems, which use shared memory to hide the parallel processors from the programmer. But shared memory cannot scale to hundreds of thousands of cores, so alternative approaches have been developed to achieve that level of parallelism. Supercomputers with massive parallelism have had only modest success so far, because of the difficulty of programming these machines and their high prices, so they remain a niche area. But devices specialized for certain application areas, such as graphics processing units (GPUs), routers, or FPGAs, can contain hundreds of computing cores and are produced successfully worldwide. They improve performance by focusing on a few key elements of an algorithm, at the cost of being insufficiently flexible to be used for arbitrary applications.

==**GPU characteristics versus CPUs**==
In recent years, graphics processing units have made the transition from their original role of specialized accelerators serving mostly the gaming industry to general-purpose computing engines capable of an impressive number of calculations. The figure compares the rates in GFLOPS of NVIDIA graphics cards with the speeds of Intel processors. The chart makes clear that in raw computation GPUs are several hardware generations ahead, which translates into 5 to 10 years.

Figure: Floating-point operations per second and memory bandwidth for the CPU and GPU. From [1]

First of all, in order to see the cause of these differences in performance, it is important to understand the architectural differences between a CPU and a GPU in terms of how instructions and commands are executed. Current processors have up to eight cores, and scaling to many more cores will be difficult because of technical obstacles and extremely high power consumption. Each CPU core can operate independently and can execute different instructions for different processes. Unlike the CPU, which operates in MIMD fashion (Multiple Instruction / Multiple Data), GPUs have many more cores (480 for the top-of-the-range GeForce GTX 295), which operate in SIMD fashion (Single Instruction / Multiple Data). The basic idea is that a CPU core is designed to execute a single thread of sequential instructions at maximum speed, whereas the GPU is designed for the rapid execution of many parallel threads of instructions. Besides this parallelization, modern GPUs can issue several instructions per cycle.

Another thing to consider is memory access. Not all CPUs include an integrated memory controller, while all GPUs have several integrated controllers (the GeForce GTX 295, with two GPUs, has a 448-bit memory interface per GPU, 896 bits in total). In addition, graphics processors use faster memory, and as a consequence GPUs have more bandwidth, which can provide a significant advantage in parallel computations over large data streams. This advantage is illustrated in the figure above.

Differences between the CPU and GPU also show up in operations that require multiple threads. The CPU can execute 1-2 threads per core, while the GPU can support up to 1,024 threads per core (see the figure below). Switching from one thread to another costs hundreds of cycles on the CPU, while the GPU can switch between threads every cycle at almost no cost.

In conclusion, unlike modern processors, GPUs are designed for massively parallel arithmetic calculations. CPUs have fewer cores, but those cores run at high frequencies, support a large instruction set, are very flexible, and have a considerable cache that provides quick access to data at random memory addresses. GPUs have many execution cores (the ATI 4870X2 has 2 × 800 cores, arranged on two chips) running at lower frequencies; the cores are simple and do not offer the flexibility of the CPU, and the cache is small, so memory access is fast only for consecutive memory addresses.

Significant for the comparison between the CPU and GPU is that for both of them the ratio between the number of operations per second and the memory bandwidth is approximately the same: about 7. This means that, in order to make maximum use of the computing power of a CPU or a GPU, memory accesses must be much less frequent than operations on data. For this reason, the performance of a computation is said to be either memory bound or CPU/GPU bound.
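The ratio in question is simple arithmetic; the sketch below only makes it explicit. The sample numbers are round figures chosen for illustration, not vendor specifications:

```cpp
// Operations-per-byte ratio discussed above: peak arithmetic rate divided
// by memory bandwidth. A device sustaining 700 GFLOPS over a 100 GB/s
// memory bus must perform about 7 operations per byte fetched, or it
// becomes memory bound.
double ops_per_byte(double gflops, double gbytes_per_s) {
    return gflops / gbytes_per_s;
}
```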

media type="youtube" key="fKK933KK6Gg" height="385" width="480"

A very intuitive experiment showing the advantages of GPUs over CPUs is presented in the video above.

==**Programming GPUs**==
Computing is evolving from "central processing" on the CPU to "co-processing" on the CPU and GPU. To enable this new computing paradigm, NVIDIA invented the CUDA parallel computing architecture that is now shipping in GeForce, ION, Quadro, and Tesla GPUs, representing a significant installed base for application developers.

CUDA is NVIDIA’s parallel computing architecture that enables dramatic increases in computing performance by harnessing the power of the GPU (graphics processing unit). CUDA (an acronym for Compute Unified Device Architecture) is a compiler and a set of development tools for programs designed to run on NVIDIA GPUs. With the introduction of CUDA, the GPU can be used for a range of activities that require high computing power. The programming language is C with a set of additional extensions, so CUDA inherits both C's advantages and its disadvantages. The main advantage is the simplicity and spread of this language among programmers, which greatly facilitates the development of new CUDA applications. Through bindings for other standard programming languages such as Java, Fortran, and Python, the CUDA architecture gives developers further options, thereby contributing to the development of applications on the GPU. At present, CUDA is present on more than 100 million GPUs, improving performance in a wide variety of applications.

There are some limitations to this development kit [40]. There are deviations from the IEEE 754 floating-point standard, and rendering to textures is not supported. Only non-recursive functions can be implemented, and threads should run in groups of at least 32 to obtain substantial performance. But the performance of this technology is most affected by the communication latency between the CPU and the GPU. These disadvantages are disappearing as CUDA develops: in CUDA 2.2, a paradigm shift was achieved by introducing APIs for transparent data transfer between the processor and the graphics card. The latest version of the technology makes other improvements as well.

CUDA includes two libraries: CUBLAS and CUFFT. CUBLAS is an implementation of BLAS (Basic Linear Algebra Subprograms) that provides access to the computational resources of NVIDIA GPUs. Including the library's API in a project is sufficient; no direct interaction with the CUDA driver is necessary. The library contains CUDA implementations of optimized operations on matrices and vectors.

==**GPU applications**==
Applications that can be improved using the computational power of GPUs have certain characteristics. [|Stream processing] — in [|parallel processing], especially in graphics processing, the term //stream// is applied to [|hardware] as well as [|software]. There it denotes the quasi-continuous flow of data that is processed in a [|dataflow programming] language as soon as the [|program state] meets the starting condition of the stream. Stream processing is especially suitable for applications that exhibit three characteristics:
 * **Compute intensity**, the number of arithmetic operations per I/O or global memory reference. In many signal processing applications today it is well over 50:1 and increasing with algorithmic complexity.
 * **Data parallelism** exists in a kernel if the same function is applied to all records of an input stream and a number of records can be processed simultaneously without waiting for results from previous records.
 * **Data locality** is a specific type of temporal locality common in signal and media processing applications, where data is produced once, read once or twice later in the application, and never read again. Intermediate streams passed between kernels, as well as intermediate data within kernel functions, can capture this locality directly using the stream processing programming model.

The area of applicability has grown significantly in the last 3 years; the video below presents some of the latest domains.

media type="youtube" key="ZOGLkl9cFPw" height="385" width="640"

CUDA has been enthusiastically received in the area of scientific research. For example, CUDA now accelerates AMBER, a molecular dynamics simulation program used by more than 60,000 researchers in academia and pharmaceutical companies worldwide to accelerate new drug discovery. In the financial market, Numerix and CompatibL announced CUDA support for a new counterparty risk application and achieved an 18X speedup; Numerix is used by nearly 400 financial institutions. In the consumer market, nearly every major consumer video application has been, or will soon be, accelerated by CUDA, including products from Elemental Technologies, MotionDSP and LoiLo, Inc.
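The data-parallelism characteristic can be sketched in a few lines: the same kernel function is applied independently to every record of an input stream, so no record waits on another. The kernel here (a simple gain applied to each sample) and all names are illustrative, not taken from any real stream-processing framework:

```cpp
#include <algorithm>
#include <vector>

// Data-parallel pattern: one kernel function, applied independently to
// every record of the stream. Because no record depends on the result of
// another, a GPU could map one thread to each record.
std::vector<float> run_kernel(const std::vector<float>& stream, float gain) {
    std::vector<float> out(stream.size());
    std::transform(stream.begin(), stream.end(), out.begin(),
                   [gain](float x) { return gain * x; });
    return out;
}
```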

An indicator of CUDA adoption is the ramp of the Tesla GPU for GPU computing. There are now more than 700 GPU clusters installed around the world at Fortune 500 companies, ranging from Schlumberger and Chevron in the energy sector to BNP Paribas in banking. And with the recent launches of Microsoft Windows 7 and Apple Snow Leopard, GPU computing is going mainstream. In these new operating systems, the GPU will not only be the graphics processor, but also a general-purpose parallel processor accessible to any application. Moreover, the NVIDIA website hosts a showcase of GPU computing applications developed on the CUDA architecture by programmers, scientists, and researchers around the world.

=**Applications – Handwriting recognition: GPU vs CPU**=

Handwriting recognition is an important direction of artificial intelligence, with a large and active academic community studying it. Its importance is due to the numerous areas of application that handwriting recognition has. Lately there has been a growing number of systems both for handwriting recognition in the automated processing of paper documents ("offline" recognition) and for new modes of human-machine interaction based on a pen and touch screen ("online" recognition). Recognition is therefore classified into two categories. The first, "online" (or dynamic) recognition, requires a device with a sensor that detects the movement of the writing instrument. Creating or integrating such devices has become cheap and accessible, and online recognition is used by many people. Some examples of such devices are handheld PDAs (Palm, Pocket PC), tablet PCs (computers without a keyboard, with a touch screen only), and smartphones (the latest generation of mobile phones, which integrate several modes of user interaction). The second category, "offline" recognition, involves processing a scanned image of a paper document into a form usable by word-processing software on a computer. The data (characters) obtained in this way can be considered a static representation of handwriting.

The implemented application recognizes handwritten digits and numbers. The application was implemented for the CPU and also for the GPU, in order to compare the results obtained. The algorithm used to implement the application, called SMO (Sequential Minimal Optimization), trains Support Vector Machines (SVMs). SVMs are a particular class of algorithms characterized by the use of kernels, the absence of local minima, a sparse solution, and a capacity control obtained by acting on the margin or on other independent parameters, such as the number of support vectors. They were invented by Boser, Guyon, and Vapnik and were first introduced at the Computational Learning Theory (COLT) conference in 1992, although all of these features had been present and used in machine learning since the 1960s.
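For orientation, here is a hedged sketch of the decision function that such a trained SVM evaluates, f(x) = Σᵢ αᵢyᵢK(sᵢ, x) + b, where the sᵢ are the support vectors. The RBF kernel and every value below are assumptions chosen for illustration, not taken from the implemented application:

```cpp
#include <cmath>
#include <vector>

// RBF kernel K(a, b) = exp(-gamma * ||a - b||^2), a common SVM kernel.
double rbf(const std::vector<double>& a, const std::vector<double>& b,
           double gamma) {
    double d2 = 0.0;
    for (size_t i = 0; i < a.size(); ++i)
        d2 += (a[i] - b[i]) * (a[i] - b[i]);
    return std::exp(-gamma * d2);
}

// SVM decision function: f(x) = sum_i (alpha_i * y_i) * K(s_i, x) + b.
// alpha_y[i] stores the product alpha_i * y_i; sign(f) gives the class.
double svm_decision(const std::vector<std::vector<double>>& sv,
                    const std::vector<double>& alpha_y,
                    double b, double gamma,
                    const std::vector<double>& x) {
    double f = b;
    for (size_t i = 0; i < sv.size(); ++i)
        f += alpha_y[i] * rbf(sv[i], x, gamma);
    return f;
}
```

SMO's contribution, not shown here, is finding the αᵢ and b efficiently by optimizing two coefficients at a time.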

**Multiclass SVM**
SVM classifiers are basically binary classifiers. Rarely in real life are situations encountered in which the classified data falls into only two classes, so scientists' attention was directed toward finding ways to generalize this type of classifier. There are two types of strategies for multiclass SVM: the first is known as the single-machine approach, in which all data is considered in one optimization formulation, and the second consists of building and combining several binary classifiers. The strategies developed along these lines are: "one-against-all" (OAA), "one-against-one" (OAO), "all-and-one" (A&O), directed acyclic graph SVM (DAGSVM), tree-based methods, and methods based on error-correcting output codes (ECOC). Of the methods listed, the most common are OAA and OAO. The OAA method, because it discriminates each class against all the others, often leads to the estimation of complex discriminant functions. The OAO method decomposes the original problem into smaller binary-class problems; however, the number of SVM classifiers needed is n(n-1)/2, which can slow down classification, especially when n is very large. The recently proposed A&O method improves on the classification accuracy of OAA and eliminates the wrong votes of OAO, but it needs the binary SVMs of both OAA and OAO, which makes its classification step slower than OAA's. Jonathan Milgram, in his article //"One Against One" or "One Against All": Which One is Better for Handwriting Recognition with SVMs?//, analyses the two methods. According to the literature, it seems almost impossible to conclude which of these two methods is better for handwriting recognition. The conclusion reached in the study mentioned above is that the OAA method seems to have higher recognition accuracy for digits, while the difference is more pronounced for uppercase letters. The OAO method is faster to train and seems preferable for problems with large numbers of classes.
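The classifier counts quoted above are easy to check: OAA builds one binary machine per class, while OAO builds one per pair of classes.

```cpp
// Number of binary SVMs needed by each multiclass strategy.
int oaa_classifiers(int n_classes) { return n_classes; }
int oao_classifiers(int n_classes) { return n_classes * (n_classes - 1) / 2; }
```

For the 10 digit classes, OAA trains 10 machines while OAO trains 45, which is why OAO's per-classifier problems must be cheap for the strategy to pay off.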

**“One-against-all” strategy**
This paper addresses the OAA method for a multiclass SVM-based classifier. It consists of building one SVM for each class, trained to distinguish between that class and all of the remaining classes. Typically, an unknown pattern is assigned to the class whose SVM produces the largest output. The number of SVMs built is equal to the number of classes. They are all trained on the same set of images: the images belonging to the class being identified are the positive examples, and all others are negative examples. The strategy will be illustrated concretely in a later chapter, where snippets of the application will also be shown. After mapping each SVM output to a probability defined by a sigmoid function, the task is to find expressions for the global posterior probability P(ωj | x) as functions of the local posterior probabilities P(ωj | fj(x)), where fj(x) denotes the output of the SVM trained to distinguish class ωj from the other classes. Several methods have been proposed in the literature to implement the one-against-all approach, the distinction between them being how they combine these probabilities. The figure below shows the two variants for three classes: binary OAA is represented by thin lines, and the class division for continuous OAA is marked with bold continuous lines.

Fig.1 Classifying a problem using binary OAA and continuous OAA

**“One-against-one” strategy**
The “one-against-one” strategy is known in the literature as "pairwise coupling", "all pairs", or "round robin", with slight differences between them. The name comes from the fact that it builds an SVM classifier for each pair of classes, so if there are M classes, the training phase produces M(M-1)/2 binary classifiers (Fig. 2). Classification of new data that the system did not see during training is usually performed using the "max-wins" voting strategy: all M(M-1)/2 classifiers are run, each corresponding decision function grants one "vote" to one of its two classes, and finally the class that gathers the most votes is the class to which the new example is assigned. Another classification strategy, proposed by Platt, uses directed acyclic graphs (DAG). It avoids going through all the classifiers in the decision stage by building an acyclic graph with M(M-1)/2 internal nodes and M leaves, which represent the classes. Testing starts at the root node and traverses the graph along the left or the right branch, depending on the output of the current node.

Fig.2 Separating a problem into classes using OAO

Each method has advantages and disadvantages compared to the other. As learning theory demonstrates, one problem of the OAA approach is that its performance may be compromised by an unbalanced training set (Gualtieri and Cromp, 1998): separating one class from all the rest, which retain a certain density in the training set, can become a difficult problem. Also, the M subproblems corresponding to the classifiers are optimization problems defined over the entire training set, and for a large set this can lead to too much computation. OAO, on the other hand, has to build more classifiers, so it may seem to require more computational effort; but experiments show that OAO can be much faster for problems with many classes, because the problems it must solve are each simple and their training sets are smaller than in OAA [21]. The accuracy of the two methods is also subject to debate; results and opinions are divided depending on the SVM parameter values and on the training and test sets.
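The max-wins voting rule for OAO can be sketched as follows. Since the real SVM decision functions are not reproduced in this text, a precomputed table of pairwise winners stands in for them; all names are illustrative:

```cpp
#include <vector>

// "Max-wins" voting for one-against-one classification. For i < j,
// winner[i][j] holds the class (either i or j) preferred by the binary
// classifier trained on the pair (i, j); in a real system this entry
// would come from evaluating that classifier's decision function.
int max_wins(const std::vector<std::vector<int>>& winner, int n_classes) {
    std::vector<int> votes(n_classes, 0);
    for (int i = 0; i < n_classes; ++i)
        for (int j = i + 1; j < n_classes; ++j)
            ++votes[winner[i][j]];       // one vote per pairwise classifier
    int best = 0;
    for (int c = 1; c < n_classes; ++c)
        if (votes[c] > votes[best]) best = c;
    return best;                         // class with the most votes
}
```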

**Comparative look at the two strategies**
In the research articles concerning the methods presented above, the authors were relatively unanimous that no clear winner can be declared: each method shows some superiority in particular applications. In Jonathan Milgram’s article the two methods are compared from several perspectives.

Regarding training complexity, it would seem logical that the OAO method takes longer to train, because it requires training more binary classifiers. This is not necessarily true when the binary classifiers are SVMs: for SVMs, training time increases significantly with the number of images. Thus, it is easier to train n(n-1)/2 classifiers, because each represents a relatively easy problem to solve, than to train the classifiers of the OAA method. The times obtained in Milgram’s article are relevant: when training with images of letters, the training time of the algorithm implemented with the OAO method was 12 times shorter than the time obtained with OAA. When training with images of digits, the OAO times were 50 times better.

Regarding the complexity of decision making, the situation is about the same. Although it would seem logical that the decision process for OAO takes longer because of the larger number of classifiers, which involves evaluating a greater number of decision functions, for SVMs this is not necessarily true. The complexity of a decision function is directly proportional to its number of support vectors, and although the multiclass decision process is more complicated, its overall complexity remains proportional to the total number of support vectors. In the experiments carried out in the article quoted above, the OAA implementation uses more support vectors than the OAO implementation (48% more for digits and 21% more for letters). Another perspective for comparing the two methods concerns recognition accuracy: the error rate obtained in the article with the OAA method is 0.63% for digits, as opposed to 0.7% for digits with OAO.

**General description of the application**
For application development we used C++ as the programming language. One reason we chose C++ is its similarity in language and syntax with the code for the graphics card, which makes the transition to GPU programming easier. The programming environment used is Microsoft Visual Studio 2005, but the application can also be run on other platforms such as UNIX, since the GUI built for the application is independent of the operating system. The interface, developed in Java, will be presented in a later chapter. The dataset used for testing belongs to the MNIST database; the structure of the files constituting the database is presented below. The databases contain individual handwritten characters, the first composed of digits and the second of letters. The digit database contains 60,000 images for the training set and 10,000 images for the test set, and the letter database likewise provides training and test sets.

The application has two parts: one in which the SVMs are only trained and tested on the training set to obtain the percentage achieved, and a second part in which the trained SVMs can be tested on a set of images completely unknown to them, distinct from those used in training. As for the file structure, the application consists of four C++ files, divided into two parts by functionality. The two basic features of the application are therefore the training of the SVMs and the classification of a set of test data taken from the MNIST database; the corresponding functionalities are called smoLearn and smoClassify. To give a better picture of the application, the next figure shows its use cases. From the interface, the user can choose between two options: training the SVMs or classifying data retrieved from the MNIST database. The user can also choose to test on the sets of digits or the sets of letters, and of course the number of training and testing images can be set. There are other parameters that can be chosen by the user, but they will be described in detail in the section on the user interface.

==**Implementation details**==

**CPU application**
For the CPU application, three methods were implemented: OAA, OAO, and DAG.

Fig.3 Common interface for the three methods

The last stage of the development of this project was building a graphical interface for these algorithms. The presence of the CUBLAS library in the project restricted the options for such an interface: graphical toolkits in C++ or C# were not viable because of incompatibilities, so under these conditions we chose a Java implementation, for which CUBLAS bindings exist under the name JCUBLAS. The interface was implemented in Java using the NetBeans 5.5.1 IDE and, among other features, it can show the training process in real time.

Fig.4 Graphical interface in Java

**GPU application – CUBLAS implementation**
The basic model for creating an application with CUBLAS is: allocate matrix and/or vector objects in graphics memory, populate them with data, call a sequence of CUBLAS functions, and finally retrieve the result. To use these functions, the cublas.h header must be included. This is the programming model we used for the GPU. On the GPU we implemented the multiplication of the image matrix by a vector representing an image. The image matrix is very large, with the number of rows equal to the training-set size and the number of columns equal to the number of features (784), and this multiplication is the most expensive portion of the algorithm.

Fig.5 Non-parallelized implementation of the function

The next figure shows the CUBLAS implementation of the function presented above.

Fig.6 CUBLAS implementation of the function

**Experiments**
Experiments are important for assessing the quality of the implementation of a learning algorithm. There are two stages in the development of learning techniques: the training stage and the testing stage. Experiments aimed at the training phase primarily measure training time. SVMs have the remarkable property of learning well even from fewer items, but their training time is really high; therefore the experiments will measure the training time and the recognition rate obtained. A short description of the database used to train and test the algorithm follows. The MNIST database (Modified NIST) was created by Corinna Cortes and then amended by Yann LeCun, who centered the images on their center of mass. It contains two sets of labeled images: a training set composed of 60,000 images of digits, and a test set of 10,000 images. As the name suggests, it is a modification of NIST's special databases. The black-and-white NIST images were resized to 20 × 20 pixels while maintaining their appearance; in the resizing process the images became grayscale. Each 20 × 20 pixel image was then placed in a 28 × 28 pixel image by aligning its center of mass with the center of the larger image.
**MNIST database**

The MNIST database contains images of handwritten digits from NIST SD-3 and SD-1. After several experiments with these databases, Yann LeCun found that SD-3 is much easier to recognize than SD-1, because SD-3 contains images written by staff from the census office while SD-1 contains images of digits written by high school students. Since the results of a classification system should not be influenced by who wrote the digits, it was decided to create a new database by mixing data from SD-1 and SD-3. For this, the training set was built using 30,000 images from SD-1 and 30,000 images from SD-3, and the test set uses 5,000 images from SD-1 and 5,000 from SD-3. The training set contains data from approximately 500 people, and the sets of people whose digit images were used to construct the training and test sets are disjoint. Since in SD-1 the digit images of one person are not stored together, the information in the "class" files (which contain data about the origin of each digit image) was used to rearrange the digits according to the person who wrote them.

The popularity of this dataset is justified by its two main features. The first is that, unlike most available datasets, the number of items is sufficiently large, and yet the whole dataset can easily fit into a modern computer's RAM. The second feature is that models trained on the MNIST dataset can be evaluated intuitively, because people are very good at identifying handwritten digits. Currently, the best models achieve generalization errors between 0.5% and 1.0%, but they suffer from overfitting, because their parameters were tuned against the test set. MNIST contains a training set of 60,000 digit images and a test set of 10,000 images. As the name suggests, the database was made up of black-and-white NIST images; the images were resized to 20 × 20 pixels and positioned inside 28 × 28 pixel images through their center of mass.
The representation is in grayscale. The database is structured in four files:
 * train-images-idx3-ubyte: training set images
 * train-labels-idx1-ubyte: training set labels
 * t10k-images-idx3-ubyte: test set images
 * t10k-labels-idx1-ubyte: test set labels

As is apparent, the files are of IDX type, a format used to store one-dimensional and multidimensional arrays. The type of the items can be chosen from the most common numeric types, but within one file all elements must be of the same type. The structure of such a file is as follows:
 * a magic number
 * number of elements in dimension 0
 * number of elements in dimension 1
 * number of elements in dimension 2
 * ...
 * number of elements in dimension N
 * the data

The magic number is a 4-byte integer. The first two bytes are zero. The third byte encodes the type of the stored data:
 * 0x08: unsigned byte,
 * 0x09: signed byte,
 * 0x0B: short (2 bytes),
 * 0x0C: int (4 bytes),
 * 0x0D: float (4 bytes),
 * 0x0E: double (8 bytes).

The fourth byte specifies the number of dimensions of the array. The data are stored in C order, with the last index changing fastest. All the integers are 4-byte values in MSB-first (big-endian, non-Intel) order: the most significant byte is stored first. The files making up the MNIST data sets are a special case of the IDX format described above. In the tables below, labels take values between 0 and 9 (the digit) and pixels values between 0 (white) and 255 (black). The four MNIST files are structured as follows:
 * 1) **train-images-idx3-ubyte**
 * Offset || Type || Value || Description ||
 * 0000 || 32-bit integer || 0x00000803 (2051) || magic number ||
 * 0004 || 32-bit integer || 60000 || number of images ||
 * 0008 || 32-bit integer || 28 || number of rows ||
 * 0012 || 32-bit integer || 28 || number of columns ||
 * 0016 || unsigned byte || ?? || pixel ||
 * 0017 || unsigned byte || ?? || pixel ||
 * xxxx || unsigned byte || ?? || pixel ||
 * xxxx || unsigned byte || ?? || pixel ||


 * 2) **train-labels-idx1-ubyte**
 * Offset || Type || Value || Description ||
 * 0000 || 32-bit integer || 0x00000801 (2049) || magic number ||
 * 0004 || 32-bit integer || 60000 || number of items ||
 * 0008 || unsigned byte || ?? || label ||
 * 0009 || unsigned byte || ?? || label ||
 * xxxx || unsigned byte || ?? || label ||
 * xxxx || unsigned byte || ?? || label ||


 * 3) **t10k-images-idx3-ubyte**
 * Offset || Type || Value || Description ||
 * 0000 || 32-bit integer || 0x00000803 (2051) || magic number ||
 * 0004 || 32-bit integer || 10000 || number of images ||
 * 0008 || 32-bit integer || 28 || number of rows ||
 * 0012 || 32-bit integer || 28 || number of columns ||
 * 0016 || unsigned byte || ?? || pixel ||
 * 0017 || unsigned byte || ?? || pixel ||
 * xxxx || unsigned byte || ?? || pixel ||
 * xxxx || unsigned byte || ?? || pixel ||


 * 4) **t10k-labels-idx1-ubyte**
 * Offset || Type || Value || Description ||
 * 0000 || 32-bit integer || 0x00000801 (2049) || magic number ||
 * 0004 || 32-bit integer || 10000 || number of items ||
 * 0008 || unsigned byte || ?? || label ||
 * 0009 || unsigned byte || ?? || label ||
 * xxxx || unsigned byte || ?? || label ||
 * xxxx || unsigned byte || ?? || label ||
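The IDX layout described above can be read with a few lines of standard-library Python. This is a minimal sketch, assuming well-formed input; it decodes the big-endian magic number, the per-dimension sizes, and the data:

```python
import struct

# map from the magic number's type byte to a struct format and element size
IDX_TYPES = {
    0x08: ("B", 1),  # unsigned byte
    0x09: ("b", 1),  # signed byte
    0x0B: ("h", 2),  # short (2 bytes)
    0x0C: ("i", 4),  # int (4 bytes)
    0x0D: ("f", 4),  # float (4 bytes)
    0x0E: ("d", 8),  # double (8 bytes)
}

def read_idx(data: bytes):
    """Parse an IDX byte string into (dims, flat tuple of elements)."""
    zero1, zero2, type_code, ndim = struct.unpack(">BBBB", data[:4])
    assert zero1 == 0 and zero2 == 0, "first two bytes of the magic number must be zero"
    fmt, size = IDX_TYPES[type_code]
    # one big-endian 32-bit size per dimension, right after the magic number
    dims = struct.unpack(">" + "i" * ndim, data[4:4 + 4 * ndim])
    n = 1
    for d in dims:
        n *= d
    body = data[4 + 4 * ndim:]
    # C order: the last index changes fastest
    values = struct.unpack(">" + fmt * n, body[:n * size])
    return dims, values
```

For train-images-idx3-ubyte the magic number 0x00000803 decodes as type byte 0x08 (unsigned byte) and dimension byte 3.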

The IDSIA collection was gathered using NIST-style forms completed by students from SUPSI and people from IDSIA. IDSIA1 was extracted from the black-and-white scans, while IDSIA2 was extracted using the color information; its digits are much harder to segment because they were often written over the edges of the form fields. Even so, experiments show that IDSIA2 is more recognizable than IDSIA1, with better-quality digits. This is due to the improved extraction method, which uses the color of the form lines; the people who filled in the forms were asked to write in a color other than black.

The CPU version of the application serves as a baseline for comparing execution times and recognition rates with those obtained on the GPU. To optimize the CPU version we used an error cache and a vector holding the Lagrange multipliers whose examples are "non-bound." As the tables below show, recognition rates for digits are much higher than for letters. This follows from the method chosen to implement the SVMs, namely One-Against-All (OAA). Better results are also obtained when training and testing are done on a balanced set of letters. As one might expect, training on digits is much shorter than on letters: for digits there are 10 classifiers, while for letters there are 26. SVM execution time grows steeply with the number of images in the training set and with the number of classifiers. The second table presents the results for the CUBLAS variant of the application, which obtains the best times.
 * **OAA strategy**
 * **CPU version**
 * Number of images || Recognition rate (%), digits || Recognition rate (%), letters || Time (s), digits || Time (s), letters || C parameter ||
 * 500 || 82.50 || 53.67 || 539.43 || 2579.34 || 10 ||
 * 1000 || 83.18 || 58.98 || 1756.15 || 4879.22 || 10 ||
 * 2000 || 83.79 || 62.25 || 4573.97 || 8709.32 || 10 ||
 * 3000 || 84.25 || 68.43 || 9873.34 || 12436.32 || 10 ||

 * **CUBLAS version**
 * Number of images || Recognition rate (%), digits || Recognition rate (%), letters || Time (s), digits || Time (s), letters || C parameter ||
 * 500 || 82.40 || 54.06 || 4.1 || 7.62 || 10 ||
 * 1000 || 83.2 || 59.27 || 12.09 || 41.31 || 10 ||
 * 2000 || 83.79 || 62.20 || 243 || 939.43 || 10 ||
 * 3000 || 85.25 || 67.56 || 479 || 1404.26 || 10 ||
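The One-Against-All scheme behind these tables can be sketched as follows (a hypothetical illustration, not the thesis code): one binary "class k vs. the rest" classifier is trained per class, and the predicted class is the one whose decision function scores highest.

```python
import numpy as np

def oaa_predict(decision_values):
    """One-Against-All decision rule. decision_values has shape
    (n_samples, n_classes), one column per 'class k vs. the rest'
    binary classifier; the winning class is the highest-scoring one."""
    return np.argmax(decision_values, axis=1)

# With 10 digit classes we need 10 classifiers; with 26 letters, 26.
scores = np.array([[ 0.2, 1.5, -0.3],
                   [-1.0, 0.1,  0.9]])
labels = oaa_predict(scores)  # winning class index for each sample
```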

This CUBLAS version obtains the best times because the computation that dominates execution is performed once, at the beginning of learning: the calculations are done in parallel by the graphics card, and all the data are loaded onto the card a single time. The steps required to perform computations on the GPU are:
 * load the data onto the GPU,
 * run the computations in parallel,
 * copy the results back to the CPU.
When working with graphics cards, the operations that are most costly in terms of execution time are accesses to the card, that is, moving data between the card and the host processor. In this CUBLAS version these transfers happen no more than once, which explains the very good times we obtained; the on-card computation itself also takes little time, because it is parallelized. The only disadvantage of this method is the limit it places on the number of images in the training set, due to the card's memory: the matrix structures used can reach a maximum size of 60,000 × 60,000, but a 1 GB card can store at most a matrix of about 15,000 × 15,000. These problems do not occur in the second CUBLAS variant. Further CUBLAS and CUDA variants were also developed for this strategy; a comparison between them is presented below, and the chart following it outlines the relative times obtained for the three GPU versions.
 * Application || Number of images || Recognition rate (%) || Time (seconds) || Speed-up vs. CPU ||
 * CUBLAS 1 || 1000 || 83.2 || 12.09 || 145.25x ||
 * CUBLAS 2 || 1000 || 83.38 || 507.65 || 3.45x ||
 * CUDA || 1000 || 83.38 || 517.27 || 3.39x ||
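The key point above, doing the heavy computation once at the start of learning, can be illustrated with a precomputed kernel (Gram) matrix. In this sketch NumPy stands in for CUBLAS; the single large matrix product is exactly the kind of GEMM the card accelerates.

```python
import numpy as np

def precompute_gram(X):
    """One big matrix product up front -- a GEMM, the operation CUBLAS
    runs on the card: K[i, j] = <x_i, x_j> for every training pair."""
    return X @ X.T

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 784))  # 100 flattened 28x28 images
K = precompute_gram(X)
# Training iterations afterwards only index into K, so no further large
# products or host-device transfers are needed. Memory, however, grows
# quadratically with the training set: a 60000 x 60000 float matrix is
# why a 1 GB card tops out near 15000 x 15000.
```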


 * **OAO strategy**


 * Training set size || CPU time (s) || GPU time (s) || CPU recognition rate (%) || GPU recognition rate (%) ||
 * 10 images || 0.48 || 1.605 || 62.11 || 61.50 ||
 * 100 images || 5.43 || 15.183 || 85.28 || 85.31 ||
 * 300 images || 627.412 || 129.188 || 88.37 || 88.24 ||
 * 500 images || 2574.730 || 403.388 || 89.63 || 89.83 ||
 * 1000 images || 19162 || 1806 || 90.86 || 90.73 ||
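The One-Against-One strategy behind this table trains one classifier for every pair of classes and predicts by majority vote. A minimal sketch, with a hypothetical `pairwise_winner` standing in for the trained binary SVMs:

```python
from itertools import combinations
from collections import Counter

def oao_predict(pairwise_winner, classes):
    """One-Against-One decision rule. pairwise_winner(a, b) returns the
    class (a or b) chosen by the binary classifier trained on that pair;
    the final label is the class that wins the most duels."""
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# With 10 digit classes, OAO needs 10 * 9 / 2 = 45 pairwise classifiers,
# versus 10 for OAA -- each one trains on only two classes' data, though.
n_pairs = sum(1 for _ in combinations(range(10), 2))
```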

=Conclusions=

In this era of ubiquitous and pervasive systems, real-time systems are a very important field. Since CPUs no longer get faster at the rate they used to, there is a shift toward parallel programming, and more and more systems are developed this way. For many applications GPUs can improve execution time by a factor of 100, which is why they are being used more and more intensively.

=References=
 * 1) NVIDIA CUDA Programming Guide [[file:NVIDIA_CUDA_Programming_Guide_2.2.pdf]]
 * 2) Presentation [[file:ParallelismRealTime.ppt]]