threaddb 1.0
A file mapped memory container extension
Release 1.0 18 September 2019
Copyright (C) 2019 The ThreadDB Project
The motivation for implementing ThreadDB was the recurring need to handle large amounts of data in the environment of IC manufacturing. While the std:: containers provided by the STL are powerful and versatile instruments for data management, they require all information to be held in system memory. For large amounts of data this shortcoming leads to conditions where the system runs out of memory. Researching the documentation of available packages, the author did not find a solution fully covering this demand. The available solutions are limited either in database size or in performance. As a result, special solutions had to be implemented for each individual scenario. ThreadDB was born to provide a more general, standardized means of handling really large volumes of data (up to hundreds of gigabytes) efficiently. The library is dedicated to helping developers manage their data in an efficient and simple way. Performance is achieved by utilizing multiple threads and distributing data across different folders or devices. In addition, care was taken that the interface stays simple and easy to integrate into existing C/C++ code. The implementation has been designed for minimal additional main memory and disc overhead.
ThreadDB is available for download in precompiled form for Windows x64 and Linux. The author decided not to release the source code to the public in order to keep better control over the way in which the package is built and utilized. The freely available package is limited in the amount of data and the number of available threads. This is to avoid legal issues (e.g. export control), to encourage users to send feedback and to help track the application cases in which ThreadDB is used. The Windows package has been built using Microsoft Visual Studio 2019. The Linux version has been built on Ubuntu 18.04.1 using g++ 7.4.0.
NOTE: the free ThreadDB version is limited to 100 GB of data and 8 worker threads per process. For legal aspects and warranty, please refer to the disclaimer section of this document.
For feedback, comments or to obtain an unlimited version, please contact the author by mail at thethreaddbproject(at)gmail.com.
The ThreadDB library and header files are available as simple .zip or .tar.gz files for Windows and Linux respectively. After downloading, unpack the package using e.g. WinZip or 7-Zip, or under Linux tar -zxvf. The package contains the required interface header files, documentation and an example for running initial tests on your system. Care was taken that no other 3rd party libraries (like boost) are required.
Integrating the library into your project depends on the IDE you are using. For Microsoft Visual Studio it is necessary to add the ThreadDB "\public" folder provided in your local copy of ThreadDB to the include path of your project. Then include the interface header file threaddbC.h for C or threaddbCPP.h for C++ in your program code. Next, tell the linker to use the threaddb shared library when building your application. To do so, add threaddb.lib to your linker path and linked libraries. For Visual Studio all required settings are available through the Project Settings dialog. For Linux it is necessary to add the include path to your makefile. This can be achieved with the -I compiler option pointing to the "\public" folder of the installation. In addition, specify the linker path via -L and add the library to the build command using -l. Before running your application, also make sure that the threaddb.dll or threaddb.so shared library can be found on your library search path.
To utilize ThreadDB in your application it is first necessary to identify the type of data which should be stored. In general one has to distinguish between fixed and variable length data items. For fixed length items no additional information is required. This follows the general principle of the std:: template containers. Since the size of the stored data items is not stored in the database itself, variable length data requires intrusive handling of the item size. This can be achieved either by a unique identifier that distinguishes different types of data or by a class member providing the item size directly. In both cases two read steps are usually required: the first step reads the length (or type) information of the data item, and the second step reads the data section of the stored item.
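The two read steps for variable length items can be sketched as follows. This is a minimal illustration under stated assumptions and does not use the ThreadDB interface: the names ByteStore, storeItem and recoverItem are hypothetical, and a std::vector<char> stands in for a package that keeps no item boundaries of its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a package: a flat byte sequence without item boundaries.
using ByteStore = std::vector<char>;

// Append a variable length item, prefixed by its size (the intrusive length field).
void storeItem(ByteStore& store, const std::string& payload) {
    const std::uint32_t size = static_cast<std::uint32_t>(payload.size());
    const char* raw = reinterpret_cast<const char*>(&size);
    store.insert(store.end(), raw, raw + sizeof(size));
    store.insert(store.end(), payload.begin(), payload.end());
}

// Read the item starting at 'pos' in two steps and advance 'pos' past it.
std::string recoverItem(const ByteStore& store, std::size_t& pos) {
    std::uint32_t size = 0;
    assert(pos + sizeof(size) <= store.size());
    std::memcpy(&size, store.data() + pos, sizeof(size));  // step 1: length information
    pos += sizeof(size);
    assert(pos + size <= store.size());
    std::string payload(store.data() + pos, size);         // step 2: data section
    pos += size;
    return payload;
}
```

Calling recoverItem repeatedly with the same position variable walks through the items in the order in which they were stored. A type identifier in place of the size field works the same way, provided the size can be derived from the type.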
The handling of the ThreadDB database itself is straightforward. In a first step the database is created using ThreadDB_Create. After this, the desired number of threads has to be started using ThreadDB_NewThread. The function also allows a maximum temporary file size to be specified using MaxFileSize_p. Especially for very large file volumes, copying or transferring data can become problematic. Limiting the file size produces more manageable files that can be transferred in parallel. This option also helps to overcome partition limits on file size; on FAT32, for example, a file size of 4 GB cannot be exceeded. On success ThreadDB_Create returns the generated temporary database filename assigned to the thread in the parameter pFileName_p. After the worker threads have been successfully established, one or more packages can be registered using ThreadDB_NewPackage. A package can be seen as a container which holds the required item data. Individual packages can be processed (nearly) independently, while inserting data items into the same package requires additional synchronization. Each package is assigned a memory buffer which temporarily holds and collects the data items. If the buffer is exceeded, it is written to one of the temporary database files. Because of this optimization, the package has to be synchronized using ThreadDB_Synchronize before its items can be recovered. In the current state of the implementation the package management information is kept permanently in memory. Therefore about 180 bytes of additional memory are consumed for each created package.
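Why a synchronization step is needed before reading can be illustrated with a small model of a buffered package. The sketch is an assumption-laden simplification and not the ThreadDB implementation: the class BufferedPackage is hypothetical, the "file" is a plain string, and the buffer is written out just before an item would no longer fit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical model of one package: a bounded memory buffer in front of a
// "file" (modelled as a string so that the sketch stays self-contained).
class BufferedPackage {
public:
    explicit BufferedPackage(std::size_t bufferSize) : bufferSize_(bufferSize) {
        assert(bufferSize > 0);
    }

    // Collect an item; write the buffer out first if the item would not fit.
    void store(const std::string& item) {
        if (buffer_.size() + item.size() > bufferSize_) flush();
        buffer_ += item;
    }

    // Write pending items out; required before the package is read back.
    void synchronize() { flush(); }

    const std::string& file() const { return file_; }
    std::size_t flushCount() const { return flushCount_; }

private:
    void flush() {
        if (buffer_.empty()) return;
        file_ += buffer_;
        buffer_.clear();
        ++flushCount_;
    }

    std::size_t bufferSize_;
    std::string buffer_;
    std::string file_;
    std::size_t flushCount_ = 0;
};
```

Until synchronize is called, the most recently stored items exist only in the memory buffer, so a reader looking at the file alone would miss them.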
Data can be stored as desired in the destination package(s) using ThreadDB_Store. This function takes the provided data and issues a store request to the first available worker thread. The parameters PackageSize_p and PackageCacheLimit_p specified during the creation of the database define the size of the internal package cache. Since the number of packages can become quite large, PackageCacheLimit_p sets a maximum number of buffered packages. If this limit is exceeded, package buffers are withdrawn and flushed to file. To select the packages to flush, the database uses a write timestamp: packages which have not been accessed for the longest time are purged first. In a worst case scenario, more packages than PackageCacheLimit_p are accessed randomly, which leads to ongoing reading and flushing of package information. This process is usually called "thrashing" and can reduce performance. Which temporary database file the package buffer contents are written to depends on the thread available for processing. This means that the contents of a package are spread across different temporary database files. After the packages have been filled, the interim buffers of the database need to be synchronized by calling ThreadDB_Synchronize.
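The effect of the cache limit and of thrashing can be made visible with a small model. This is only a sketch under assumptions: the class PackageCache is hypothetical, it tracks a logical access counter in place of a real timestamp, and it merely counts how often a buffered package would have to be flushed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical model of a package cache limited to 'limit' buffered packages.
class PackageCache {
public:
    explicit PackageCache(std::size_t limit) : limit_(limit) { assert(limit > 0); }

    // Touch a package. If it is not buffered and the cache is full, the
    // package with the oldest access stamp is evicted (counted as a flush).
    void access(std::uint64_t packageId) {
        ++clock_;
        auto hit = lastAccess_.find(packageId);
        if (hit != lastAccess_.end()) {
            hit->second = clock_;
            return;
        }
        if (lastAccess_.size() == limit_) {
            auto oldest = lastAccess_.begin();
            for (auto cur = lastAccess_.begin(); cur != lastAccess_.end(); ++cur)
                if (cur->second < oldest->second) oldest = cur;
            lastAccess_.erase(oldest);
            ++flushes_;
        }
        lastAccess_.emplace(packageId, clock_);
    }

    std::size_t flushes() const { return flushes_; }

private:
    std::size_t limit_;
    std::uint64_t clock_ = 0;
    std::size_t flushes_ = 0;
    std::unordered_map<std::uint64_t, std::uint64_t> lastAccess_;
};
```

With a limit of two, alternating between two packages causes no flush at all, while cycling through three packages evicts a buffer on nearly every access. The second pattern is the thrashing case described above.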
There are three ways to read back the stored data. The first is continuous streaming through the package items. For stream reading the package needs to be opened using ThreadDB_Open. This creates a read handle of type threadDB_ReadInfo which can then be used for consecutive read operations with ThreadDB_Recover. After reading has finished, ThreadDB_Close should be called to release the allocated space back to the system. The second recovery mode is random access to individual items. Individual item access requires a threadDB_ItemInfo handle. This handle has to be pre-allocated and provided during the ThreadDB_Store operation. The threadDB_ItemInfo filled in this way is then used by the function ThreadDB_RecoverItem to locate and read the selected data item. Please be aware that each stored item that needs to be addressed directly has to have its own copy of the threadDB_ItemInfo handle. A suitable way to keep these handles is one of the std:: containers. The third mode of reading data is a mixture of continuous and random reading. It allows jumping directly to a data item at any location and continuing to read sequentially from there. Each time the ThreadDB_RecoverItem function is called, the given threadDB_ItemInfo is moved on to the next entry. This is useful when a large number of items needs to be handled. In such conditions only every Nth threadDB_ItemInfo needs to be kept in system memory, which reduces memory consumption.
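The third mode, keeping only every Nth handle and reading sequentially from there, can be sketched with a self-contained model. The types Cursor and Package and the function readSparse are hypothetical and only mimic the described behaviour of a handle that moves on to the next entry with each read; they are not part of the ThreadDB interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handle: addresses one item and is moved on by every read.
struct Cursor { std::size_t offset; };

// Hypothetical package holding items with a one-byte length prefix.
class Package {
public:
    // Append an item (at most 255 bytes) and return a cursor addressing it.
    Cursor append(const std::string& item) {
        assert(item.size() <= 255);
        Cursor at{bytes_.size()};
        bytes_.push_back(static_cast<char>(item.size()));
        bytes_ += item;
        return at;
    }

    // Read the item at 'cursor' and advance the cursor to the next item.
    std::string read(Cursor& cursor) const {
        assert(cursor.offset < bytes_.size());
        const std::size_t size = static_cast<unsigned char>(bytes_[cursor.offset]);
        std::string item = bytes_.substr(cursor.offset + 1, size);
        cursor.offset += 1 + size;
        return item;
    }

private:
    std::string bytes_;
};

// Read item 'index' using a sparse index that holds a cursor for every
// 'stride'-th item only.
std::string readSparse(const Package& package, const std::vector<Cursor>& sparse,
                       std::size_t stride, std::size_t index) {
    assert(stride > 0 && index / stride < sparse.size());
    Cursor cursor = sparse[index / stride];    // jump to the nearest kept handle
    std::string item;
    for (std::size_t i = 0; i <= index % stride; ++i)
        item = package.read(cursor);           // continue sequentially
    return item;
}
```

The stride trades memory for read time: a larger stride keeps fewer handles in memory but requires up to stride - 1 extra sequential reads per lookup.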
In cases where the contents of the database need to be relocated to a different location or disc, the routine ThreadDB_RelocateFileTo allows the temporary database files to be copied or moved. The routine is also available while insert and/or read operations are in progress. This is especially useful when e.g. the disc runs out of space. First, the routine ThreadDB_GetFileCount can be used to identify the number N of registered files. The calling process can then use any index in the range [0, N) to select the file to be given a new location. In addition it is possible to get information on the current location of a specific database file by calling the routine ThreadDB_GetDatabaseFileName. Please be aware that the number of database files may exceed the current number of registered threads. This happens when MaxFileSize_p has been exceeded or an existing database was imported. Information about the generated filenames may be helpful in error conditions, e.g. to remove the files and free disc space. It is also possible to relocate database files using multiple threads at once. With an eye on performance it might therefore be helpful to limit the file size using the parameter MaxFileSize_p. If a full copy of the database needs to be created, the following steps are necessary:
RelocationType_p set to eCopyFileTo
To get information about the number of registered packages, ThreadDB_GetPackageCount can be used.
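The difference between copying and moving a database file can be illustrated with standard C++ file operations. This is a hedged sketch and not how ThreadDB relocates files internally: the enum Relocation and the functions copyFile and relocateFile are hypothetical, and the sketch ignores concurrent access to the file.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical counterpart to the copy/move choice described above.
enum class Relocation { Copy, Move };

// Copy the bytes of 'source' to 'target'; returns false on a stream error.
bool copyFile(const std::string& source, const std::string& target) {
    std::ifstream in(source, std::ios::binary);
    if (!in) return false;
    std::ofstream out(target, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    // operator<< sets failbit when nothing is inserted, so handle empty files first.
    if (in.peek() == std::ifstream::traits_type::eof()) return true;
    out << in.rdbuf();
    out.flush();
    return static_cast<bool>(out);
}

// Copy or move 'source' to 'target'; returns true on success.
bool relocateFile(const std::string& source, const std::string& target, Relocation mode) {
    assert(source != target);
    if (mode == Relocation::Copy) return copyFile(source, target);
    if (std::rename(source.c_str(), target.c_str()) == 0) return true;
    // rename can fail, e.g. across devices: fall back to copy followed by remove.
    return copyFile(source, target) && std::remove(source.c_str()) == 0;
}
```

A copy leaves the original file in place, which is what a full copy of a database requires, while a move frees the space at the old location.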
Error handling
Error handling is available via the C++ exception mechanism. In case of unintended conditions, the called procedure throws a std::runtime_error exception. Especially in pure C environments this requires additional care.
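One common way to take this care is to catch all exceptions at the language boundary and translate them into return codes. The following is a generic sketch of that technique under stated assumptions, not the mechanism ThreadDB itself provides: storeChecked and store_c are hypothetical functions invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <exception>
#include <stdexcept>

// Hypothetical C++ routine that reports failures by throwing.
void storeChecked(std::size_t size) {
    if (size == 0) throw std::runtime_error("empty item");
}

// Hypothetical C-callable wrapper: no exception may cross this boundary.
// Returns 0 on success and 1 on error; on error the message is copied
// into 'errorText' (truncated to fit and always null-terminated).
extern "C" int store_c(std::size_t size, char* errorText, std::size_t errorTextSize) {
    assert(errorText != nullptr && errorTextSize > 0);
    try {
        storeChecked(size);
        return 0;
    } catch (const std::exception& e) {
        std::strncpy(errorText, e.what(), errorTextSize - 1);
        errorText[errorTextSize - 1] = '\0';
        return 1;
    } catch (...) {
        std::strncpy(errorText, "unknown error", errorTextSize - 1);
        errorText[errorTextSize - 1] = '\0';
        return 1;
    }
}
```

A C caller then only checks the return code and, if it is nonzero, reads the message from the buffer it passed in.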
The examples are taken from the test routine main.cpp used during implementation. The following discussion focuses on the most important aspects of inserting and recovering data. The test basically consists of two runs: one with a limited number of packages (30) and a second one with an unlimited number of packages. This also gives an idea of the runtime effects of the synchronization overhead caused by package flushing.
The following example shows the general workflow of creating a threadDB database and inserting data into it. This is demonstrated using fixed size string data.
In a first step the tdb::database is created. The parameters PackageSize_p and PackageCacheLimit_p are provided by the calling routine. Next, four worker threads are registered at the database using tdb::database::NewThread. In the example, the folder "D:\tmp" is used to hold the temporary database files. Then 123 packages are created using tdb::database::NewPackage.
To demonstrate the capability of ThreadDB to handle multiple concurrent threads at once, four threads are generated which execute the routine threadStore in parallel. Therefore five threads (four worker threads + the main thread) are utilized to fill the database in parallel by calling tdb::database::Store. The package ids for inserting data are somewhat randomized using the term "(iter + iter % 123) % 123". The store operation of the main thread also provides a handle to store the tdb::ItemInfo entries. Later, this allows random access to individual data items.
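The distribution produced by the package id term can be checked with a few lines of code. The helper names packageIdFor and distinctPackageIds are hypothetical and exist only for this illustration. For iter below 123 the term equals 2 * iter modulo 123, and since 123 is odd, every one of the 123 packages is reached exactly once per 123 iterations, in the scattered order 0, 2, 4, ..., 122, 1, 3, ...

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// The term from the example, with the package count as a parameter.
std::size_t packageIdFor(std::size_t iter, std::size_t packageCount) {
    assert(packageCount > 0);
    return (iter + iter % packageCount) % packageCount;
}

// Number of distinct package ids produced by the iterations [0, iterations).
std::size_t distinctPackageIds(std::size_t iterations, std::size_t packageCount) {
    std::set<std::size_t> ids;
    for (std::size_t iter = 0; iter < iterations; ++iter)
        ids.insert(packageIdFor(iter, packageCount));
    return ids.size();
}
```

Consecutive iterations therefore land in different packages, which spreads the store requests across the package buffers.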
Finally tdb::database::Synchronize is called to flush the thread buffers to the temporary database files and prepare for reading.
The snippet also provides an example of how to use tdb::database::RelocateFileTo to move the temporary database file with index 0 to a different location. As demonstrated, this is possible while the asynchronous store operations are in full progress.
The next example focuses on different ways to recover data. In the first part, continuous stream reading is demonstrated, again utilizing multiple threads. Four threads are started executing the routine tdb::database::Recover. Each thread first opens randomly selected packages using tdb::database::Open to obtain a tdb::ReadInfo handle. Then the data items are read in a loop using tdb::database::Recover. After all data items have been read, the parameter pReadBytes_p is zero, indicating that no more data is available. The main thread operates in a similar way.
The second part shows random access to individual data items. Here the previously stored tdb::ItemInfo is used to select specific item data. Looping through the list of tdb::ItemInfo, the routine tdb::database::Recover is called. The example also demonstrates how to utilize tdb::database::Replace to modify individual data items.
The following list gives a brief overview of future extensions and improvements of ThreadDB:
Release 2.0
Provide template classes extending the basic STL containers.
Release 3.0
Reduce memory consumption further by also flushing the package header information to disc. This eliminates the ~180 bytes of additional memory for each package exceeding the PackageCacheLimit_p parameter.
Release 4.0
Introduce ThreadDB server mode. This major extension will provide means to register processes and threads (agents) from different systems with a central ThreadDB server. With this additional functionality, distributed databases holding up to terabytes of data will become available. In addition, the process becomes independent of local hardware restrictions.
Release 5.0
Add data compression functionality to minimize the consumed disc space.
Copyright (c) 2019 The ThreadDB Project
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of the <organization> nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL <COPYRIGHT HOLDER> BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.