Version 7.2.0 Released Oct 2023
===============================
- Fix for sparse solve with MAGMA with 1 MPI rank
- suppress several warnings
- small bugfix in BLR partitioning in sparse solver
- SYCL CMake compilation fix
- In StructuredMatrix interface, construct_from_elements now takes
  point geometry for faster HSS compression
- In StructuredMatrix interface, BLR compress_and_factor is now
  supported

Version 7.1.4 Released Aug 2023
===============================
- Memory leak fix from MPI_Datatypes
- Set RPATH in strumpack library
- Add counting of subnormal number in the sparse factors
- Changes in the BLR compression tolerances in the sparse solver,
  using a scaled absolute tolerance
- Fixes in the sparse solver using MAGMA, resetting error codes
  which show up in larger systems, but are not actually errors.
- Set SOVERSION
- Update CMake for HIP, using enable_language(HIP), requires CMake 3.21

Version 7.1.3 Released May 2023
===============================
- Workaround for SLATE bug in gemmA: disabling GPU off-loading in
  triangular solve for SLATE <= 20220700

Version 7.1.2 Released May 2023
===============================
- Fix hang for small problems

Version 7.1.1 Released April 2023
=================================
- Compilation fix for ROCm

Version 7.1.0 Released April 2023
=================================
- Bugfix in matrix equilibration code
- Several bugfixes, especially for SLATE and large problems on GPU
- Sparse triangular solve on the GPU when using MAGMA
- Other MAGMA fixes for the sparse direct solver (MAGMA still optional)
- New HSS random sketching operators based on sparse Johnson Lindestrauss
- Fix for HODLRMatrix construction from elements, or blocks
- Compilation fixes for NVHPC compiler
- SYCL updates
- Add lapmr routines, which are not available in Mac LAPACK implementations
- Support newer ( >= 1.0) ZFP versions
- Fixes for clang 15
- Add NDBFS GPU matrix ordering code


Version 7.0.1 Released Oct 2022
===============================
- Require C++14 instead of 17

Version 7.0.0 Released Oct 2022
===============================
- Many bugfixes and general improvements.
- Important fixes in the GPU code, and in the usage of SLATE
  (GPU capable ScaLAPACK replacement).
- The default ordering now uses METIS_NodeND, instead of the
  (undocumented) METIS_NodeNDP routine. This can impact performance,
  or for some problems lead to stack overflow, but for others it
  drastically reduces memory usage. The old behavior can be restored
  with --sp_enable_METIS_NodeNDP.
- Improvements to error handling, mostly related to zero pivots.
- Added new ordering options: AND, MMD, AMD
- Require C++-17


Version 6.3.1 Released March 2022
=================================
- Fix for setting CUDA/HIP device when there are multiple,
  but MPI was not initialized
- Memory leak fix in distributed memory GPU code
- Fixed small memory leaks from MPI datatypes
- Change in BLR algorithm selection options
- Changed default blocksize for 2D block cyclic distribution
  when using SLATE to 512
- Add 64bit support in the matching (MC64)
- Fix installation of Fortran modules


Version 6.3.0 Released February 2022
====================================
- Change default sparsity reducing ordering to use METIS_NodeNDP
  (from METIS_NodeND)
- Significant performance improvements in the GPU code for the
  direct solver, from the NERSC December 2021 Hackathon
- Performance fix in symbolic phase
  (affecting only some MPI implementations)
- Bump minimum CMake version to 3.17
- CMake fix for Perlmutter
- Compilation fix for GCC 8
- Add support for single precision HODLR/Butterfly
  (now requiring ButterflyPACK >= 2.1.0)


Version 6.2.1 Released November 2021
====================================
- HIP compilation fix
- Change default minimum front size for compression to very large
  values, use minimum separator size instead.


Version 6.2.0 Released November 2021
====================================
Bugfix release, with some changes in the C interface.
- Additions to the C interface for the sparse solver:
  . Solve with multiple RHS
  . Specify grid dimensions, stencil, for geometric ordering
  . Fix typo in GPU enabling routine
  . Set lossy precision
- Bugfix in passing grid via options object
- Bugfix in LOSSY/LOSSLESS compression using ZFP
- Compilation fix for nvcc not finding MPI when the CXX/C
  compilers are set to the MPI wrappers
- Fix Fortran module case in CMake when using CCE
- Disable GPU if CUDA/HIP are not enabled


Version 6.1.0 Released October 2021
===================================
- Change default BLR blocksize, and in the BLR preconditioner use
  right-looking algorithm as the default. This has higher peak
  memory usage, but is much faster and more robust.
- Fixes in printing of compression statistics.
- Always build the test when doing 'make', you no longer have to
  run 'make tests'.
- Setup CI at github and gitlab, remove Travis testing.


Version 6.0.0 Released September 2021
=====================================
- Block Low-Rank compression is now supported in the sparse
  solver, resulting in an efficient, scalable and robust
  precoditioner.
- Improved GPU performance for the sparse direct solver
- Add the StructuredMatrix class, for a general interface to
  different rank structured formats. This also comes with a C and a
  Fortran interface.
- Several performance improvements


Version 5.1.1 Released January 22, 2021
=======================================
- Small change in the build script for using HIP/ROCm. See
  example_build tulip for an example of how to build with HIP/ROCm
  support


Version 5.1.0 Released January 21, 2021
=======================================
- Improvements in the distributed BLR code, this now works very
  well as a general purpose preconditioner.
- Added a mixed precision solver class, taking input and output in
  double precision, but doing the factorization in single. Thanks to
  Michael Neuder.
- Several bug fixes.
- Small changes in the HIP/ROCm code.


Version 5.0.0 Released October 12, 2020
=======================================
- Added distributed memory BLR-based preconditioner. For many
  problems, this is much more robust than the HSS or HODLR/HODBF
  preconditioners.
- Several bugfixes in the GPU code.
- Added HIP support (for the direct solver).
- Other bugfixes.


Version 4.0.0 Released August 17, 2020
======================================
 - The CMake build system has been overhauled completely, now using
   modern CMake.
 - Added a Fortran interface for the sparse solver.
 - Updated C interface to the sparse solver.
 - Improved HODLR, with butterfly, preconditioning. Now relying on
   Butterflypack 1.2.0.
 - Much improved CUDA support in the sparse direct solver. CUDA
   acceleration is not working yet for the rank-structured solvers.
 - Added matrix equilibration to the sparse solver. This slightly
   improves numerical accuracy in many cases.
 - Added Lossy compression: --sp_compression lossy, using ZFP.
   Make sure to configure with ZFP support!
 - Removed some options: --sp_enable HSS, --sp_HSS_min_sep_size,
   etc..  Now one should use: --sp_compression HSS,
   --sp_compression_min_sep_size ...
 - Fix for compilation without MPI.
 - Several bug fixes and performance improvements
 - Renamed FC_GLOBAL Fortran to C macro to STRUMPACK_FC_GLOBAL,
   in order to avoid conflicts with SuperLU.
 - Large refactoring of the headers into cpp files for faster
   compilation.
 - Update minimum required version of CombBLAS to 2.0.


Version 3.3.0 Released November 7, 2019
=======================================
 - Initial support for cuBLAS and cuSOLVER.
 - Improved performance in the HODLR (+ butterfly) preconditioner
 - Added a Helmholtz example
 - Change default HSS and HODLR leaf sizes to 512 (from 128)
 - Only print command line option descriptions from the root (of
   MPI_COMM_WORLD)
 - Add ANOVA kernel support
 - Fix 32 bit overflow when calling MPI_Alltoall
 - Rewrite permutations using xlapmr, using manual loops, for
   faster solve
 - Many other fixes and improvements


Version 3.2.0 Released August 15, 2019
======================================
 - Added interface to Hierarchically Off-Diagonal Low-Rank and
   Butterfly matrix codes from ButterflyPACK, see
     https://github.com/liuyangzhuan/ButterflyPACK


Version 3.1.1 Released October 25, 2018
=======================================
 - Fix check for libatomic, needs to be linked explicitly on some
   versions of Clang
 - Do not call the IDEAS: xSDK standards module

Version 3.1.0 Released October 18, 2018
=======================================
 - Changes to the build system for xSDK compliance.
   Third party packages are now enabled using -DTPL_ENABLE_<package>=ON,
   and libraries and includes specified as
   -DTPL_<package>_INCLUDE_DIRS=.. -DTPL_<package>_LIBRARIES=..
 - Always generate position independent code.
 - Suppress many warnings when using clang.
 - Fix possible memory leak.
 - Fix linking issue whith atomic operations.

Version 3.0.3 Released October 4, 2018
======================================
 - Check for support of the OpenMP priority clause (OpenMP 4.5)

Version 3.0.2 Released October 3, 2018
======================================
 - Small fixes in the C interface

Version 3.0.1 Released October 2, 2018
======================================
 - Small bugfixes.


Version 3.0.0 Released September 28, 2018
=========================================
 - The scalability of the sparse HSS preconditioner has been
   drastically improved. Extraction of elements (for the diagonal
   blocks and the B_ij generators) is now done for multiple blocks
   concurrently.
 - Add option to use Combinatorial BLAS approximate weight perfect
   matching instead of MC64. Only works in parallel, requires a square
   number of processes. This currently requires a special version of
   CombBLAS.
 - Fix integer overflow in the counting of nonzeros in the factors.
 - Allow compilation without MPI (and ScaLAPACK, ParMetis..)
 - Add experimental support for Block Low-Rank compression, for both
   sparse (preconditioning) and dense. For now, this is only shared
   memory.
 - Improvements in the build system: In the CMake script, check
   whether OpenMP task dependencies and OpenMP taskloops are
   supported.
 - Removed the CSCMatrix class for compressed sparse column storage,
   this was not tested/used.
 - Revamped website (thanks Lucy), and doxygen documentation/manual
 - Added lots of doxygen documentation/comments. Removed the pdf
   manual, all the documentation is now online at:
   http://portal.nersc.gov/project/sparse/strumpack/master/index.html
   http://portal.nersc.gov/project/sparse/strumpack/v3.0.0/index.html
 - Several bugfixes!


Version 2.2.0 Released March 31, 2018
=====================================
 - Changes in the build system
    - ParMETIS, (PT)Scotch are now optional
 - Work to make STRUMPACK compatible with the xSDK policies
    - Moved header files in subfolders
    - Made interface const correct
 - Support for multiple right-hand sides
 - Improved threading in HSS code, improved performance for the
   hybrid MPI+OpenMP code: many unnecessary matrix copies are
   now avoided
 - Flops are now counted/reported correctly when running with
   MPI and or OpenMP.
 - geometric nested dissection now supports wider stencils and
   multiple degrees of freedom per node (TODO add in MPI code)
 - Many performance and several bug fixes


Version 2.1.0 Released October 26, 2017
=======================================
 - Cleanup in the C interface.
 - Removed (set_)HSS_min_front_size and --sp_hss_min_front_size from
   the manual, it is not supported (yet)
 - In some of the examples, the PBLAS blocksize was set to 3 (for
   debugging). This has been removes, and the default blocksize of 32
   is now used, leading to much better performance for those examples.
 - Small bugfix in the adaptive HSS compression stopping criterion.
 - Disable replacement of tiny pivots by default. It can lead to
   convergence problems for large matrices.


Version 2.0.1 Released October 7, 2017
======================================
 - Critical bug fix in HSS Schur complement update.
 - Some minor performance improvements.
 - Valgrind fixes, compiler warning fixes (when not compiling with
   OpenMP).


Version 2.0.0 Released October 1, 2017
======================================
This is a major revision. From now on STRUMPACK is released as a
single library, including both the sparse and the dense
components. Unfortunately, the dense code is not documented yet.
Additionally, the development version of the code is now available
from the public github repository
  https://github.com/pghysels/STRUMPACK

The main changes since release 1.1.1 are:
 - The options for StrumpackSparseSolver are now set through an object
   of type SPOptions, stored in the StrumpackSparseSolver class.
 - The template parameter for the real type has been removed. It is
   now derived from the scalar type.
 - The HSS code has been completely rewritten, performance is much
   improved, memory leaks and valgrind errors have been fixed.
 - A new, more robust adaptive HSS compression algorithm has been
   implemented. This was developed in collaboration with Theo Mary
   from Université Toulouse.
 - We have automated testing of the code.
 - Many bugs have been fixed. However, some bugs probably still
   remain. If you happen to encounter any problems while running
   STRUMPACK, please do not hesitate to contact us.


Version 1.1.1 Released July 16, 2017
====================================
- Add function STRUMPACK_set_from_options_no_warning_unrecognized(..)
  suppress warning


Version 1.1: Released November 8, 2016
======================================
 - Rewrite reordering code, can now use sequential METIS and SCOTCH
   from distributed interface.
 - Change default minimum HSS front size to 2500 (as used in ipdps17 paper).
 - Performance improvements in HSS code, mainly HSS compression.
 - Add RCM ordering.
 - Adaptive HSS compression in MPI code. This changes the default
   HSS preconditioning strategy to ADAPTIVE.
 - Add option to choose between METIS_NodeND and METIS_NodeNDP.
 - Add a program to read matrix market file and print out binary file.
 - Several other bug fixes.


Version 1.0.4: Released August 4, 2016
======================================
 - Moved examples to the exmples/ folder, deleted test folder.
 - Add pde900.mtx test matrix market file.
 - Add README to examples.
 - Fixes for memory leaks!
 - Fix bug in separator reordering and another fix in distributed
   separator reordering.
 - Print message if LU fails, tell user to try to enable MC64 if not
   already enabled.
 - Include missing stdlib.h.
 - Small performance enhancements in extend-add.
 - Small improvement in proportional mapping.
 - Various performance improvements throughout the code:
   Several HSS algorithms were done serially, first one child, then the other.
 - Added some timers for profiling, enable with -DUSE_TASK_TIMER.
 - Performance improvements in front_multiply_2d (for random sampling
   of front).
 - Add a faq in the manual.
 - Describe all command line options in the manual.
 - Avoid recursion in the e-tree.
 - Change the default relative compression tolerance.


Version 1.0.3: Released June 6, 2016
====================================
 - Fix for building a shared library (thanks to Barry Smith)
 - Fix for complex numbers (thanks to Barry Smith)
 - Add an example folder with an example simple Makefile


Version 1.0.2: Released June 2, 2016
====================================
 - Explain how to tune the preconditioner in the manual.
 - Suppress output from DenseMPI.
 - Print warnings/errors to cerr iso cout.
 - Allow compilation without OpenMP.
 - Fix some compiler warnings.
 - Fizes for (Apple)Clang.
 - Improve CMake detection of ScaLAPACK.


Version 1.0.1: Released May 23, 2016
====================================
 - Small fix for GCC 6.1
 - Update TODO
 - FIX: set min_rand_HSS equal to number of columns in R.
 - Code to print some stats about front size and nr random vectors and ranks.
 - Change to the default rank pattern strategy.
 - Check fgets return code.
 - Update build script, check for ScaLAPACK.
 - Print clear warning if BLAS/LAPACK not found.
 - Update STRUMPACK README.
 - Print correct Metis error code.
 - Micro optimizations.
 - Fix MPI communicator bug.
 - Fix memory leaks.
 - Fix very slow separator reordering for MPIDist interface.
 - Add description of block-distributed CSR to the manual.
 - Use CSR iso triplets in prop dist sparse matrix.
 - Missing return statements.
 - Do not use __gnu_parallel.
 - Fix bug in Redistribute (integer_t -> float).
 - Add 64 bit MPIDist test/example.
 - Add 64 integer test.
 - Check PETSc input for NULL.


Version 1.0.0: Released May 4, 2016
===================================
 - Initial release of the STRUMPACK-sparse code with support for MPI+OpenMP

