DB Migration Frameworks
In today’s fast-paced digital landscape, the ability to migrate databases efficiently is crucial for businesses seeking to improve performance, scale operations, or transition to new technologies. Database migration frameworks streamline this process, helping to maintain data integrity while minimising downtime. These frameworks often come with tools and features that automate tasks, enhance security, and provide clear version control, making the migration process not only faster but also more reliable. Three prominent database migration frameworks stand out in this arena:
Flyway: This open-source migration tool is known for its simplicity and effectiveness. Flyway defines migrations as versioned SQL scripts or Java classes, which makes it easy for developers to track which version of the schema a database is on. With its lightweight nature and compatibility with numerous databases, it’s a favourite among teams looking for a straightforward yet powerful solution.
Liquibase: Offering a more feature-rich alternative, Liquibase provides a comprehensive suite of tools for database refactoring and migration management. Its change logs can be written in XML, YAML, or JSON (as well as formatted SQL), giving developers flexibility and control over the migration process. Liquibase also integrates with various CI/CD pipelines, supporting collaborative workflows and continuous integration practices.
DbMate: A newer player in the field, DbMate focuses on simplicity and speed. It’s particularly appealing for developers who favour minimalism, offering an intuitive command-line interface for managing migrations. With support for multiple databases and an emphasis on developer experience, DbMate is quickly gaining traction among startups and small teams.
In conclusion, choosing the right database migration framework is crucial for ensuring a smooth transition while maintaining data integrity. Each framework offers unique features and benefits, catering to different project needs and team sizes.
By leveraging these tools, organisations can not only enhance their database management but also pave the way for more innovative and scalable solutions in the future. As technology continues to evolve, staying informed about the best practices and tools for data migration will be essential for businesses aiming to thrive in the digital age.
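To make the version-control idea concrete, here is a minimal sketch of the general pattern these tools share: keep an ordered list of migrations, record which ones have been applied in a bookkeeping table, and apply only the pending ones. It is not the implementation of Flyway, Liquibase or DbMate. The table name `schema_history`, the `MIGRATIONS` list and the `migrate` function are assumptions made for illustration, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical migrations: (version, description, SQL). Real tools usually
# read these from versioned files rather than from an in-code list.
MIGRATIONS = [
    (1, "create users table",
     "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "add email column",
     "ALTER TABLE users ADD COLUMN email TEXT"),
]


def applied_versions(conn):
    """Return the set of migration versions already recorded in the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_history ("
        "version INTEGER PRIMARY KEY, description TEXT NOT NULL)"
    )
    return {row[0] for row in conn.execute("SELECT version FROM schema_history")}


def migrate(conn, migrations=MIGRATIONS):
    """Apply pending migrations in version order; return the versions applied."""
    done = applied_versions(conn)
    newly_applied = []
    for version, description, sql in sorted(migrations):
        if version in done:
            continue  # already applied on an earlier run
        conn.execute(sql)
        conn.execute(
            "INSERT INTO schema_history (version, description) VALUES (?, ?)",
            (version, description),
        )
        conn.commit()
        newly_applied.append(version)
    return newly_applied


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(migrate(conn))  # first run applies everything pending
    print(migrate(conn))  # second run finds nothing left to apply
```

Because the applied versions are stored in the database itself, running the same command against development, staging and production brings each one up to the same schema version without reapplying earlier changes.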
Machine Learning Operations
With the wave of new artificial intelligence releases over the last couple of years, companies are beginning to accelerate the production of data science models. What used to be confined to companies with niche data strategies is now diffusing to the wider ecosystem. Companies are investing in platforms, processes and methodologies, feature stores, machine learning operations (MLOps) systems, and other tools to increase productivity and deployment rates. MLOps systems monitor the status of machine learning models and detect whether they are still predicting accurately. If they’re not, the models might need to be retrained with new data. I will go into more detail on how companies use machine learning in another post. Many of these capabilities come from agencies or external vendors who provide a platform for other, more nascent companies to train and deploy machine learning models. However, some organisations are now developing their own platforms. Although automation is going a long way to bolster productivity and broaden participation in data science, the most notable achievement here is that companies are able to reuse existing methodologies, data sets, and even entire models.
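As a rough illustration of the monitoring idea described above, the sketch below tracks a model's accuracy over its most recent predictions and flags when it drops below a threshold. The class name `AccuracyMonitor`, the window size and the threshold are all assumptions chosen for the example. Production MLOps platforms typically track many more signals than a single accuracy figure.

```python
from collections import deque


class AccuracyMonitor:
    """Track rolling accuracy of a deployed model over its latest predictions."""

    def __init__(self, window_size=100, threshold=0.8):
        self.threshold = threshold
        # Only the most recent `window_size` outcomes are kept.
        self.outcomes = deque(maxlen=window_size)

    def record(self, predicted, actual):
        """Store whether one prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Return accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Flag the model once rolling accuracy falls below the threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.threshold


if __name__ == "__main__":
    monitor = AccuracyMonitor(window_size=4, threshold=0.75)
    for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
        monitor.record(predicted, actual)
    print(monitor.accuracy(), monitor.needs_retraining())
```

A rolling window is used so that recent behaviour dominates: a model that was accurate last year but is drifting today should still trigger the retraining flag.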
Distributed SQL Engines
Databases and relational database management systems (RDBMS) allow companies to leverage their data and efficiently create, read, update and delete (CRUD) data. A study by McKinsey in 2022 showed that among financial-services leaders, only 13 percent had half or more of their IT footprint in the cloud. As companies use more and more data, the processes that allow the data to be used efficiently will no doubt need to be optimised. This optimisation may come in the form of distributed SQL engines, which allow processing and retrieval of data. Distributed SQL engines were derived from the concept of parallelisation, a method of high-performance computing whereby a computer program or system breaks a problem down into smaller pieces to be solved independently and simultaneously by discrete computing resources. Distributed SQL engines increase compute power by linking multiple database servers under the hood of one RDBMS. This allows companies to prioritise the scalability, reliability, and usability of the orchestrating ecosystem while maintaining the robust, ACID-compliant transactions of a traditional RDBMS.
Under this hood are a) virtualisation and b) an abstraction layer. This abstraction layer allows users and developers to interact with virtualised resources without needing to understand the intricacies of the underlying hardware and, crucially in this context, gives data scientists and analysts access across disparate data sources. This means that you can query relational and non-relational data together in a scale-out fashion for better query performance. As such, “distributed” doesn’t just refer to the query itself but also to storage and compute. Companies wanting to carry out analytics on terabytes of data will opt for technologies using distributed query engines to optimise performance. The engines are primarily used for intensive OLAP queries and avoid the fragility and inconsistent performance seen in non-distributed query engines.
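The parallelisation idea can be sketched as a scatter-gather query: a coordinator sends the same aggregate query to every shard, each shard answers for its own partition, and the coordinator merges the partial results. This is a toy illustration under stated assumptions, not how any particular engine works. Each in-memory SQLite database stands in for a separate database server, and the `orders` table and the functions `query_shard` and `distributed_average` are hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def query_shard(rows):
    """Stand-in for one database server: load its partition, return partial aggregates."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    partial = conn.execute(
        "SELECT region, SUM(amount), COUNT(*) FROM orders GROUP BY region"
    ).fetchall()
    conn.close()
    return partial


def distributed_average(shards):
    """Scatter the query to all shards in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(query_shard, shards))

    totals = {}
    for partial in partials:
        for region, amount_sum, count in partial:
            running_sum, running_count = totals.get(region, (0.0, 0))
            totals[region] = (running_sum + amount_sum, running_count + count)

    # The average is computed only after merging sums and counts.
    return {region: s / c for region, (s, c) in totals.items()}


if __name__ == "__main__":
    shards = [
        [("EU", 10.0), ("US", 20.0)],
        [("EU", 30.0)],
        [("US", 40.0), ("US", 60.0)],
    ]
    print(distributed_average(shards))
```

Each shard returns a sum and a count rather than an average, because averaging per-shard averages gives the wrong answer when partitions differ in size. Choosing partial results that can be merged correctly is a large part of what a distributed query planner does.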
Early, well-known technologies such as Hadoop use parallel processing engines to query and analyse data stored on the Hadoop Distributed File System (HDFS). Many subsequent distributed query engines build on the Hadoop ecosystem and are used for batch-style data processing. Each distributed query engine varies, with some breaking SQL queries into multiple stages and storing intermediate results on disk and others taking advantage of in-memory processing and caching. However, looking at this holistically, these technologies share the ideas behind MapReduce, a framework for processing parallelisable problems across large datasets using a large number of computers. Collectively, these computers are referred to as a cluster (so long as all computers/nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) like HDFS or in a database (structured). MapReduce can take advantage of the locality of data, processing it near the place it is stored in order to minimise communication overhead.
Some of the most well-known technology companies leverage significant compute power to provide products and services, taking advantage of the cloud and distributed architectures, including the separation of compute and storage. For example, companies such as Netflix are known for having microservices that use different kinds of databases based on the capabilities of each database. Some of these microservices rely on data sources such as Hadoop or AWS S3, or on data from multiple sources. Crucially, a distributed SQL query engine allows data to be queried from a variety of these sources within a single query. Example query engines include Presto, Apache Drill and Apache Spark. Companies such as Netflix and Uber use these engines to drive analysis across disparate datasets.
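The MapReduce model mentioned above can be illustrated with the classic word-count example. This is a single-process sketch of the three phases (map, shuffle, reduce), not Hadoop's API. The function names are assumptions for illustration, and in a real cluster each input split would be mapped on the node that stores it.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each split is mapped independently, so this step could run on separate nodes.
    mapped = [map_phase(doc) for doc in documents]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data lake"]))
```

The map and reduce functions never share state, which is what lets the framework spread them across a cluster and rerun any piece that fails.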
The number of companies creating new and innovative services and products to suit business needs continues to increase and this will no doubt make it easier for companies of any size and in any industry to leverage their data.
ML and Business Strategy
There are no doubt numerous techniques by which data can be collected, and even more numerous data sources which store the data itself. However, once data engineers have extracted, transformed and loaded this data into a data store, it needs to be analysed. Many companies are choosing to use artificial intelligence and machine learning (ML) to glean insights from their data. The insights supplement decision making or, in some cases, completely automate the decision-making process. There are a couple of examples of this. In the healthcare industry triage process, machine learning is used to categorise and prioritise incoming patients. Providing nursing staff with this technology helps increase diagnostic and therapeutic assessments and also enables remote triaging. The machine learning models most frequently used here are based on XGBoost and deep neural networks; other models such as logistic regression have been considered, but their performance is lower.
One of the main industries that uses ML is financial services (FS). Here, ML is increasingly used by companies to aid decision making and the automation of processes. According to the Bank of England, 72% of UK financial services firms are developing machine learning applications, with the insurance and banking sectors setting the pace for absolute usage. Despite this, firms are aware of the constraints on machine learning deployment that arise because the Prudential Regulation Authority’s and Financial Conduct Authority’s existing regulations lack clarity. An important point to raise here is that, despite this lack of clarity, regulatory authorities still need to make sure their regulations support the safe and responsible adoption of machine learning. Within FS, the main business areas in which ML is deployed include customer engagement, and risk management and compliance. Customer engagement has the highest percentage of post-deployment applications and is seen at various stages throughout the customer lifecycle.
There are many other business areas which use it, but these are not within the scope of this blog. Firms deploy ML applications for a variety of uses, and these applications can be implemented either internally or externally. They can be externally implemented by third-party vendors or co-implemented with third-party vendors providing any of the services in the vertical integration. This may be cloud storage, the ML models, software packages or data input. These applications need to be monitored and tested to validate performance. The most commonly used method is outcome monitoring against a benchmark. Here, the performance and outputs of the model are compared against historical data. The historical data used will vary depending on the business area or industry but is most commonly profitability, customer satisfaction or pricing. The next most common method is data-quality validation, which is used to detect errors, biases and risks in the data. Once model performance is validated and models are deployed, results can be used to aid decision-making.
In financial services, ML models are most commonly used in pricing and underwriting, with complex models used for credit pricing and insurance underwriting. These models are at an advanced stage of deployment and are used in expected loss accounting, claims accounting, monitoring for insider trading or market manipulation, directing queries within customer interfaces, compliance/AML/KYC checks, trading strategy and execution, and payments authorisation. ML has a variety of use cases and is being adopted by more and more firms, not just in financial services but also in other industries such as healthcare. Most businesses use it to aid customer engagement, where it is used to classify, predict or optimise data based on customer behaviour, which in turn feeds into strategic decision-making.
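The two validation methods described above, outcome monitoring against a benchmark and data-quality validation, can be sketched as follows. This is a simplified illustration rather than a regulatory or industry-standard procedure. The function names, the 5% tolerance, and the example fields `age` and `income` are all assumptions.

```python
from statistics import mean


def outcome_check(model_outputs, benchmark_outputs, tolerance=0.05):
    """Return True if the mean model output is within `tolerance` of the benchmark mean."""
    model_mean = mean(model_outputs)
    benchmark_mean = mean(benchmark_outputs)
    relative_gap = abs(model_mean - benchmark_mean) / abs(benchmark_mean)
    return relative_gap <= tolerance


def data_quality_issues(records, required_fields, valid_ranges):
    """Return (row_index, problem) pairs for missing fields or out-of-range values."""
    issues = []
    for i, record in enumerate(records):
        for field in required_fields:
            if record.get(field) is None:
                issues.append((i, f"missing {field}"))
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            if value is not None and not (low <= value <= high):
                issues.append((i, f"{field} out of range"))
    return issues


if __name__ == "__main__":
    # Hypothetical prices produced by a model versus historical prices.
    print(outcome_check([100, 102, 98], [100, 100, 100]))

    records = [
        {"age": 30, "income": 50000},
        {"age": None, "income": 40000},
        {"age": 150, "income": 1000},
    ]
    print(data_quality_issues(records, ["age", "income"], {"age": (18, 100)}))
```

Running the data-quality check before the outcome check is a sensible default: a model compared against a benchmark on faulty inputs can look worse, or better, than it really is.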
Businesses will no doubt continue to deploy more complex ML models and as these processes become more common and understood they will trickle down to smaller firms.