Category Archives: Data

How to count nulls and hard-coded text that signifies null in a Pandas DataFrame

Use case: sometimes the files being processed represent nulls with hard-coded text such as “NULL”, meaning the field is not actually empty or null.

Below are some Python functions and a test demonstrating the functionality.

import pandas as pd

def getListOfMissingValues():
    """
    desc: List of common words used to represent null that are often found in files as text
    """
    lst = ['NaN', 'NAN', 'nan', 'null', 'NULL', 'nul', 'NUL', 'none', 'NONE', '', ' ', '\t']
    return lst

def getListOfFieldNames(df):
    """
    desc: List of the field (column) names of a DataFrame
    param p1: DataFrame name
    return: List of field names
    """
    return df.columns.values.tolist()
	
def advanceMissingValues(df):
    """
    desc: Count nulls and hardcoded text that represents nulls
    param p1: DataFrame name
    return: DataFrame of field names and count values
    """
    lstMissingVals = getListOfMissingValues()
    col_list = getListOfFieldNames(df)
    output = pd.DataFrame(col_list)
    output.rename(columns = {0:'FieldName'}, inplace = True)
    output['Count'] = ''
    
    #For each field name count nulls and other null type values
    for col in col_list:
        nullCnt = df[col].isnull().sum(axis=0)
        #For each missing value perform count on column
        missValCnt = 0
        for missVal in lstMissingVals:
            missValCnt = missValCnt + len(df[(df[col]==missVal)])
 
        cntTotal = nullCnt + missValCnt
        output.loc[output['FieldName'] == col, 'Count'] = cntTotal

    return output

#Test Setup
lst = ['NaN', 'NAN', 'nan', 'null', 'NULL', 'nul', 'NUL', 'none', 'NONE', '', ' ', '\t', None]
mdf = pd.DataFrame(lst)
mdf.rename(columns = {0:'NullTypes'}, inplace = True)
print(mdf)

#Run Test
chk = advanceMissingValues(mdf)
chk

Sample output:
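As an aside, the same per-column totals can be produced without the inner loop by combining DataFrame.isin with isnull. The sketch below is not part of the original example; it assumes the functions above are already defined, and the function name advanceMissingValuesVectorized is purely illustrative.

def advanceMissingValuesVectorized(df):
    """
    desc: Same output as advanceMissingValues but using vectorized pandas operations
    param p1: DataFrame name
    return: DataFrame of field names and count values
    """
    lstMissingVals = getListOfMissingValues()
    #Count hard-coded null text and genuine nulls per column in one pass
    counts = df.isin(lstMissingVals).sum() + df.isnull().sum()
    output = counts.reset_index()
    output.columns = ['FieldName', 'Count']
    return output

Running chk = advanceMissingValuesVectorized(mdf) against the test DataFrame above should return the same counts as advanceMissingValues.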

How to get a substring between two characters with T-SQL

This is a very common activity in the data world: there is some data in a text string that you need, and the rest of the string is just in your way. Some use cases might be that you have a reference in a filename you need to extract, you need a snippet of data to create a composite key, or there is an order number surrounded by other data that is not relevant to your needs.

The following is some simple T-SQL that will extract the data you want from a text string, provided the data has a specific delimiting character on each side of it.

/*Delimiter variables, first and second position*/
DECLARE @dfp AS CHAR(1);
DECLARE @dsp AS CHAR(1);
DECLARE @text VARCHAR(MAX);

SET @dfp = ';';
SET @dsp = '@';
SET @text = 'I want you to ;Extract this@ substring for me please.';
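/*
Start position: one character after the first delimiter.
Length: (CHARINDEX(@dsp, @text) - 2) - CHARINDEX(@dfp, @text) + LEN(@dsp),
which for single-character delimiters simplifies to
CHARINDEX(@dsp, @text) - CHARINDEX(@dfp, @text) - 1,
i.e. the number of characters between the two delimiters.
*/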

SELECT SUBSTRING(@text, (CHARINDEX(@dfp, @text) + 1), (CHARINDEX(@dsp, @text) - 2) - CHARINDEX(@dfp, @text) + Len(@dsp))


How to format SQL Server datetime as dd/mm/yyyy hh:mm:ss

If you are exporting the results of a SQL Server query to Excel, typically the recipient of the file wants the dates in the format “dd/mm/yyyy hh:mm:ss” or “dd/mm/yyyy”, not the usual database format yyyy-mm-dd.

The below query formats the datetime as desired. Note that the letters MM representing the month are capitalised. If they are not, the engine will interpret lowercase mm as minutes, so you will end up with days, minutes, years.

Also note that the letters HH representing the hours are capitalised. Capitalising the H outputs the time in 24-hour format; lowercase hh gives 12-hour format. It is highly recommended not to use the lowercase format.

SELECT FORMAT(GETDATE(), 'dd/MM/yyyy HH:mm:ss', 'en-us')

If you only want the date and not the time, just remove the relevant part of the format string, i.e. dd/MM/yyyy for the date only, or dd/MM/yyyy HH:mm for a datetime without seconds.


Do you need a Data Lake?

Summary

Among data specialists who do not work in the field of Big Data there can be confusion surrounding the term Data Lake. This is because there is an apparent overlap in role and function between Data Lakes and the more traditional Data Warehouses, which data professionals will be more familiar with. The confusion is not helped by the term Data Lake itself being overloaded, which will be discussed later in this article. Despite this overlap, however, Data Lakes occupy their own distinct role and perform functions Data Warehouses cannot.

Data Lakes have tremendous utility, but damagingly there is also a mass of literature pushing the concept as a cure-all that, coincidentally, will also require you to migrate your organization’s Business Intelligence center into the cloud. The following statements will hopefully dispel some of the associated hucksterism.

  • Data Lakes are not Data Warehouses 2.0, i.e. they are not the evolution of a Data Warehouse.
  • Data Lakes have not replaced Data Warehouses in performing the role of housing aggregated data.
  • Data Lakes will not free you from the burden of developing ETLs or establishing robust Data Architecture and strong Data Governance.

Introduction

It is important to first clarify that both Data Warehouses and Data Lakes are abstract concepts independent of any particular software or vendor. A Data Warehouse can be created in any database engine, such as SQL Server, PostgreSQL, Oracle or MySQL. Similarly, a Data Lake can be deployed on any suitably large data storage platform, whether an on-site data center or hosted in the cloud.

In basic terms, both Data Warehouses and Data Lakes can be thought of as the place where all data relevant to an organization’s goals is pulled together from various sources, both internal and external (increasingly external). They both exist to facilitate an all-encompassing view of an organization and how well it performs, or to provide a greater understanding of the organization’s environment, opportunities (e.g. customer preferences and habits) and threats. However, they differ in the data they are optimized to handle and are therefore better suited to different use cases.

What is a Data Warehouse?

A Data Warehouse is a method of storing and organising data that is optimized to support Business Intelligence (BI) activities such as analytics. To put it another way, a Data Warehouse exists solely, and is constructed, to answer big questions efficiently. For this reason it typically holds vast quantities of historical data. The data within a data warehouse is usually derived from a wide range of sources, such as application log files, but primarily from transactional applications (Oracle, 2019). However, in contrast to a transactional database where each transaction is a separate record, the relevant entries in a Data Warehouse are typically aggregated, although it can also hold transaction records for archival purposes.

Figure 1: Typical Data Warehouse architecture of an SME (Databricks, 2019)

Single transaction records on their own are typically not very insightful to an organization trying, for example, to identify consumer trends. Aggregating data based on facts and dimensions, e.g. the number of sales (fact) for a particular store (dimension), saves disk space and allows queries looking for that specific answer to be returned quickly. Data Warehouses mostly contain numeric data, which is easily manipulated. As an example, store sales might be the summation of thousands of rows of data into a single row.

Figure 2: Simplified example of a Data Warehouse internal structure (BIDataPro, 2018)
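To make the aggregation described above concrete, here is a minimal, hypothetical sketch in Python/pandas (not taken from the article, with made-up store names and amounts) showing transaction-level rows being collapsed into one aggregated row per store:

import pandas as pd

# Hypothetical transaction-level sales data, one row per sale
sales = pd.DataFrame({
    'store': ['Dublin', 'Dublin', 'Cork', 'Cork', 'Cork'],
    'amount': [10.0, 25.0, 5.0, 12.5, 7.5]
})

# Aggregate the fact (sales amount) by the dimension (store),
# collapsing many transaction rows into a single row per store
store_sales = sales.groupby('store', as_index=False)['amount'].sum()
print(store_sales)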

Data Warehouses also solve the problem of trying to derive information when there are too many sources, e.g. a multinational with thousands of store locations and subsidiaries, by creating a “single source of truth”. Effectively this means pulling all the information into one central location, transforming the data for uniformity and storing like-for-like data together. For example, this could mean gathering all sales data from multiple locations and converting the currency to dollars. Having all of the data in one place allows different sources, which serve different purposes, to be combined in a single query. For example, a report that links sales data and logistical data, coming from POS and SCM systems respectively, may not be possible with a single query if the systems are not linked, and if best practices regarding I.T. security are being followed they certainly should not be.

Data Warehouses are fed from source systems using an extract, transform and load (ETL) solution: data is extracted from a source system, transformed to meet the schema and business requirements of the Data Warehouse, and then loaded. This is a data delivery method independent of any particular software vendor; there are various software products for accomplishing ETLs, including the option to create a custom application. A variation of this process is extract, load and transform (ELT), in which the data is landed into tables raw and later transformed to meet the schema and business requirements of its intended final table. This method allows for greater auditability, which can aid regulatory compliance or post-mortems if the transformation process fails.
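As a rough illustration of the difference between the two approaches (not from the article; Python, pandas and SQLite are used purely as stand-ins for real ETL tooling, and the file and column names are hypothetical):

import sqlite3
import pandas as pd

conn = sqlite3.connect('warehouse_demo.db')

# Extract: read the raw data from a (hypothetical) source export
raw = pd.read_csv('sales_export.csv')

# ETL: transform first (e.g. standardise the currency), then load the final table
transformed = raw.assign(amount_usd=raw['amount'] * raw['fx_rate'])
transformed.to_sql('fact_sales', conn, if_exists='replace', index=False)

# ELT: land the raw extract as-is and transform it later inside the warehouse
raw.to_sql('staging_sales_raw', conn, if_exists='replace', index=False)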

Once set up the Data Warehouse can facilitate statistical analysis, reporting, data mining and more sophisticated analytical applications that generate actionable information by applying machine learning and artificial intelligence (AI) algorithms (Oracle, 2019).

A single source of truth is very important to an organization because it eliminates inconsistencies in reporting, establishes a single set of global metrics and allows everyone in the organization to “sing from the same hymn sheet”, all of which helps direct informed decisions.

So if Data Warehouses have proven such an excellent platform for generating information, why are alternatives needed? Well, by design only a subset of the attributes are examined, so only pre-determined questions can be answered (Dixon, 2010). The data is also aggregated, so visibility into the lowest levels is lost (Dixon, 2010). The final major factor is that some of the most vital sources of information are no longer simply numerical in nature or generated by an organization’s internal transactional systems. So what has changed?

 

The Digital Universe

The data landscape has changed drastically in just a few short years. Like the physical universe, the digital universe is large and growing fast. It was estimated that by 2020 there would be nearly as many digital bits as there are stars in the observable universe (Turner, 2014). That estimate is somewhere in the region of 44 zettabytes, or 44 trillion gigabytes (Turner, 2014). Even though this quantity of data is already beyond human comprehension the rate of growth is probably the more impressive feat. For context there is over 10 times more data now than there was in 2013 when the digital universe was an estimated 4.4 zettabytes (Turner, 2014). The data we create and copy annually is estimated to reach 175 zettabytes by 2025 (Coughlin, 2018).

Where is all this data coming from?

The short answer is: predominantly us, and the systems that service our needs. In the not too distant past the only entities with computers generating and storing data were businesses, governments and other institutions. Now everyone has a computer of some description, and with the advent of social media mass consumers became mass creators. When you stop to think of how many interactions a person has with electronic devices every day, directly or indirectly, you soon get a picture of how much data is actually being generated.

As an example of this endless generation of data the following are average social media usage stats over the course of one minute from 2018 (Marr, 2018):

  • Twitter users sent 473,400 tweets
  • Snapchat users shared 2 million photos
  • Instagram users posted 49,380 pictures
  • LinkedIn gained 120 new users

Other extraordinary data stats include (Marr, 2018):

  • Google processes more than 40,000 searches every second or 3.5 billion searches a day.
  • 1.5 billion people are active on Facebook every day. That’s one-fifth of the world’s population.
  • Two-thirds of the world’s population now owns a mobile phone.

Our way of life has become increasingly digitized, with no better example than the effective global lockdown during the 2020 pandemic. Hundreds of millions of employees around the world managed to continue working from home and did so effectively (Earley, 2020). This would have been unimaginable even in the late nineties. And yet, as digitized as our world has become, it is only the start. With emerging technologies such as self-driving cars, IoT smart devices and ever more sophisticated robots entering our homes, the 175 zettabytes of data by 2025 may be a conservative estimate.

With so much of the stuff around you would be forgiven for thinking all of this data is just a by-product, but it is anything but. The data generated is an incredibly valuable asset if it can be analyzed properly and transformed into business-relevant information.

What types of data are there?

The state of data within the digital universe can be summarized as structured, semi-structured and unstructured (Hammer, 2018).

The following is a non-exhaustive list of data types (Hammer, 2018):

  • CRM
  • POS
  • Financial
  • Loyalty card
  • Incident ticket
  • Email
  • PDF
  • Spreadsheet
  • Word processing
  • GPS
  • Log
  • Images
  • Social media
  • XML/JSON
  • Click stream
  • Forums
  • Blogs
  • Web content
  • RSS feed
  • Audio
  • Transcripts

Only the structured data types in the list above (e.g. CRM, POS and financial data) are suitable for aggregation (Hammer, 2018). The rest are typical of what now makes up a large proportion of the digital universe, and despite their value as data assets they are not suitable for storage or analysis within a Data Warehouse. This is because data needs to meet the predefined structure of a Data Warehouse in order to be accepted, and aggregating these raw unstructured files, e.g. video and audio files, is not possible. So how are these types of valuable data turned into actionable information?

What is a Data Lake?

Data Warehouses have been utilized by data specialists for decades, but the concept of a Data Lake is much more contemporary and much better suited to the storage, analysis and analytics of the semi-structured and unstructured data listed above. By design, storing these kinds of data in a Data Lake does not require the files to be transformed, as each file is kept in its raw state; files can simply be copied from one file structure to another. Data Lakes also allow for working off the files directly, which means the data can be used effectively immediately, i.e. as soon as it lands, rather than waiting weeks for the Data Warehouse developers to massage the data into a format the warehouse can accept, if that is even possible (Hammer, 2018). Working with this type of data has become synonymous with the field of Big Data, which is defined by high velocity, high volume and high variability. As such, the two methodologies are not necessarily in competition with each other; in fact, depending on their definition (Data Lake is somewhat of an overloaded term (Bethke, 2017)), they could be argued to solve different problems and can complement each other when deployed within the same architecture.

There is some contention as to the definition of a Data Lake. Some would argue that the original meaning implied the Lake was solely a raw data reservoir (Bethke, 2017). By this definition the Data Lake is not too dissimilar to a staging area or Operational Data Store (ODS) in a data warehouse, where raw copies of data from source systems are landed (Bethke, 2017). This would coincide with an ELT process as opposed to an ETL process; the transformation and integration of the data happens later downstream, during the population of the data warehouse (Bethke, 2017). This understanding of a Data Lake still persists today in the minds of many data specialists, as can be seen in the overly simplified illustration below.

Figure 3: Overly simplified illustration of a Data Lake architecture  (Hammer, 2018)

(Note: there is no indication of analysis being performed on the lake directly; the lake services the warehouse solely.)

However, this is an inaccurate understanding, as the person credited with coining the term, James Dixon, used the following analogy when he explained a Data Lake:

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” (Dixon, 2010)

By stating “various users of the lake can come to examine, dive in, or take samples” Dixon is clearly implying that a feature of the Data Lake is that it is accessible prior to the data being transformed and made available in a Data Warehouse.

This is where Data Lakes and Data Warehouses take opposing strategies on applying structure to data, which is perhaps why they are often mistaken for competing alternatives. A Data Warehouse requires Schema on Write whereas a Data Lake uses Schema on Read.

With schema on write, all of the relevant data structure needs to be prepared in advance, which means all of the relevant business questions need to be thought of in advance. This rarely results in a situation where all the relevant stakeholders have their needs met, and if they do it will not be for very long. This scenario is workable for an organization looking to aggregate finance data it is very familiar with, but it is especially difficult when dealing with Big Data where the questions are unknown.

With schema on read, the schema is only applied when the data is read, allowing for a schema that adapts to the queries being issued. This means you are not stuck with a predetermined one-size-fits-all schema (Pasqua, 2014). It allows for the storage of unstructured data, and since it is not necessary to define the schema before storing the data, it is easier to bring in new data sources on the fly. The exploding growth of unstructured data and the overhead of ETL for storing data in an RDBMS are offered as leading reasons for the shift to schema on read (Henson, 2016).

When dealing with Big Data the problem of a predefined schema can be so burdensome that it can sink a data project or increase the time-to-value past the point of relevance (Pasqua, 2014). Using a schema on read approach on data as-is means getting value from it right away (Pasqua, 2014). The flexibility of Data Lakes in this regard allows them to surpass Data Warehouses in terms of scalability while making data accessible for analysis sooner.
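To make the contrast tangible, here is a minimal sketch assuming a hypothetical file of JSON click-stream events (illustrative Python only, not a reference to any particular Data Lake product). With schema on read nothing about the structure is declared up front; the fields are only chosen at query time.

import pandas as pd

# Schema on read: the semi-structured file sits untouched in the lake;
# structure is only imposed at the moment of querying
events = pd.read_json('clickstream_events.json', lines=True)

# The "schema" is simply whichever fields this particular query needs
page_views = events[['user_id', 'page', 'timestamp']]
views_per_page = page_views.groupby('page').size()
print(views_per_page)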

 

Data Lakes Scalability

By using schema on read, the constraint on scale is virtually removed. The threat of a bottleneck still exists, but now in the form of the physical hardware available. This is why online cloud offerings such as Amazon S3 and Azure Data Lake from Microsoft have become so popular. Of course, on-site data centers are also an option, with Hadoop being a very popular solution that combines a Data Lake structure with analytical capabilities. This level of scalability also safeguards against Data Silos. A Data Silo is an undesirable situation where only one group or a limited number of people in an organization have access to a source of data that has broader relevance to people across the organization (Plixer, 2018).

Data Lakes are intended, by design and philosophy, to be the antithesis of Data Silos: all of an organization’s data is stored together in one lake. However, centrally storing all data is not without significant security concerns, and losing sight of what customer data is on hand can run afoul of numerous legal requirements such as GDPR.

 

Data Lakes Analysis & Analytics

A defining feature of Big Data analytics is the concept of bringing the analytics to the data rather than the data to the analytics. Traditionally, analytics was carried out by feeding single flat files into an algorithm, with the time taken to prepare these files being significant. Accessing the raw files directly is potentially a failing, though, as it can break the principle of a single source of truth and therefore runs the risk of introducing inconsistencies between reports and other forms of analysis. As you can imagine this is complex and disciplined work, which is why Data Lakes, at this point in their maturity, are best suited to Data Scientists and advanced Data Analysts (Hammer, 2018). However, this goes against the Data Lake ethos of “data for all”, as it only allows the very skilled to have access. It recreates the problem Data Lakes were meant to solve by imposing restrictions, or “data for the select few”. With Data Scientists acting as the gatekeepers, an organization’s stakeholders can lose sight of the useful data available to them. Worse still, valuable data may come from external sources, with stakeholders having no visibility of it prior to it landing in the Data Lake. This may leave stakeholders with no option but to act on an analysis produced by a Data Scientist, with the accuracy of the analysis being a matter of faith because the stakeholder has no data to say otherwise. In comparison, the creation of a Data Warehouse is usually a collaboration between developers and stakeholders familiar with the internal source systems and data. Once a Data Warehouse is created, far less skilled (and cheaper) Data Analysts will have the ability to navigate the internal structure and compile valuable reports.

Despite the obvious concerns, the significance of scalability and direct raw-data analysis cannot be overlooked. The sooner an organization is informed, the sooner it can act; in real-world terms this could save millions of dollars, save thousands of jobs or stop the organization itself from going under. However, the benefits of scalability and earlier data access are not without risks, as poorly managed Data Lakes have the potential to turn into Data Swamps: dumping grounds for data. Though the data may be unstructured, the method in which it is stored must not be, or visibility of what is stored and where it is stored will be lost. Failure to catalogue the data, letting users know what is available while making the attributes of the data known, will overwhelm users and result in garbage results (Hammer, 2018). Successful implementation of a Data Lake is complex and requires ongoing commitment to maintain, but for a large organization that needs to make better use of the wider range of data available in the digital universe a Data Lake is a necessity.

 

Conclusion

A Data Lake is not a replacement for a Data Warehouse. Data Lakes are better equipped to solve the different problems associated with semi-structured and unstructured data. Their flexibility in this regard allows them to surpass Data Warehouses in terms of scalability while making data accessible for analysis sooner. However, Data Lakes are not without their drawbacks. They require highly skilled and expensive staff to develop and maintain. They run a greater risk of failing spectacularly by devolving into a Data Swamp, and could become a serious liability from a regulatory standpoint if that were to happen. Organisations can also be left at the mercy of Data Scientists, and how accurately they analyze data and produce reports, as stakeholders may not have the expertise to retrieve data from the Data Lake themselves.

Thankfully, Data Warehouses are still perfectly suited to dealing with numeric data, and organizations that still predominantly use their own internal transactional systems to create actionable information have no immediate need to adopt any alternatives.

 

References:

Bethke, U. (2017) ‘Are Data Lakes Fake News?’, Sonra, 8 August. Available at: http://www.kdnuggets.com/2017/09/data-lakes-fake-news.html (Accessed: 4 July 2020).

BIDataPro (2018) ‘What is Fact Table in Data Warehouse’, BIDataPro, 23 April. Available at: https://bidatapro.net/2018/04/23/what-is-fact-table-in-data-warehouse/ (Accessed: 4 July 2020).

Coughlin, T. (2018) 175 Zettabytes By 2025, Forbes. Available at: https://www.forbes.com/sites/tomcoughlin/2018/11/27/175-zettabytes-by-2025/ (Accessed: 4 July 2020).

Databricks (2019) ‘Unified Data Warehouse’, Databricks, 8 February. Available at: https://databricks.com/glossary/unified-data-warehouse (Accessed: 4 July 2020).

Dixon, J. (2010) ‘Pentaho, Hadoop, and Data Lakes’, James Dixon’s Blog, 14 October. Available at: https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/ (Accessed: 4 July 2020).

Earley, K. (2020) Google and Facebook extend work-from-home policies to 2021, Silicon Republic. Available at: https://www.siliconrepublic.com/companies/google-facebook-remote-work-until-2021 (Accessed: 5 July 2020).

Hammer, D. (2018) What is a data lake? – The Hammer | The Hammer. Available at: https://www.sqlhammer.com/what-is-a-data-lake/ (Accessed: 4 July 2020).

Henson, T. (2016) ‘Schema On Read vs. Schema On Write Explained’, Thomas Henson, 14 November. Available at: https://www.thomashenson.com/schema-read-vs-schema-write-explained/ (Accessed: 6 July 2020).

Marr, B. (2018) How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read, Forbes. Available at: https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/ (Accessed: 5 July 2020).

Oracle (2019) What Is a Data Warehouse | Oracle Ireland. Available at: https://www.oracle.com/ie/database/what-is-a-data-warehouse/ (Accessed: 5 July 2020).

Pasqua, J. (2014) Schema-on-Read vs Schema-on-Write, MarkLogic. Available at: https://www.marklogic.com/blog/schema-on-read-vs-schema-on-write/ (Accessed: 6 July 2020).

Plixer (2018) What is a Data Silo and Why is It Bad for Your Organization? Available at: https://www.plixer.com/blog/data-silo-what-is-it-why-is-it-bad/ (Accessed: 6 July 2020).

Turner, V. (2014) The Digital Universe of Opportunities. Available at: https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm (Accessed: 5 July 2020).

How to create a C# console application that will solve crosswords

This tutorial will cover the following tasks in C#:

  • How to count the characters in a string
  • How to assign a file’s directory location to a variable
  • How to create a list variable
  • How to pull/read a CSV file column into a list variable
  • How to clean strings using Regex to remove non-alphanumeric characters as the strings are being read into a list
  • How to remove duplicate word entries from a list
  • How to order a list
  • How to write variables to the console, including a list’s elements

Assumptions:

You already know how to create projects in Visual Studio.

If you do not know how to do this, search online using the following term: "how to create C# console applications in visual studio".

Prerequisites:

First you will need to generate a CSV file with random words using this site:

https://onlinerandomtools.com/generate-random-csv

For the option “how many columns to generate” set the value to 1.

For testing purposes create 1000 rows.

Download the generated CSV file and save it as “words.csv”.

Summary of how the code works:

The code works by reducing the initial list (i.e. the supplied CSV file of random words) down to only words that match the number of characters of the user word, typically referred to as “string length”.

Once that subset of words has been created the code will then compare the user word’s letters against each letter, referencing the relative position, in each word in the subset.

Note: there is still significant room for optimization but the code is functional and works well as an accessible, human readable tutorial.

Use case example:

If the user enters the word “apple” the dictionary subset will be reduced down to 5 letter words only. These five letter words are then compared to the user word, each word and letter at a time. So if the first word in the list was “cabin” the comparison would jump to the next word in the list as the “a” in “apple” does not match the “c” in “cabin”. If the next word in the dictionary was “acorn” the first letters would match but the comparison would jump to the next word when the “c” and “p” did not match.

Instructions:

Create a C# console application called CrosswordSolver in Visual Studio.

Move the CSV file called “Words” into the bin directory of the project folder, i.e. CrosswordSolver\CrosswordSolver\bin

Open the project CrosswordSolver and paste the C# code below into the default window replacing the default cs page code.

The hardcoded example of a user word is:

string userWord = "a****";

The user can use * to represent characters unknown, for example ap*le.

Note: The CSV file you randomly generated may have no examples of 5-letter words beginning with the letter “a”, so experiment with other characters.

You can test the letter comparison functionality by uncommenting the two sections of code immediately following the comments “Test letter comparison”.

To test your CSV file has been read into memory you can uncomment the section of code immediately following “Test that dictionary has been read into memory”.

The C# code:

using System;
using System.IO;
using System.Linq;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace CrosswordSolver
{
    class Program
    {
        static void Main(string[] args)
        {
            int c = 0;
            //User input
            //NOTE: Use * to represent characters unknown 
            string userWord = "a****";
            int wordLength = userWord.Length;

            //Assign directory location of the csv file containing the collection of words to a variable
            string projectFolder = Directory.GetParent(Directory.GetCurrentDirectory()).Parent.FullName;
            string file = Path.Combine(projectFolder, "words.csv");

            //Display dictionary location in console
            Console.WriteLine("Dictionary location: " + file);

            var dictionary = new List<string>();
            using (var rd = new StreamReader(file))

            //Pull file column into dictionary list without cleaning
            //{
            //    while (!rd.EndOfStream)
            //    {
            //        var splits = rd.ReadLine().Split(',');
            //        dictionary.Add(splits[0]);
            //    }
            //}

            //Pull file column into dictionary list while cleaning
            {
                while (!rd.EndOfStream)
                {
                    var splits = rd.ReadLine().Split(',');
                    //string clean is done with Regex
                    dictionary.Add(Regex.Replace(splits[0], "[^A-Za-z0-9 ]", ""));
                }
            }

            //Test that dictionary has been read into memory
            //Console.WriteLine("The dictionary contains the following words:");
            //foreach (var element in dictionary)
            //Console.WriteLine(element);

            //Remove duplicate word entries
            //c = dictionary.Count;
            //Console.WriteLine("The dictionary contains " + c + " words");
            dictionary = dictionary.Distinct().ToList();
            //c = dictionary.Count;
            //Console.WriteLine("The dictionary contains " + c + " words");

            // Count the elements in the List and display test parameters
            c = dictionary.Count;
            Console.WriteLine("The dictionary contains " + c + " words");
            Console.WriteLine("User entered the string: " + userWord);
            Console.WriteLine(userWord + " has " + wordLength + " characters");
            userWord = userWord.ToLower();

            //Reduce the dataset size based on number of characters in string
            IEnumerable<string> query =
                dictionary.Where(word => word.Length == wordLength);

            var subSet = new List<string>();
            foreach (var word in query)
                subSet.Add(word);

            //Order List
            subSet = subSet.OrderBy(x => x).ToList();

            c = subSet.Count;
            if (c != 0)
            {
                Console.WriteLine("The dictionary contains " + c + " words that are " + wordLength + " characters in length");

                //Begin character and position match check
                var result = new List<string>();
                foreach (var word in subSet)

                {
                    for (int i = 0; i <= wordLength - 1; i++)
                    {

                        if ((word.ToLower()[i] == userWord[i]) | (userWord[i] == '*'))
                        {

                            //Test letter comparison (Letters match)
                            //Console.WriteLine(
                            //"Letter " + i + ", which is " + "\"" + word[i] + "\"" + ", of the word " + "\"" + word + "\"" +
                            //" matches letter " + i + ", which is " + "\"" + userWord[i] + "\"" + ", of the user input " + "\"" + userWord + "\""
                            //);

                            if (i == wordLength - 1)
                            { result.Add(word); }

                        }
                        else
                        {
                            //Test letter comparison (Letters do not match)
                            //Console.WriteLine(
                            //"Letter " + i + ", which is " + "\"" + word[i] + "\"" + ", of the word " + "\"" + word + "\"" +
                            //" does not match letter " + i + ", which is " + "\"" + userWord[i] + "\"" + ", of the user input " + "\"" + userWord + "\""
                            //);

                            break;
                        }
                    }
                }

                //Test words that do not match
                //foreach (var word in subSetToRemove)
                //Console.WriteLine(word);

                bool isEmpty = !result.Any();
                if (isEmpty)
                {
                    Console.WriteLine("No matches found");
                }
                else
                {
                    c = result.Count();
                    Console.WriteLine("Potential matches found: " + c);
                    foreach (var word in result)
                        Console.WriteLine(word);
                }
            }
            else
            {
                Console.WriteLine("No words of " + wordLength + " characters long found");
            }
            Console.ReadKey();
        }
    }
}

 

If you found this code useful be sure to like the post and comment. ☮

If you would like to know how to create a csv file with C# see this tutorial link.

If you would like to know how to create a console application in Visual Studio that won’t open a command window when it runs see this tutorial link.

 

How to handle a Database creation request

If you are working as a DBA you may find that developers ask you to create a database having given little thought to what the database will be used for, or the impact the database could have on the resources or security of the hosting environment.

If you find yourself in that situation I would suggest you walk the requester through the questionnaire from the previous article “How to determine where a new database should be deployed“.

Once you have completed that process I would then suggest that an official request to deploy a new database be made using the DATABASE REQUEST FORM provided here link. Even if you have a change request process I would still suggest you use this form, as a database-specific request form captures more relevant and vital information.

This is a fairly high level request form with most of the technical details still to be defined by the DBA but it provides documentation of the request and states the requester’s initial expectations and requirements.

Following the database deployment, if the actual footprint of the database does not match up with what was agreed, the form will confirm whether the requested resources were under-specced or misleading.

The form is outlined as below.

(Image: DATABASE REQUEST FORM)

Some important points the form clarifies:

There’s a difference between the requester and the owner. If the database runs into any problems you don’t want to be contacting the temporary intern that requested it instead of, say, the department head.

The application the database supports. Most of the time the database name will have some tie-in to the application name, but maybe it does not. For instance the database could be named something generic like Reporting, which could be the back end for practically anything.

The form asks the requester to prepare a profile for the database. I could have named this section “who is your daddy and what does he do?”. If the requester states they are looking for an OLAP reporting database but operationally it runs as an OLTP transactional database, that could make a big difference in terms of the resources provided for the database and the underlying hardware.

Possibly most importantly, the form helps to establish the likely impact of the new database with the Resource Impact Estimation section. For example, if a requester asks for 10 GB of space and states they expect space usage to increase by 5 GB a year, but the database consumes a terabyte of disk in the first few months, the form will clarify who got their numbers wrong.

The users and groups section will clarify who should have access to the database. Effectively everything related to data and data access should be okayed by a compliance officer to confirm it is GDPR compliant, and this form will assist the compliance officer in establishing that.

The Business Continuity & Upkeep section is really the domain of the DBA, but it helps to get requester input on these matters, for instance when establishing maintenance windows.

If you have any additional questions you feel should be on the form please feel free to contact me and I’ll add them.

How to assess a SQL Server instance for GDPR compliance by writing every table and column to Excel

This post uses the script I had written before here LINK, modified slightly to include a row count for each table. If you are a DBA you’re likely assisting compliance officers to find personally identifiable data within the databases at this time. This script will allow you to provide them with a record of every database, table and column on an entire instance. The compliance officer can then sieve through all the columns and highlight any that look like they contain personal data for further investigation.

Below is a SQL query that will return the following metadata about each table from each database located on a SQL Server instance:

  • ServerName
  • DatabaseName
  • SchemaName
  • TableName
  • CountOfRows
  • ColumnName
  • KeyType

The output is provided in a table format with these additional formatted lines of text which can be used as queries or as part of queries.

  • A Select table query
  • A count table rows query
  • A Select column query
  • Each column bracketed
  • Each table and column bracketed

The table returned by the query can be exported to Excel. Applying Excel’s filter option to the columns of the table makes finding and selecting specific tables and columns very easy.

This process can be repeated for every SQL Server instance used by the business to generate a single master Excel file of mapped servers, allowing the user to quickly find any table or column available to the organization.

Applying some colour coding like below adds to the ease of use.

(Image: Excel file with mapped database server structure)

How to use:

Simply open SQL Server Management Studio and, from Object Explorer, right-click on the server name and select New Query. This will open a query window set to the master database of the server. Copy and paste the SQL below into this window and execute it. When the query is finished you will have created the table above.

/*
SCRIPT UPDATED
20180316
*/

USE [master]
GO

/*DROP TEMP TABLES IF THEY EXIST*/
IF OBJECT_ID('tempdb..#DatabaseList') IS NOT NULL
	DROP TABLE #DatabaseList;

IF OBJECT_ID('tempdb..#TableStructure') IS NOT NULL
	DROP TABLE #TableStructure;

IF OBJECT_ID('tempdb..#ErrorTable') IS NOT NULL
	DROP TABLE #ErrorTable;

IF OBJECT_ID('tempdb..#MappedServer') IS NOT NULL
	DROP TABLE #MappedServer;
	
IF OBJECT_ID('tempdb..#TableCount') IS NOT NULL
	DROP TABLE #TableCount;
	
IF OBJECT_ID('tempdb..#Count') IS NOT NULL
	DROP TABLE #Count;

DECLARE @ServerName AS SYSNAME
DECLARE @Count INT

SET @ServerName = @@SERVERNAME

CREATE TABLE #DatabaseList (
	Id INT NOT NULL IDENTITY(1, 1) PRIMARY KEY
	,ServerName SYSNAME
	,DbName SYSNAME
	);

CREATE TABLE [#TableStructure] (
	[DbName] SYSNAME
	,[SchemaName] SYSNAME
	,[TableName] SYSNAME
	,[ColumnName] SYSNAME
	,[KeyType] CHAR(7)
	) ON [PRIMARY];
	
CREATE TABLE [#TableCount] (
	[Id_TableCount] INT NOT NULL IDENTITY(1, 1) PRIMARY KEY
	,[ServerName] SYSNAME
	,[DatabaseName] SYSNAME
	,[SchemaName] SYSNAME
	,[TableName] SYSNAME
	,[SQLCommand] VARCHAR(MAX)
	,[TableCount] INT
	);

CREATE TABLE #Count (ReturnedCount INT);

/*THE ERROR TABLE WILL STORE THE DYNAMIC SQL THAT DID NOT WORK*/
CREATE TABLE [#ErrorTable] ([SqlCommand] VARCHAR(MAX)) ON [PRIMARY];

/*
A LIST OF DISTINCT DATABASE NAMES IS CREATED
THESE TWO COLUMNS ARE STORED IN THE #DatabaseList TEMP TABLE
THIS TABLE IS USED IN A FOR LOOP TO GET EACH DATABASE NAME
*/
INSERT INTO #DatabaseList (
	ServerName
	,DbName
	)
SELECT @ServerName
	,NAME AS DbName
FROM master.dbo.sysdatabases WITH (NOLOCK)
WHERE NAME <> 'tempdb'
ORDER BY NAME ASC

/*VARIABLES ARE DECLARED FOR USE IN THE FOLLOWING FOR LOOP*/
DECLARE @sqlCommand AS VARCHAR(MAX)
DECLARE @DbName AS SYSNAME
DECLARE @i AS INT
DECLARE @z AS INT

SET @i = 1
SET @z = (
		SELECT COUNT(*) + 1
		FROM #DatabaseList
		)

/*WHILE 1 IS LESS THAN THE NUMBER OF DATABASE NAMES IN #DatabaseList*/
WHILE @i < @z
BEGIN
	/*GET NEW DATABASE NAME*/
	SET @DbName = (
			SELECT [DbName]
			FROM #DatabaseList
			WHERE Id = @i
			)
	/*CREATE DYNAMIC SQL TO GET EACH TABLE NAME AND COLUMN NAME FROM EACH DATABASE*/
	SET @sqlCommand = 'USE [' + @DbName + '];' + '

INSERT INTO [#TableStructure]
SELECT DISTINCT ' + '''' + @DbName + '''' + ' AS DbName
	,SCHEMA_NAME(SCHEMA_ID) AS SchemaName
	,T.NAME AS TableName	
	,C.NAME AS ColumnName
	,CASE 
		WHEN OBJECTPROPERTY(OBJECT_ID(iskcu.CONSTRAINT_NAME), ''IsPrimaryKey'') = 1 
			THEN ''Primary'' 
		WHEN OBJECTPROPERTY(OBJECT_ID(iskcu.CONSTRAINT_NAME), ''IsForeignKey'') = 1 
			THEN ''Foreign''
		ELSE NULL 
		END AS ''KeyType''
FROM SYS.TABLES AS t WITH (NOLOCK)
INNER JOIN SYS.COLUMNS C ON T.OBJECT_ID = C.OBJECT_ID
LEFT JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE AS iskcu WITH (NOLOCK) 
ON SCHEMA_NAME(SCHEMA_ID) = iskcu.TABLE_SCHEMA 
	AND T.NAME = iskcu.TABLE_NAME
	AND C.NAME = iskcu.COLUMN_NAME
ORDER BY SchemaName ASC
	,TableName ASC
	,ColumnName ASC;
';

	/*ERROR HANDLING*/
	BEGIN TRY
		EXEC (@sqlCommand)
	END TRY

	BEGIN CATCH
		INSERT INTO #ErrorTable
		SELECT (@sqlCommand)
	END CATCH

	SET @i = @i + 1
END

/*
JOIN THE TEMP TABLES TOGETHER TO CREATE A MAPPED STRUCTURE OF THE SERVER
ADDITIONAL FIELDS ARE ADDED TO MAKE SELECTING TABLES AND FIELDS EASIER
*/
SELECT DISTINCT @@SERVERNAME AS ServerName
	,DL.DbName
	,TS.SchemaName
	,TS.TableName
	,TS.ColumnName
	,TS.[KeyType]
	,',' + QUOTENAME(TS.ColumnName) AS BracketedColumn
	,',' + QUOTENAME(TS.TableName) + '.' + QUOTENAME(TS.ColumnName) AS BracketedTableAndColumn
	,'SELECT * FROM ' + QUOTENAME(DL.DbName) + '.' + QUOTENAME(TS.SchemaName) + '.' + QUOTENAME(TS.TableName) + ' --WHERE --GROUP BY --HAVING --ORDER BY' AS [SelectTable]
	,'SELECT ' + QUOTENAME(TS.TableName) + '.' + QUOTENAME(TS.ColumnName) + ' FROM ' + QUOTENAME(DL.DbName) + '.' + QUOTENAME(TS.SchemaName) + '.' + QUOTENAME(TS.TableName) + ' --WHERE --GROUP BY --HAVING --ORDER BY' AS [SelectColumn]
	,'SELECT COUNT(*) FROM ' + QUOTENAME(DL.DbName) + '.' + QUOTENAME(TS.SchemaName) + '.' + QUOTENAME(TS.TableName) + ' WITH (NOLOCK)' AS [PerformTableCount]
INTO #MappedServer
FROM [#DatabaseList] AS DL
INNER JOIN [#TableStructure] AS TS ON DL.DbName = TS.DbName
ORDER BY DL.DbName ASC
	,TS.SchemaName ASC
	,TS.TableName ASC
	,TS.ColumnName ASC

/*
HOUSE KEEPING
*/
IF OBJECT_ID('tempdb..#DatabaseList') IS NOT NULL
	DROP TABLE #DatabaseList;

IF OBJECT_ID('tempdb..#TableStructure') IS NOT NULL
	DROP TABLE #TableStructure;
	
INSERT INTO #TableCount (
	[ServerName]
	,[DatabaseName]
	,[SchemaName]
	,[TableName]
	,[SQLCommand]
	)
SELECT DISTINCT [ServerName]
	,[DbName]
	,[SchemaName]
	,[TableName]
	,[PerformTableCount]
FROM #MappedServer
ORDER BY [ServerName] ASC
	,[DbName] ASC
	,[SchemaName] ASC
	,[TableName] ASC

SET @i = 1
SET @z = (
		SELECT COUNT(*) + 1
		FROM #TableCount
		)
	
WHILE @i < @z
BEGIN
	SET @SQLCommand = (
			SELECT SQLCommand
			FROM #TableCount
			WHERE Id_TableCount = @i
			)

	--ERROR HANDLING
	BEGIN TRY
		INSERT INTO #Count
		EXEC (@SqlCommand)
	END TRY

	BEGIN CATCH
		INSERT INTO #ErrorTable
		SELECT (@sqlCommand)
	END CATCH

	SET @Count = (
			SELECT ReturnedCount
			FROM #Count
			)

	TRUNCATE TABLE #Count

	UPDATE #TableCount
	SET TableCount = @Count
	WHERE Id_TableCount = @i;

	SET @i = @i + 1
END

SELECT *
FROM #ErrorTable;

IF OBJECT_ID('tempdb..#ErrorTable') IS NOT NULL
	DROP TABLE #ErrorTable;

/*
THE DATA RETURNED CAN NOW BE EXPORTED TO EXCEL
USING A FILTERED SEARCH WILL NOW MAKE FINDING FIELDS A VERY EASY PROCESS
*/
SELECT DISTINCT ms.ServerName
	,ms.DbName
	,ms.SchemaName
	,ms.TableName
	,ts.TableCount AS CountOfRows
	,ms.ColumnName
	,ms.KeyType
	,ms.BracketedColumn
	,ms.BracketedTableAndColumn
	,ms.SelectColumn
	,ms.SelectTable
FROM #MappedServer AS ms
LEFT JOIN #TableCount AS ts ON ms.ServerName = ts.ServerName
AND ms.DbName = ts.DatabaseName
AND ms.SchemaName = ts.SchemaName
AND ms.TableName = ts.TableName
ORDER BY ms.DbName ASC
	,ms.SchemaName ASC
	,ms.TableName ASC
	,ms.ColumnName ASC;
	
IF OBJECT_ID('tempdb..#MappedServer') IS NOT NULL
	DROP TABLE #MappedServer;

IF OBJECT_ID('tempdb..#TableCount') IS NOT NULL
	DROP TABLE #TableCount;
	
IF OBJECT_ID('tempdb..#Count') IS NOT NULL
	DROP TABLE #Count;

 

How to determine where a new database should be deployed

Below I’ve listed 22 questions to consider when deciding where a new SQL Server database should be deployed given the four main options of:

  1. Add Database to existing Instance
  2. Create separate Instance for Database
  3. Create separate VM for Instance and Database
  4. New physical machine needed

There’s no hard and fast rule for determining where a database should go; however, there are often a lot of limitations dictating your choices. I’ve started the list with a big limitation: money.

If you want to use the list like a score card I’ve prepared one called the “New Database Hosting Environment Assessment Form”, which I know is not the snappiest of titles.

New Database Hosting Environment Assessment Form

Use: I’d recommend stepping through the questions with the requester while gathering feedback. You can then fill in the option scoring with weighted figures or simply use the boxes to record notes. The main deliverable of the form is you will have some documentation that demonstrates the location for the database was assessed and the logic and reasoning for the location chosen has been recorded.

Of course, consider adding additional questions relevant to your case to the form. If you think your additional questions would be relevant to everyone deploying a new database, feel free to forward them on to me and I’ll include them in the form.

The form can be downloaded here link.

Question 1: Can the business afford a new physical machine?

Considerations: If the financial resources are not available option four
can be eliminated.

Question 2: Are there OS or SQL Server Licensing restrictions?

Considerations: If there are no restrictions, a new VM with its own instance will likely offer the most scalable solution and produce the best performance. If restrictions exist, options are limited to 1 & 2.

Question 3: Is the Database the back end for a stand-alone application?

Considerations: The back end for a stand-alone application should be isolated as much as possible from other impacting factors. The higher the number of the option chosen the better.

Question 4: What is the primary purpose of the Database?

Considerations: What is the business use, and in what capacity will the database function, i.e. transactional Db, reporting Db, datastore?

Question 5: Do you want/need to be able to limit access to hardware resources?

Considerations: Access to resources, RAM etc. can only be limited at the instance level, so option 1 cannot be used.

Question 6: Are there any SQL Server Service Pack dependencies?

Considerations: SQL Server can have different SPs on different instances but cannot have different SPs for different databases within an instance.

Question 7: What is the current excess capacity of the hardware?

Considerations: Can the hardware handle the additional workload? If not, either reduce the resource consumption on the hardware, add RAM and/or cores, or choose option 4.

Question 8: Is there a VM capacity restraint on the hardware?

Considerations: Can the hardware actually host a new VM without a trade-off in resources or a decrease in performance? If constrained, option 3 cannot be used.

Question 9: What is the VM capacity?

Considerations: Is the OS already maxed out on resources? Can the VM handle the additional workload?

Question 10: Is there an expected increase in size and usage in the Database over time?

Considerations: If known and minimal, option 1 can be used. If unknown or unlimited, the higher the number of the option chosen the better.

Question 11: Is the resource usage of the new Database known?

Considerations: Benchmarking RAM, CPU and bandwidth usage should be carried out prior to installation.

Question 12: What are the disaster recovery requirements?

Considerations: Should the databases that share common dependencies be hosted on
the same server?

Question 13: What is the required operational up time? 24/7 etc.

Considerations: Does this operational run time tie in with the rest of the databases
on the instance, or the instances hosted on a VM or physical server?

Question 14: What are the Maintenance requirements?

Considerations: Will index rebuilds etc. take a long time and affect the overall schedule of the instance?

Question 15: What are the Backup requirements?

Considerations: Will the backup schedule of the other databases be impacted
by the addition?

Question 16: Is the Database functionally similar to, or supporting, databases currently on the instance?

Considerations: Does the new Database logically fit in with the database/s currently running on an instance?

Question 17: Have server security concerns been addressed?

Considerations: Will people with administrative access to the server have access to a Database and data they should not be able to see?

Question 18: Does hosting particular databases together compound the potential damage if an unauthorised person was to gain access to the server?

Considerations: Will an unauthorised person have access to data that represents a major security breach if the data is combined from different databases?

Question 19: Does a vendor have control over the instance or server?

Considerations: Will putting a Database on a particular instance leave you at the mercy of a vendor?

Question 20: Will stacking instances make tracking down performance issues across
the VM difficult?

Considerations: Will this create a server level noisy neighbour problem?

Question 21: Will packing databases too densely make tracking down performance issues across the Instance difficult?

Considerations: Will this create an instance level noisy neighbour problem?

Question 22: Will moving the Database to be hosted somewhere else in the
future be possible?

Considerations: Does this decision need to be gotten right the first time?

How to pass arguments from command line to a console application written in C#

This is a simple tutorial on passing arguments or parameter values from command line to a console application written in C#. Using the example below you should be able to edit and expand on the logic to fit your own needs.

First you’ll need to create a new Visual Studio C# console application, to do so follow these steps:

To create and run a console application

  1. Start Visual Studio.

  2. On the menu bar, choose File > New > Project.
  3. Expand Installed, expand Templates, expand Visual C#, and then choose Console Application.
  4. In the Name box, specify a name for your project, and then choose the OK button.
  5. If Program.cs isn’t open in the Code Editor, open the shortcut menu for Program.cs in Solution Explorer, and then choose View Code.
  6. Replace the contents of Program.cs with the following code.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace TestArgsInput
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length == 0)
            {
				// Display message to user to provide parameters.
                System.Console.WriteLine("Please enter parameter values.");
                Console.Read();
            }
            else
            {
                // Loop through array to list args parameters.
                for (int i = 0; i < args.Length; i++)
                {
                    Console.Write(args[i] + Environment.NewLine);
                    
                }
                // Keep the console window open after the program has run.
                Console.Read();
            }
        }
    }
}

 

The Main method is the entry point of a C# application. When the application is started, the Main method is the first method that is invoked.

The parameter of the Main method is a String array that represents the command-line arguments. Usually you determine whether arguments exist by testing the Length property as in the example above.

When run, the example above will list the parameters you have provided in the command window. The delimiter separating command-line arguments or parameter values is a single space. For example, the following would be interpreted as two arguments or parameter values:

“This is parameter 1” “This is parameter 2”

If the arguments were not enclosed in double quotes, each word would be considered a separate argument.

To pass arguments to the console application when testing the application logic the arguments can be written into the debug section of the project properties as shown below.

(Image: command line arguments entered in the debug section of the TestArgs project properties)

So if the app is run with the command-line arguments provided in the image above, the command window will list:

Parameter 1
Parameter 2

If you would like to know how to create a console application in Visual Studio that won’t open a command window when it runs see this tutorial link.

If you would like to know how to create a csv file with C# see this tutorial link.