Data cataloging and data visualization: two requirements in today’s organizations

With the proliferation of data, cataloging becomes a necessity for every large company. Once that system is in place, using data visualization techniques that tell a story can pay big dividends. Here Priya Iragavarapu of AArete discusses modern data catalog systems and three factors in designing good data visualization.

Data cataloging: no longer just a “nice to have”

As the volume and variety of data grow exponentially, so does the importance of data catalogs and data visualization. The uncontrolled growth of data with evolving attributes poses a major challenge: it makes managing metadata increasingly difficult.

Enterprise data management is particularly affected by this data glut. With complicated nested data attributes, it is difficult for stakeholders to take a snapshot of the data, explore the metadata, and then build a data catalog or business glossary that can serve as a lasting reference.

Cataloging data is therefore not only an unavoidable necessity; it must also be done in real time, with crawlers continuously scanning datasets to identify metadata. These data catalogs have two tasks: documenting the metadata accurately and flagging any anomalous metadata for review.
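The second of those two tasks can be sketched in a few lines. This is a minimal, hypothetical example, not any vendor's implementation: it compares a dataset's current metadata against the last catalogued snapshot and flags discrepancies. The snapshot format (a dict of column names to data types) and the function name are illustrative assumptions.

```python
# Hypothetical sketch of metadata anomaly flagging. Each snapshot is a
# dict of {column_name: data_type}; the format is illustrative.

def flag_metadata_anomalies(previous, current):
    """Return a list of human-readable discrepancies between two
    metadata snapshots."""
    anomalies = []
    for col, dtype in current.items():
        if col not in previous:
            anomalies.append(f"new column: {col} ({dtype})")
        elif previous[col] != dtype:
            anomalies.append(f"type change: {col} {previous[col]} -> {dtype}")
    for col in previous:
        if col not in current:
            anomalies.append(f"dropped column: {col}")
    return anomalies

previous = {"customer_id": "int", "amount": "float"}
current = {"customer_id": "str", "amount": "float", "channel": "str"}
print(flag_metadata_anomalies(previous, current))
```

A real catalog would persist the snapshots and run this comparison on every crawl, routing the flagged discrepancies to data stewards for review.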

Another reason why there is an increased need for data catalogs is the prevalence of hybrid, cross-collaborative teams with dotted line connections within matrix organizations. Every team throughout the data lifecycle needs to understand data beyond their direct expertise to effectively perform their role. When data catalog is run this way, the data line can be followed to understand how the data catalog evolves and changes at each step in the data pipeline.

Organizations should look for data catalog solutions with the following key features. First, the solution must be able to automatically crawl data and dynamically detect data attributes, data types, and data profiles. In addition, many leading solutions incorporate user input to create a data dictionary or business glossary. Desirable data cataloging programs can also translate statistics into user-friendly visuals. Finally, a robust data catalog solution should not only display metadata but enable users to take action based on that insight.
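The first feature above, automatic crawling and profiling, can be illustrated with a small sketch. The CSV sample, function names, and the choice of profile statistics (type, null ratio, distinct count) are assumptions for illustration; real products compute far richer profiles.

```python
import csv
import io

def infer_type(value):
    """Classify a raw string value as int, float, or str."""
    for cast, name in ((int, "int"), (float, "float")):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    return "str"

def profile_csv(text):
    """Return {column: {type, null_ratio, distinct}} for CSV text,
    as a simple automatic crawler might."""
    rows = list(csv.DictReader(io.StringIO(text)))
    profile = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        non_null = [v for v in values if v != ""]
        types = {infer_type(v) for v in non_null}
        # Pick the most general type observed in the column.
        if not types:
            col_type = "unknown"
        elif "str" in types:
            col_type = "str"
        elif "float" in types:
            col_type = "float"
        else:
            col_type = "int"
        profile[col] = {
            "type": col_type,
            "null_ratio": round(1 - len(non_null) / len(values), 2),
            "distinct": len(set(non_null)),
        }
    return profile

sample = "customer_id,amount,region\n1,19.99,EU\n2,5.00,\n3,7.25,US\n"
print(profile_csv(sample))
```

The resulting profile is exactly the kind of raw statistic that a good catalog then translates into user-friendly visuals for non-technical stakeholders.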

However, there are trade-offs when comparing newer, comprehensive data catalog capabilities to more traditional approaches. The traditional approach is to build a custom script that crawls data and writes the relevant metadata into a table for further analysis.
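The traditional approach might look something like the following sketch, which records crawled metadata in a SQLite table. The table schema, dataset, and column names are illustrative assumptions; in practice the script would write to the organization's warehouse rather than an in-memory database.

```python
import sqlite3

# Sketch of the traditional approach: a custom script writes crawled
# metadata into a table for later analysis. Schema is illustrative.
conn = sqlite3.connect(":memory:")  # use a real database in practice
conn.execute("""
    CREATE TABLE IF NOT EXISTS catalog (
        dataset TEXT,
        column_name TEXT,
        data_type TEXT,
        crawled_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# Output of a (hypothetical) crawl of the "orders" dataset.
crawled = [
    ("orders", "customer_id", "int"),
    ("orders", "amount", "float"),
    ("orders", "region", "str"),
]
conn.executemany(
    "INSERT INTO catalog (dataset, column_name, data_type) VALUES (?, ?, ?)",
    crawled,
)
conn.commit()

for row in conn.execute(
        "SELECT column_name, data_type FROM catalog WHERE dataset = 'orders'"):
    print(row)
```

Because each crawl is timestamped, the table doubles as a simple history of how the dataset's schema has drifted over time.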

Keeping track of when and how often to run such a script is a largely manual process, and it carries the usual drawbacks of batch processing. More advanced custom solutions use real-time streaming data crawlers that determine the metadata and detect any changes as they happen, which is ideal for low-latency applications. However, these advanced data cataloging solutions raise concerns about resources, computational complexity, and cost.

Complex programs can also pose a security risk. The systems that offer the most automated discovery capabilities raise the most concerns for operational IT professionals, who are asked either to open their firewall so a cloud-based solution can gain access or to install a new system on-premises.

If these concerns prevent an organization from taking the modern approach, many ready-to-use data catalog products are available. How well these integrate depends on the technology stack and legacy systems in place within the organization. Each organization must determine where it fits on the spectrum from building a custom solution to adopting a ready-made product; it all depends on the nature of the data and the needs of the organization.

Data visualization: it should tell a story

Once a data cataloging system has been chosen and implemented, organizations must figure out how to best use that data.

Data visualization technology has advanced significantly over the past decade, spawning advanced software such as Tableau, Power BI, Qlik, Looker, and IBM Cognos. Modern technology companies are eager to incorporate data visualization into their practices, but many struggle with choosing a program that best fits their needs. Here are several aspects that organizations should consider before deciding which data visualization tool to adopt.

Size and source of the data to be visualized

The first consideration is both the size and the source of the data. These qualities influence which software is appropriate and whether two tools should be stitched together to properly serve the organization’s data visualization needs. For example, suppose a company stores its data in cold storage such as Amazon S3 and connects that storage directly to Tableau. Even if Tableau provides that connector, the performance of the visualization job will suffer: Tableau is a remarkable visualization tool, but placing the responsibility of querying on Tableau hurts performance and latency. In this case, Qlik is a much better fit because its built-in query engine efficiently executes queries against large datasets in cold storage. Again, this is not a criticism of Tableau; it simply means that users must assess the strengths and weaknesses of visualization tools and align them with their organization’s objectives.

Technology stack of the organization

Another factor is the organization’s technology stack, which should be thought through before committing to a single data visualization tool. An organization may already be invested in Azure, the IBM ecosystem, or another stack. If a company uses the IBM ecosystem, it makes sense to use IBM Cognos; if it uses Azure, Power BI is the natural choice. Mixing tools mainly makes sense when there is no unified, one-stop-shop technology stack, and most tools are built with connectors so they can be mixed and matched.

The degree of data pre-processing required

The last factor to consider is data pre-processing. Ideally, visualization tools should query data directly and be able to filter, sort, and aggregate it within the tool. When pre-processing is complicated, it places additional responsibility on the visualization program, which hurts performance. Heavy pre-processing and data engineering work should therefore be handled outside the tool. Many visualization tools have companion pre-processing products; Tableau, for example, pairs with Tableau Prep. By carefully considering the extent of data preparation required, users can anticipate how the visualization will perform and how quickly the data can be rendered.
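The benefit of pre-processing outside the tool can be shown with a small sketch: aggregating raw records before handing them to the visualization layer, so the tool renders a handful of summary rows instead of scanning every raw row. The sample records and field names are illustrative assumptions.

```python
from collections import defaultdict

# Illustrative sketch: pre-aggregate raw transaction records outside
# the visualization tool. Sample data is made up for the example.
raw = [
    {"region": "EU", "amount": 19.99},
    {"region": "US", "amount": 7.25},
    {"region": "EU", "amount": 5.00},
]

totals = defaultdict(float)
for record in raw:
    totals[record["region"]] += record["amount"]

# The visualization tool now receives a few summary rows rather than
# the full raw dataset.
summary = sorted((region, round(total, 2)) for region, total in totals.items())
print(summary)  # [('EU', 24.99), ('US', 7.25)]
```

In production this aggregation step would live in a pipeline tool or SQL job; the point is that the visualization layer only ever sees the compact result.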

Aside from the above considerations, organizations choosing data visualization initiatives should recognize that the choices for color, graph type, and visualization type determine the impact data visualization will have on their business. The most effective data visualization solutions combine art with science.

Most importantly, powerful data visualization software doesn’t just spew out scatter plots, heat maps, pie charts, or bar charts—it tells a story. Industry leaders rely on these tools because they can carve narrative arcs without sacrificing the ability to experiment with different approaches. As data visualization technology advances, these trends will become increasingly apparent, leading companies to use visualization tools to efficiently develop data products that are increasingly in line with consumer demand.

With the proliferation of data comes potential benefits and a serious responsibility for organizations. To be most effective, they need systems that understand what they have, make sure the data is up-to-date and retrievable, and turn the data into visualizations that help tell a story. Many tools exist that, when used wisely, can help organizations achieve all of these goals. They need to know what to use and how to use it.

How up-to-date are your organization’s data visualization and data catalogs? Let us know on Facebook, Twitter, and LinkedIn.

