Microsoft SQL Server 2012 Integration Services Overview
- Common Usage Scenarios for SSIS
- Evolution of SSIS
- Setting Up SSIS
- Summary
Enterprises depend on data integration to turn data into valuable insights and decisions. Enterprise data integration is a complicated problem due to the heterogeneity of data sources and formats, ever-increasing data volumes, and the poor quality of data. Data is typically stored in disparate systems, and the resulting differences in data format or schema must be resolved. The constantly decreasing cost of storage leads to increased data retention and a concomitant increase in the volume of data that needs to be processed. In turn, this results in an ever-increasing demand for scalable and high-performance data integration solutions so organizations can obtain timely insights from the collected data. The diversity of data and inconsistent duplication cause quality problems that can impact the accuracy of analytical insights and thus the quality and value of the decisions based on them. Data integration projects need to deal with these challenges and effectively consume data from a variety of sources (databases, spreadsheets, files, and so on), which requires that they clean, correlate, transform, and move the source data to the destination systems. This process is further complicated because many organizations have round-the-clock dependencies on data stores; therefore, data integration must often be frequent and integration operations must be completed as quickly as possible.
Microsoft SQL Server Integration Services (SSIS) technology addresses these challenges and provides a platform for building and managing data integration solutions. The flexible, extensible, scalable, high-performance platform and toolset in SSIS caters to enterprise demands for traditional data extract-transform-load (ETL) processes as well as other data integration needs. SSIS is a feature of Microsoft SQL Server that provides seamless interoperability with other features shipped with SQL Server and with other Microsoft products. Typical data integration scenarios that you can address using SSIS include the following:
Consolidating data from heterogeneous data sources
Moving data between systems
Loading data warehouses
Cleaning, formatting, or standardizing data
Identifying, capturing, and processing data changes
Coordinating data maintenance, processing, or analysis
Some data processing scenarios require specialized technology. SSIS is not suitable for the following types of data processing:
Processing complex business rules on data
Coordinating, acquiring, and processing data in inter-business processes
Processing event messages in real time
Coordinating data communication between systems
Building a federated view of data sources
Unstructured data processing and integration
Common Usage Scenarios for SSIS
In this section, you’ll examine some common data integration scenarios in detail and get an overview of how key SSIS features help in each of those scenarios.
Consolidation of Data from Heterogeneous Data Sources
In an organization, data is typically not contained in one system but spread all over. Different applications might have their own data stores with different schemas. Similarly, different parts of the organization might have their own locally consolidated view of data, or legacy systems might be isolated, making the data available to the rest of the organization only at regular intervals. To make important organization-wide decisions that derive value from all this data, it is necessary to pull data from all parts of the organization, massaging and transforming it into a consistent state and shape.
The need for data consolidation also arises during organization acquisitions or mergers. Supporting connectivity to heterogeneous stores and extracting data is a key feature of any data integration software. SSIS supports connectivity to a wide range of data stores using out-of-the-box adapters and extensibility features. Source adapters read data from external sources into SSIS, whereas destination adapters write data from SSIS to external destinations.
Some of the most important built-in source and destination adapters in SSIS are the following:
OLE DB Source and Destination
ADO.NET Source and Destination
ODBC Source and Destination
Flat File Source and Destination
Excel Source and Destination
XML Source
Other types of SSIS adapters are as follows:
Script Source and Destination: These enable SSIS developers to author code to connect to data stores that are not supported by built-in adapters in SSIS.
Special-purpose adapters: Most of the adapters in SSIS are general purpose, supporting any data store that can be accessed through standard interfaces; however, some of the adapters are specific to a particular data store and depend on a specific application programming interface (API). The SQL Server Destination and Dimension Processing Destination are examples of special-purpose adapters that provide connectivity to SQL Server and Analysis Services, respectively.
Custom adapters: Using the extensibility mechanism in SSIS, customers and independent software vendors (ISVs) can build adapters that can be used to connect to data stores that do not have any built-in support in SSIS.
Source and destination adapters that are not part of the SSIS installation but are available for download from Microsoft.com include the following:
Oracle Source and Destination
Teradata Source and Destination
SAP BI Source and Destination
SSIS adapters maintain connection information to external data stores using connection managers. SSIS connection managers depend on technology-specific data providers or drivers for connecting to data stores. For example, OLE DB adapters use the OLE DB API and data provider to access data stores that support OLE DB. SSIS connectivity adapters are used within a Data Flow task, which is powered by a data pipeline engine that facilitates high-performance data movement and transformation between sources and destinations. Figure 1-1 illustrates the flow of data from source to destination through data providers or drivers.
Figure 1-1 Representation of data flow from source to destination.
Integration Services offers several options for connecting to relational databases. The OLE DB, ADO.NET, and ODBC adapters are built on generic data access APIs that connect to a wide range of databases. The only popular database connectivity option that is not supported in SSIS is Java Database Connectivity (JDBC). SSIS developers are often faced with the challenge of picking an adapter from these choices to connect to a particular data store. The factors that SSIS developers should consider when picking a connectivity option are as follows:
Data type support
Metadata exposed by driver or provider
Driver or provider support in 32-bit and 64-bit environments
Performance
Data Type Support
Data type support in relational databases beyond the standard ANSI SQL data types differs; each has its own type system. Data types supported by data providers and drivers provide a layer of abstraction for the type systems in data stores. Data integration tools need to ensure that they don’t lose type information when reading, processing, or writing data. SSIS has its own data type system. Adapters in SSIS map external data types exposed by data providers to SSIS data types, and maintain data type fidelity during interactions with external stores. The SSIS data type system ameliorates problems when dealing with data type differences among storage systems and providers, providing a consistent basis for data processing. SSIS implicitly converts data to the equivalent types in its own data type system when reading or writing data. When that is not possible, it might be necessary to explicitly convert data to binary or string types to avoid data loss.
Metadata Exposed by Provider
SQL Server Data Tools provides the development environment in which you build SSIS packages, which are the executable units in SSIS. The design experience in SQL Server Data Tools depends on the metadata exposed by data stores through drivers or providers to guide SSIS developers in setting package properties. Such metadata is used to get a list of databases, tables, and views, and the metadata of columns in tables or views, during package construction. If a data store does not expose a particular piece of metadata, or if the driver does not implement an interface to retrieve it, the SSIS package development experience suffers. Manually setting the relevant properties in SSIS packages can help in those instances.
Support in 32-Bit and 64-Bit Environments
You can execute SSIS packages in either 32-bit or 64-bit mode. If the process executing the package is a 32-bit application, SSIS uses the 32-bit data provider.
The 32-bit and 64-bit versions of a data provider usually have the same identifier. Once referenced by that identifier, the version of the provider loaded at execution time depends on the bit mode of the process that loads it; thus, the data provider available to an SSIS package depends on the bit mode under which the package executes. For example, package execution inside SQL Server Data Tools is 32-bit by default, so the 32-bit provider is used during execution in SQL Server Data Tools. Packages that execute successfully in 32-bit mode do not necessarily execute successfully in 64-bit mode (and vice versa), because a data provider or driver might not be available in both modes: if the 64-bit driver is not available on the executing machine, 64-bit execution fails, and vice versa. SSIS package developers and administrators have to keep this in mind during package development and execution.
Performance
Several factors impact the performance of data integration operations. One of the main factors is adapter performance, which is directly related to the performance of the low-level data providers or drivers used by the adapters. Although there are general recommendations (see Table 1-1) for which adapter to use with each popular database, there is no guarantee that you will get the best performance from the recommended adapters. Adapter performance depends on several factors, such as the driver or data provider involved and the bit mode of the drivers. We recommend that SSIS developers compare the performance of different connectivity options before determining which one to use in the production environment.
Table 1-1 Recommended adapters for some popular data stores
| Database | Recommended adapters |
| --- | --- |
| SQL Server | OLE DB Source and Destination |
| Oracle | Oracle Source and Destination |
| Teradata | Teradata Source and Destination |
| DB2 | OLE DB Source and Destination |
| MySQL | ODBC Source and Destination |
| SAP BW | SAP BI Source and Destination |
| SAP R/3 | ADO.NET Source and Destination |
Movement of Data Between Systems
The data integration scenario in this section covers moving data between data storage systems. Data movement can be a one-time operation during system or application migration, or it can be a recurring process that periodically moves data from one data store to another. An example of one-time movement is data migration before discontinuing an old system. Copying incremental data from a legacy system to a newer data store at regular intervals, so that the new system remains a superset of the older one, is an example of recurring data movement. These types of transfers usually involve data transformation so that the moved data conforms to the schema of the destination system. The source and destination adapters in SSIS discussed earlier in this chapter can help with connecting to the old and new systems.
You use transform components in SSIS to perform operations such as conversion, grouping, merging, sampling, sorting, distribution, or other common operations on the data that is extracted into the SSIS data pipeline. These transform components take data flow pipeline data as input, process it, and add the output back to the pipeline; the output can have the same shape as the input or a different one. Transform components can operate on data row by row, on a subset of rows, or on the entire data set at once. All transformations in SSIS execute in memory, which helps with high-performance data processing and transformation. Each data transformation operation is defined on one or more columns of data in the data flow pipeline. To perform operations not supported out of the box, SSIS developers can use scripts or build custom transformations. Built-in SSIS transforms that support some of the most common data operations are listed next (a few relational analogs follow the list):
Aggregate Applies aggregate functions, such as Average, Count, or Group By, to column values and copies the results to the transformation output.
Conditional Split Routes data rows to different outputs depending on the data content.
Multicast Distributes every row from the input to one or more outputs for branched processing.
Lookup Performs lookups by joining data in input columns with columns in a reference dataset.
Merge Combines two sorted datasets into a single dataset.
Sort Sorts input data in ascending or descending order.
Union All Combines multiple inputs into one output.
Data Conversion Converts the data in an input column to a different data type.
Derived Column Creates new column values by applying expressions to input columns.
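For readers who think in SQL, several of these transforms have direct relational analogs. The following T-SQL sketches are purely illustrative (the tables are hypothetical) and show the kind of operation each transform performs on pipeline data:

-- Aggregate transform: analogous to GROUP BY with aggregate functions.
SELECT Region, COUNT(*) AS OrderCount, AVG(Amount) AS AvgAmount
FROM dbo.Orders
GROUP BY Region;

-- Sort transform: analogous to ORDER BY.
SELECT OrderID, OrderDate, Amount
FROM dbo.Orders
ORDER BY OrderDate DESC;

-- Union All transform: analogous to UNION ALL over multiple inputs.
SELECT OrderID, Amount FROM dbo.OrdersEast
UNION ALL
SELECT OrderID, Amount FROM dbo.OrdersWest;

The difference is that SSIS performs these operations in memory on pipeline buffers, without requiring the data to reside in a relational database.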
One-time data migrations can range from simple data movement with no transformations to extremely complex movement with more than one source and heavyweight data transformation logic. Packages for complex data movement can evolve from those created for simple one-time movement, or SSIS developers can build them from scratch using SQL Server Data Tools. An employee getting data from a database table and importing it into Excel for further data analysis and processing is an example of simple one-time data movement. Such users do not necessarily have in-depth knowledge of ETL concepts or SSIS features. The Import and Export Wizard in SSIS helps such users build simple data movement solutions. This wizard constructs and uses an SSIS package behind the scenes, hiding the complexity involved in building a package. The packages created by the wizard contain source and destination adapters for the data stores participating in the movement. Figure 1-2 shows a step in the wizard for selecting tables at the source to copy to the destination. After the wizard has created a package, you can save and edit it later using the SQL Server Data Tools environment (discussed in detail later in this chapter). This capability comes in handy for SSIS developers who might have to update packages created by information workers, adding more complex transformations before making those packages available for use by the IT department. Data sources and destinations supported by the Import and Export Wizard include the following:
Relational databases that support a .NET Framework Provider or OLE DB Provider
Microsoft Office files: Access and Excel
Delimited data files in plain text
You can enable simple transformation capabilities in wizard-created packages to carry out data type mapping between a source and a destination. To avoid complexity when dealing with data types, the wizard automatically maps data types of each column selected for data movement at the source to the types of destination columns, using data type mapping files that are part of the SSIS installation for this purpose. SSIS provides default mapping files in XML format for commonly used source and destination combinations. For example, the wizard uses a mapping file called DB2ToMSSql10.xml when moving data from DB2 to SQL Server 2008 or a newer version. This file maps each data type in DB2 to the corresponding types in SQL Server 2008 or later. Listing 1-1 shows a portion of this file that maps between the Timestamp data type in DB2 and the SQL Server datetime2 type.
Listing 1-1 Data type mapping in DB2ToMSSql10.xml
<?xml version="1.0" encoding="utf-8" ?>
<dtm:DataTypeMappings xmlns:dtm="http://www.microsoft.com/SqlServer/Dts/DataTypeMapping.xsd"
                      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                      SourceType="DB2OLEDB;Microsoft.HostIntegration.MsDb2Client.MsDb2Connection"
                      MinSourceVersion="*" MaxSourceVersion="*"
                      DestinationType="SQLOLEDB;SQLNCLI*;System.Data.SqlClient.SqlConnection"
                      MinDestinationVersion="10.*" MaxDestinationVersion="*">
  ...
  <!-- TIMESTAMP 10.* -->
  <dtm:DataTypeMapping>
    <dtm:SourceDataType>
      <dtm:DataTypeName>TIMESTAMP</dtm:DataTypeName>
    </dtm:SourceDataType>
    <dtm:DestinationDataType>
      <dtm:NumericType>
        <dtm:DataTypeName>datetime2</dtm:DataTypeName>
        <dtm:SkipPrecision/>
        <dtm:UseSourceScale/>
      </dtm:NumericType>
    </dtm:DestinationDataType>
  </dtm:DataTypeMapping>
  ...
</dtm:DataTypeMappings>
Different usage scenarios for the Import and Export Wizard call for different ways to launch it. For example, a SQL Server database developer or administrator who wants to import data from Microsoft Office Excel into a database could invoke the wizard from SQL Server Management Studio in the context of the destination database. This option allows you to save the package constructed by the wizard and then either execute it later or run it right away.
Another way to invoke the wizard is from SQL Server Data Tools. SSIS beginners who want to start with some basic data-moving package constructs can invoke the wizard from Solution Explorer in SQL Server Data Tools and add a package to the current solution. After adding the package, they can edit it for improvements and then save it just like any other package in the SQL Server Data Tools solution.
One-time migration scenarios often involve copying data objects and data from one instance of SQL Server to another. SSIS supports this scenario with a few tasks that you can use to transfer databases, logins, objects, stored procedures in the master database, user-defined error messages, or SQL Agent jobs between two SQL Server instances. All these transfer tasks use the SQL Management Object (SMO) connection manager to connect to the SQL Server instances involved in the transfer.
Figure 1-2 Make your selections in the Import and Export wizard.
You perform recurring data movements by using a SQL Server Agent job to schedule the execution of the SSIS package that performs the operation.
Loading a Data Warehouse
SSIS is a quintessential ETL tool and platform for traditional data warehouse (DW) loading. Loading DWs is the most popular use for SSIS in enterprises. In data warehousing, data is moved from operational data stores to a central location that’s optimized for analysis and reporting. This type of loading can be done either incrementally or as a bulk refresh and it typically involves the following operations:
Extract all or changed data from multiple sources
Transform extracted data before loading at the destination
Load dimension and fact tables at the destination
Reference data lookup
Key generation
Manage historical changes
SSIS can be used effectively to implement all of these operations. Extracting data from multiple sources, transforming data, and loading tables are discussed briefly in the scenarios described earlier in this chapter. Getting changed data is discussed in the next section. This section examines other DW loading operations.
Reference data lookup involves getting data from an external reference dataset. For example, given a customer identifier, a DW loading process might need to retrieve additional data about the customer, such as the customer's zip code, from a customer relationship management (CRM) server that stores all customer-related information. In this case, the external dataset on the CRM server is used as the reference to add more data to the ETL pipeline. SSIS supports this step in DW processing with the Lookup transform. The reference dataset can be an existing table or view in a relational database, the result of a SQL query, or a Lookup Cache file. SSIS needs the reference data in memory for the lookup operation; for large reference sets, the data can be preloaded into a special type of file called a Lookup Cache for efficiency and high performance. The Lookup transform uses either an OLE DB connection manager or a Cache connection manager to connect to the reference dataset, and it performs lookups by joining data in input columns with columns in the reference dataset. The join can span multiple input columns, but columns of certain data types, such as image or text, cannot be used in the lookup. For supported string data types, the lookup operation is case sensitive, so the Character Map transform has to be used explicitly to convert lookup input to uppercase or lowercase to match the case of the reference data. Inputs not matched in the reference data can be redirected by the Lookup transform. Relational connectivity from the Lookup transform is limited to OLE DB, and the supported data sources are SQL Server, Oracle, and DB2. If the reference data source does not support OLE DB connectivity, a Data Flow task using any supported source adapter and the Cache transform can create a cache file for lookup operations in another Data Flow task.
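For example, when the reference dataset is defined by a SQL query, case sensitivity can also be handled on the reference side. A minimal sketch, assuming a hypothetical dbo.Customer table whose EmailAddress column joins to pipeline input that a Character Map transform has converted to uppercase:

-- Reference query for a Lookup transform; UPPER() normalizes the
-- join column so it matches upper-cased pipeline input.
SELECT
    CustomerID,
    UPPER(EmailAddress) AS EmailKey,  -- join column, case-normalized
    PostalCode                        -- value added to the pipeline
FROM dbo.Customer;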
Generating unique keys and using them to substitute for natural primary keys in DW dimension tables is a common pattern. These keys are often referred to as surrogate or artificial keys, and they make dimension updates easier in cases where natural primary keys can change. When the natural key is alphanumeric or composite, an integer surrogate key also gives better performance. Generating surrogate keys during the ETL process is considered better than generating keys at the destination during row inserts (such as with the identity column feature in SQL Server), because referential integrity established with destination-generated keys can break during data movement; surrogate keys assigned during ETL keep the data portable. There is no built-in functionality in SSIS to generate keys, but it is quite easy to do: scripting or custom extensions are popular approaches for building surrogate key generators. Surrogate key generation typically involves getting the maximum value of the surrogate key currently used in the table of interest and using it as a seed to assign a key to each row in the data flow pipeline, with predefined increments. Some SSIS developers prefer to maintain the seed value within their ETL systems to avoid querying the database at the beginning of every ETL process.
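A minimal T-SQL sketch of the seed-and-increment pattern just described, assuming hypothetical dbo.DimCustomer and staging.Customer tables. In an SSIS package the per-row assignment usually lives in a Script component, but the logic is the same:

DECLARE @seed int;

-- Get the highest surrogate key currently in the dimension (0 if empty).
SELECT @seed = ISNULL(MAX(CustomerKey), 0)
FROM dbo.DimCustomer;

-- Assign a key to each incoming row with a predefined increment of 1.
SELECT
    @seed + ROW_NUMBER() OVER (ORDER BY NaturalKey) AS CustomerKey,
    NaturalKey,
    CustomerName
FROM staging.Customer;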
In DWs, dimensional data can change over time. Such dimensions are commonly referred to as slowly changing dimensions (SCDs). SCD processing and managing historical changes is one of the more difficult steps in DW loading operations. There are three common types of SCDs:
Type 1 Old data is overwritten, and historical changes are not preserved. This type is often used with dimensional data that is no longer correct and when the historical values carry no business value.
Type 2 Historical changes are preserved, and a row is added for each new value. This is the most common type of change in DW dimensions. Each row (current and historical) of a dimensional value has its own surrogate key, along with a version number or timestamp that can be used to get the most recent value. (A T-SQL sketch of this pattern follows the list.)
Type 3 Separate columns are used to maintain the current value and its history. You would employ this method when changes happen only rarely, or at predefined intervals, and you need to maintain only a few of the most recent historical values.
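To make the Type 2 pattern concrete, here is a minimal T-SQL sketch of the expire-and-insert handling that a package ultimately performs against the dimension table (the table, columns, and values are hypothetical):

DECLARE @NaturalKey nvarchar(20) = N'CUST-0042',
        @NewCity    nvarchar(50) = N'Seattle';

-- Expire the current row for the changed customer.
UPDATE dbo.DimCustomer
SET    EndDate   = SYSDATETIME(),
       IsCurrent = 0
WHERE  CustomerAlternateKey = @NaturalKey
  AND  IsCurrent = 1;

-- Insert a new current row that carries the changed attribute value.
INSERT INTO dbo.DimCustomer
    (CustomerAlternateKey, City, StartDate, EndDate, IsCurrent)
VALUES
    (@NaturalKey, @NewCity, SYSDATETIME(), NULL, 1);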
SSIS supports this operation using the Slowly Changing Dimension transform, which coordinates the update and insertion of records in DW dimension tables. This transformation supports four types of changes:
Changing attribute Supports Type 1 SCD described earlier.
Historical attribute Supports Type 2 SCD described earlier.
Fixed attribute The column value must not change. Rows that attempt to change such values are redirected for further processing.
Inferred member A placeholder record is created when dimension values are not yet available.
Type 3 changes are not supported by the SCD transform, but they can be handled using a combination of other SSIS components. The Slowly Changing Dimension transformation has one input and up to six outputs. Each output corresponds to the update, insert, or other processing requirement for the record in the dimension table at the destination. During execution, the SCD transform uses its connection manager to match incoming rows against a lookup table. After a match is found, SCD identifies the update type for each row and column, then redirects the row to the appropriate output for handling the change correctly. For example, in Type 1 dimension processing, the SCD transform sends the row to the Changing Attributes Updates output, which is connected to an OLE DB Command transform that updates the record in the dimension table using a SQL UPDATE statement.
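The OLE DB Command transform executes a parameterized statement once per row, with the ? placeholders mapped to pipeline columns in the transform's configuration. A sketch of the kind of Type 1 update it might run (table and columns hypothetical):

-- Per-row Type 1 update issued by the OLE DB Command transform.
-- The first ? is mapped to the incoming City column, the second
-- to the business (alternate) key of the row.
UPDATE dbo.DimCustomer
SET    City = ?
WHERE  CustomerAlternateKey = ?;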
Constructing and configuring steps for SCD processing can get complex during ETL design. SSIS supplies a wizard to help developers go through standard steps in SCD management. The wizard produces transformations for SCD processing in the Data Flow task and works only with dimension tables in SQL Server databases. Figure 1-3 shows the typical SSIS components involved in SCD processing. All the components shown (except the source adapter) are added by the wizard.
Figure 1-3 Slowly changing dimension processing in SSIS.
Cleaning, Formatting, or Standardization of Data
Data in organizations exists in many forms, shapes, and quality levels. Different parts of the organization might use different conventions and formats. During interbusiness transactions, data is acquired from other organizations that might use different data stores and have different data quality standards. Global organizations maintain data in different locales to serve domestic business needs in different geographical regions. Data can also get corrupted during transactions, and such data needs to be isolated during processing.
Data integration processes need to deal with these issues, gathering all the data and ensuring it’s in a consistent state before processing the consolidated data in the integration environment or loading it into destinations. Most data integration tools have capabilities to help deal with dirty data, including incorrect spellings, inaccurate or invalid data, duplicates, or unexpected abbreviations.
SSIS provides data cleansing options suitable for various customer needs, including the Fuzzy Lookup and Fuzzy Grouping transforms, which act as generic data processing operations without requiring an expert collection of domain-specific rules. Fuzzy Lookup matches incoming, potentially low-quality data against a cleaned and standardized reference data set; it outputs the closest match in the reference data and the quality of that match. Fuzzy Grouping identifies groups of rows whose values in a string column probably refer to the same entity, which enables duplicate detection. In SQL Server 2012, SSIS adds the Data Quality Services (DQS) Cleansing transform, which performs data correction and deduplication using knowledge bases built with DQS. At execution time, data is sent to the DQS server, where the cleansing work happens using the knowledge bases referenced in the transform. DQS is not part of SSIS; like SSIS, it is a component of the Microsoft SQL Server product line, and it provides knowledge-driven cleansing functionality.
In addition to the special-purpose cleansing transforms just described, data standardization and formatting can be done in SSIS using the following features:
Character Map transform Applies string functions to character data.
Data Conversion transform Converts the data in an input column to a different data type.
Derived Column transform Creates new column values by applying expressions on input columns.
Data comparison and replacement functions Used in expressions that are computed on input columns.
Cleansing and format manipulation are useful, but in most cases the nature of data has to be well understood before any type of processing happens. SSIS provides a feature called the Data Profiling task that compiles statistics on the data and can be helpful in identifying cleansing needs and minimizing data quality issues. This task is configured to compute one or more profiles. The results, which are given in XML format, can be stored in a file or in an SSIS variable. Profiling results saved to a file can be viewed in the Data Profiler Viewer. It is possible to control the workflow in SSIS packages using the results of the profiling task.
Chapter 9, “Data Profiling and Cleansing Using SSIS,” discusses all these data quality and cleansing capabilities in SSIS in more detail.
Identification, Capture, and Processing of Data Changes
Ever-growing volumes of data in organizations, the need for real-time reports, and shrinking batch windows for data processing have all resulted in demand for change-processing capabilities in data integration tools. Data integration processing is efficient when it can run against an incremental data set instead of all the data available in the data stores involved. Processing data incrementally reduces the run duration of data integration processes, which in turn makes it possible to run those processes more frequently. Using timestamp columns, checksums, or hash-function-based solutions for change detection and capture is a common industry practice. A relatively recent and popular alternative offered by several database vendors is the native ability to identify changed data, and data integration tools can leverage these features to identify and capture changes for incremental data processing. For example, SQL Server provides the Change Data Capture (CDC) and Change Tracking (CT) features, and SSIS provides built-in and custom options for processing changed data that leverage the CDC functionality.
Data integration solution developers can make use of SQL Server CDC infrastructure directly during package design. The CDC feature in SQL Server is powerful, but it is also complex, and involves considerable custom logic. For streamlined CDC processing, SSIS provides the following tasks and components:
CDC Control task
CDC Source
CDC Splitter transform
The CDC Control task is useful for controlling the various stages of change data processing in SSIS packages. It requires one connection manager to the database where changes need to be identified and captured, and another (optional) connection manager to persist the state of the CDC processing operation stored in an SSIS variable. The CDC processing stages managed with this task include marking the beginning and end of the initial data load, the start of the operation, and the range of data processed. CDC Source extracts changed rows within the processing range established by the CDC Control task, using the database artifacts that SQL Server generates when CDC is set up on the database being observed for changes. The CDC Splitter transformation routes the changed data extracted by CDC Source into three output categories—Insert, Update, and Delete—so that different processing logic can be applied to each category. Chapter 8, “Working with Change Data Capture in SSIS 2012,” discusses these components in detail. As mentioned earlier, these CDC components support only SQL Server databases; however, SSIS supports CDC on Oracle databases through a Windows service that replays Oracle database changes into a SQL Server database, enabling change processing with the CDC task and components.
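For reference, CDC must be enabled in T-SQL on the SQL Server database and table before any of these components can run. A minimal sketch (the database and table names are hypothetical):

USE SalesDB;
GO
-- Enable CDC at the database level (creates the cdc schema and capture jobs).
EXEC sys.sp_cdc_enable_db;
GO
-- Track changes on one table; SQL Server generates the change table
-- and functions that CDC Source reads at package execution time.
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;   -- no gating role
GO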
Coordination of Data Maintenance, Processing, or Analysis
The Data Flow task that supports the data extraction, transformation, and loading described in previous sections executes in the context of a workflow defined in the Control Flow section of an SSIS package. Control Flow in SSIS is task oriented, and it coordinates the execution of data processing tasks in the business workflow. In addition to the special-purpose Data Flow task powered by a buffer-based data processing engine, SSIS has several built-in tasks that you can use when building the control flow. These tasks, executed by the SSIS runtime engine, are useful for operations such as database administration steps, preparation for data flow execution, command execution on Analysis Services, and many other operations common in data integration. You can also build custom tasks using the SSIS programming model and use them as part of the Control Flow. Execution of scripts is also possible in the Control Flow.
Three types of containers are available to host tasks. The Sequence container groups tasks and containers so you can manage them as one unit. The Foreach Loop container repeats steps in a control flow by enumerating files or objects, and the For Loop container repeats steps based on a conditional expression. These containers can host other containers along with tasks. Tasks and containers in the Control Flow are connected by precedence constraints that determine the execution path in the workflow: they specify the order in which tasks and containers execute at runtime, or conditions that determine which part of the workflow executes next. The simplicity of using constraints makes orchestration of steps in an SSIS workflow easy to build, debug, and manage.
SSIS includes tasks for performing database maintenance operations. These tasks are useful in building database maintenance plans in SQL Server Management Studio and also in SQL Server Data Tools along with other tasks that you can use when constructing control flow in SSIS. Figure 1-4 shows the SQL Server Management Studio designer in the process of building a database maintenance plan using SSIS maintenance tasks. Some of the popular database maintenance tasks are the following:
Backup Database task Performs SQL Server database backups.
Rebuild Index task Rebuilds indexes in SQL Server database tables and views.
Update Statistics task Updates information about the distribution of key values for one or more sets of statistics on the specified table or view.
Shrink Database task Reduces the size of SQL Server database data and log files.
Figure 1-4 Database maintenance tasks in SQL Server Management Studio.
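Under the covers, these maintenance tasks issue standard T-SQL maintenance statements. Rough equivalents are shown below (the database, table, and backup path are hypothetical):

-- Backup Database task
BACKUP DATABASE SalesDB
    TO DISK = N'D:\Backups\SalesDB.bak';

-- Rebuild Index task
ALTER INDEX ALL ON dbo.Orders REBUILD;

-- Update Statistics task
UPDATE STATISTICS dbo.Orders;

-- Shrink Database task
DBCC SHRINKDATABASE (SalesDB);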
Tasks available to build Control Flow are typically used to prepare for Data Flow task execution. For example, the Execute SQL task runs SQL statements on data stores; it is used for operations such as creating or truncating tables, preparing a staging database, getting the maximum value of an identity column in a table, executing stored procedures, or getting a count of rows in a spreadsheet. The Execute SQL task can be used with a wide range of data sources and supports several connection manager types for that purpose. Getting data files from external systems to data integration servers is a common data loading preparation step. SSIS offers a few options for this operation, and the following are the most popular:
File System task Performs operations on files and directories in the file system. For example, this task can be used to get data files from a remote file share and copy them to a newly created directory in the local file system.
FTP task Downloads data files from an FTP server, uploads data files to an FTP server, or manages directories on the server. For example, this task can be used to get data from an FTP location for ETL processing and delete the original file from that location after download.
Web Service task Executes a Web service method. For example, this task can be used to get data from a Web service that can be written to a variable or file.
If the data file obtained using one of these tasks is in XML format, you can use the XML task in SSIS to process it. This task can reformat data using XSLT, select XML nodes using XPath queries, or merge multiple XML documents. If the data file is a text or XML file and the data has to be loaded into SQL Server without any transformations, you can use the Bulk Insert task, which wraps the BULK INSERT statement in SQL Server. This provides an efficient way to copy large amounts of data to SQL Server tables or views.
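As an illustration of what the Bulk Insert task wraps, a minimal BULK INSERT statement might look like the following (the table and file path are hypothetical):

-- Load a comma-delimited text file directly into a table,
-- skipping the header row; no transformations are applied.
BULK INSERT dbo.Sales
FROM 'D:\data\sales.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2
);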
SSIS has two Analysis Services tasks used for executing operations against Microsoft SQL Server Analysis Services databases. The Analysis Services Execute DDL task is used to execute Data Definition Language statements to create, drop, or alter Analysis Services mining models, cubes, or dimensions. The Analysis Services Processing task is used to process these artifacts once they are created.
Any data maintenance or processing functionality not available in SSIS can be implemented using the Script task. For example, if preparation for a Data Flow task involves getting data from a source that has no built-in SSIS connector, or if a particular data processing step needed in the Control Flow is not available in SSIS, the Script task fills the gap.
SSIS packages can be configured to restart execution from the point of failure if a step in the Control Flow fails. You control package restart using checkpoint files. If a package uses checkpoint files, information about the package execution is written to the file. Therefore, if the package must be restarted after a failure, execution can restart with the failed task or container. SSIS checkpoints are particularly useful to avoid unnecessarily repeating heavyweight operations that have already completed successfully in the Control Flow when rerunning a package, such as copying a huge file, or executing a time-consuming data flow.
Administrative operations and maintenance steps often need to be executed on a regular basis and at scheduled time intervals. SQL Server Agent is used to automate execution of SSIS packages.
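A minimal sketch of such automation in T-SQL, assuming the package has been deployed to the SSIS catalog. The job, folder, project, and package names are hypothetical, and the step command follows dtexec-style syntax for catalog execution:

USE msdb;
GO

-- Create the job and a step that runs the package through the
-- SQL Server Agent SSIS subsystem.
EXEC dbo.sp_add_job
    @job_name = N'Nightly DW Load';

EXEC dbo.sp_add_jobstep
    @job_name  = N'Nightly DW Load',
    @step_name = N'Run LoadDW package',
    @subsystem = N'SSIS',
    @command   = N'/ISSERVER "\SSISDB\ETL\DWProject\LoadDW.dtsx" /SERVER localhost';

-- Schedule the job to run daily at 2:00 AM and bind it to this server.
EXEC dbo.sp_add_jobschedule
    @job_name          = N'Nightly DW Load',
    @name              = N'Daily 2 AM',
    @freq_type         = 4,        -- daily
    @freq_interval     = 1,        -- every day
    @active_start_time = 020000;   -- 02:00:00

EXEC dbo.sp_add_jobserver
    @job_name = N'Nightly DW Load';
GO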