Troubleshooting Guide

Revision as of 11:03, 30 March 2022 by Sta (talk | contribs) (Minor: changed links to troubleshooting sections/en)

Welcome to the latest SEP sesam documentation version 4.4.3 Beefalo/5.0.0 Jaglion. For previous documentation versions, check the documentation archive.


Introduction

This guide is intended to help you quickly identify and resolve problems and errors that occur during setup, installation, and normal operation of your SEP sesam system. SEP sesam often serves as an indicator of a change or event that affects overall system performance: changes in SEP sesam backup performance are often caused by system changes or failures.

Related troubleshooting topics:

  • First steps
  • Interpreting Error Messages
  • Setting Log Level: For details on how to set a log level for a specific backup or restore task in the GUI or globally for SEP sesam kernel modules, see Setting Log Level.
  • Installation and Configuration Troubleshooting
  • Troubleshooting Authentication
  • Backup Troubleshooting
  • Disaster Recovery Troubleshooting
  • GUI Troubleshooting
  • Network Troubleshooting
  • MS SQL Troubleshooting
  • MS Exchange Troubleshooting
  • NetWare Troubleshooting
  • Micro Focus GroupWise Troubleshooting
  • Oracle Troubleshooting
  • Informix Troubleshooting
  • Lotus Domino Server Troubleshooting
  • VMware Troubleshooting
  • Citrix XEN Server Troubleshooting
  • SAP HANA Troubleshooting
  • SAP ASE Troubleshooting
  • SAP ERP Troubleshooting
  • NetApp Troubleshooting
  • MySQL Troubleshooting
  • BSR Pro for Windows Troubleshooting
  • NDMP Troubleshooting
  • Hyper-V Troubleshooting

KVM/QEMU

Merging and deleting leftover snapshots

Problem

  • The backup fails and the snapshot is left behind.

Solution

Use the virsh utility, as shown in the example:

  1. List the available snapshots for the domain:
      user@hypervisor:~$ virsh snapshot-list <domain_name>
      Name                 Creation Time             State
      ------------------------------------------------------------
      Sesam_SF20173828282@XXXX      2017-07-06 08:15:11 +0200 disk-snapshot

     In this example, one leftover snapshot exists for this VM.

  2. List the virtual disks for the domain:
      user@hypervisor:~$ virsh domblklist <domain_name>
      Target     Source
      ------------------------------------------------
      sda        /path/to//Sesam_SF20173828282@XXXX.snapshot
  3. For each device that refers to a SEP sesam snapshot, start a block commit to merge the snapshot:
      user@hypervisor:~$ virsh blockcommit <domain_name> sda --active --verbose --pivot
      Block Commit: [100 %]
      Successfully pivoted
  4. Confirm that the device is now switched back to the original disk device:
      user@hypervisor:~$ virsh domblklist <domain_name>
      Target     Source
      ------------------------------------------------
      sda        /my/original/base.img
  5. Delete the snapshot metadata:
      user@hypervisor:~$ virsh snapshot-delete <domain_name> --metadata --snapshotname Sesam_SF20173828282@XXXX
      Domain snapshot Sesam_SF20173828282@XXXX deleted
  6. Delete the snapshot file:
      user@hypervisor:~$ rm /path/to//Sesam_SF20173828282@XXXX.snapshot
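
When several disks are affected, the block commit step can be repeated for every device that still points at a snapshot file. The following is a minimal sketch, not an official SEP sesam script: the function names are hypothetical, and it assumes that leftover snapshot files follow the Sesam_*.snapshot naming seen in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names and the Sesam_*.snapshot pattern are
# assumptions based on the example output in this section.

# Read `virsh domblklist` output on stdin and print the target of every
# device whose source file looks like a SEP sesam snapshot.
sesam_snapshot_targets() {
    awk '$2 ~ "/Sesam_[^/]*[.]snapshot$" { print $1 }'
}

# Run a block commit for every affected device of the given domain.
merge_sesam_snapshots() {
    local domain=$1 dev
    virsh domblklist "$domain" | sesam_snapshot_targets |
    while read -r dev; do
        # </dev/null keeps virsh from consuming the device list on stdin.
        virsh blockcommit "$domain" "$dev" --active --verbose --pivot </dev/null
    done
}
```

Calling merge_sesam_snapshots <domain_name> covers only the merge step. Confirming the result with virsh domblklist, deleting the snapshot metadata, and removing the snapshot file are still done as described above.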


Si3 Deduplication

Unable to establish connection to S3 data store

Problem

An Si3 NG data store may be unable to establish a secure connection to S3 storage and report the following error:

Error: Could not access data store.
Server Status: 2023-03-30 10:17:10: ERROR Not started due to error: S3 is not connected
Server Status: 2023-03-30 10:17:10: ERROR Not started due to error: S3 is not connected

Cause

If the Si3 NG data store connects to a storage provider that uses a self-signed certificate, the certificate is not trusted by default because it was not issued by a trusted certificate authority. As a result, the connection can be denied, and the log files in /var/opt/sesam/var/log/sms may contain a message similar to this:

[...default-dispatcher-6] ERROR S3 - Unexpected error: {}, cause: {}
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: javax.net.ssl.SSLHandshakeException: General OpenSslEngine problem

Solution

To solve this problem, use the keytool utility to import the public.crt certificate into the server certificate store. This allows the Si3 server to recognize and trust the S3 storage provider's certificate and to establish a secure connection.

  1. Obtain the public certificate. Note that you can export it from the browser.
  2. Locate the cacerts file on your server. This is the location of your JVM certificate keystore.
  3. Import the public.crt certificate into the JVM's certificate keystore with the following command:
     On Linux:
      keytool -import -trustcacerts -keystore /var/lib/ca-certificates/java-cacerts -storepass changeit -noprompt -alias <storage backend endpoint URL> -file /<path_to_certificate>/public.crt
     On Windows:
      C:\Program Files\ojdkbuild\java-11-openjdk-11.0.15-1\bin>keytool -import -trustcacerts -keystore "C:\Program Files\ojdkbuild\java-11-openjdk-11.0.15-1\lib\security\cacerts" -storepass changeit -noprompt -alias <storage backend endpoint URL> -file <path_to_certificate>\public.crt
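
If exporting the certificate from a browser is not practical, it can usually be fetched on the command line, and the import can be checked afterwards. The following is a minimal sketch for Linux, assuming that openssl and keytool are on the PATH and that the keystore still uses the default changeit password. The function names are hypothetical and not part of SEP sesam.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not shipped with SEP sesam.

# Save the certificate presented by an S3 endpoint as a PEM file.
# Usage: fetch_s3_cert <host> <port> <output_file>
fetch_s3_cert() {
    if [ "$#" -ne 3 ]; then
        echo "usage: fetch_s3_cert <host> <port> <output_file>" >&2
        return 2
    fi
    openssl s_client -connect "$1:$2" -servername "$1" </dev/null 2>/dev/null |
        openssl x509 -outform PEM >"$3"
}

# List the keystore entry for an alias to confirm that the import worked.
# Usage: show_imported_cert <keystore> <alias>
show_imported_cert() {
    if [ "$#" -ne 2 ]; then
        echo "usage: show_imported_cert <keystore> <alias>" >&2
        return 2
    fi
    keytool -list -keystore "$1" -storepass changeit -alias "$2"
}
```

The file written by fetch_s3_cert can be passed to the keytool -import command above in place of a browser export. Inspect the fetched certificate before trusting it, because this method accepts whatever the endpoint presents.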

Issues with S3 or S3-compatible storage

Problem

  • An Si3 NG data store that uses S3 or S3-compatible storage can experience various issues, depending on the cloud storage provider. These issues can affect backups, migrations, and replications. In addition, the sanity state check of Si3 NG may report errors that have a similar root cause.

Cause

  • Some cloud storage providers (for example, Wasabi) restrict the request rate, that is, the number of HTTP(S) requests allowed per second. Also, on local storage with the S3 option enabled, multiple RDSs accessing the same local S3 storage can generate a lot of IOPS (I/O operations per second).

Solution

  • You can adjust the settings on the affected Si3 NG data store:
  1. In the Main selection -> Components, click Data Stores to display the data store contents frame.
  2. Right-click the selected Si3 NG data store and then click Properties.
  3. Double-click a drive to open the Drive Properties dialog, and then enter the following in the Options field:
      dedup.s3.timeoutInSeconds=1200,dedup.s3.page.workers=2,dedup.maxAsyncRequests=50
     This increases the timeout period, the number of active page workers, and the request rate.

Si3 remains in "shutting down" state

Problem

  • Manually stopping Garbage Collection (GC) fails and consequently Si3 remains in the "shutting down" state.

Solution

  • Restart the Si3 daemon by using sm_main restart sds. For more details on stopping and starting the SEP sesam services, see How to Start and Stop SEP sesam.

Si3 deduplication may not work with NFSv4

Problem

  • Si3 deduplication may not work with Network File System version 4 (NFSv4).

Cause

  • SEP sesam operations, such as backup, restore and migration, may fail due to Java problems with NFSv4.

Solution

  • To avoid this problem, connect your backup devices via NFSv3.

Repairing corrupted Si3 NG data store

You can repair the Si3 NG store when pages or objects get corrupted.

  1. First determine the scope of corruption:
    • To get the list of corrupted objects use:
      sm_dedup_interface -d <datastore> corruptedobjects
    • To get the list of corrupted pages use:
      sm_dedup_interface -d <datastore> corruptedpages
  2. Use the following command to replace the page in the /pages directory with an older version from the /pages-trash directory:
    sm_dedup_interface -d <datastore> repair pages
    The pages in trash contain all chunks deleted by the previous GC. The oldest version of a page takes priority.
  3. Use the following command to search for and recover the missing chunks in the /pages-trash directory:
    sm_dedup_interface -d <datastore> repair start
    During the repair process, a new page is created that contains all chunks from the current page (the page affected by the 'missing chunks' issue) and all chunks found in the trash.
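
The repair steps above can be wrapped in a small helper so that each step is run deliberately and its output reviewed before the next one starts. This is a minimal sketch: the function name and the step labels (scope, pages, chunks) are hypothetical, and only the sm_dedup_interface calls listed above are used.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper; the step labels are illustrative, the commands are
# the ones listed in this section.
# Usage: si3_repair_step <datastore> scope|pages|chunks
si3_repair_step() {
    local store=$1 step=$2
    case "$step" in
        scope)
            # Step 1: determine the scope of the corruption.
            sm_dedup_interface -d "$store" corruptedobjects
            sm_dedup_interface -d "$store" corruptedpages
            ;;
        pages)
            # Step 2: replace pages with older versions from /pages-trash.
            sm_dedup_interface -d "$store" repair pages
            ;;
        chunks)
            # Step 3: search for and recover missing chunks in /pages-trash.
            sm_dedup_interface -d "$store" repair start
            ;;
        *)
            echo "usage: si3_repair_step <datastore> scope|pages|chunks" >&2
            return 2
            ;;
    esac
}
```

The sketch does not check whether a step has finished or succeeded. Running the scope step again after a repair shows whether any corrupted objects or pages remain.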

Cleanup of unrecoverable Si3 NG store

SEP Warning.png Warning
Use the commands described in this section only if the corrupted store cannot be recovered.

Corruptions in the Si3 NG store can persist when the initial page version has already been purged from trash or when fatal errors occurred during backup or restore. In these cases, broken pages or missing chunks cannot be recovered.

Cleanup can be performed by deleting unrecoverable objects manually or by using the automatic cleanup function.

Deleting objects

When there are only a few unrecoverable objects, delete each object with the following commands:

sm_dedup_interface -d <datastore> delete corrupted_object_id_1

...

sm_dedup_interface -d <datastore> delete corrupted_object_id_Nth

If there are many corruptions, you can delete all corrupted objects with the following command:

sm_dedup_interface -d <datastore> fsck purge

Garbage collection

When you have deleted all unrecoverable objects, run garbage collection (gc):

sm_dedup_interface -d <datastore> gc start
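
The manual path (delete each unrecoverable object, then run garbage collection) can be scripted when the object IDs are already known. This is a minimal sketch: the function name is hypothetical, and it assumes that sm_dedup_interface returns a non-zero exit status when a delete fails, which is not stated in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Assumption: a non-zero exit status from
# sm_dedup_interface signals a failed delete.
# Usage: si3_delete_objects <datastore> <object_id>...
si3_delete_objects() {
    if [ "$#" -lt 2 ]; then
        echo "usage: si3_delete_objects <datastore> <object_id>..." >&2
        return 2
    fi
    local store=$1 id
    shift
    for id in "$@"; do
        # Stop before garbage collection if any delete fails.
        sm_dedup_interface -d "$store" delete "$id" || return 1
    done
    # All listed objects are deleted; start garbage collection.
    sm_dedup_interface -d "$store" gc start
}
```

The object IDs passed to the function are the ones reported by the corruptedobjects command. As with the commands above, use this only when the store cannot be repaired.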

Automatic cleanup function

To start an automatic cleanup function, use the following command:

sm_dedup_interface ... fsck purge auto

The automatic cleanup function runs the following sequence of commands: PCCK start -> OCCK start -> Delete all corrupted objects -> GC start.

Logging

The logging function uses a relatively powerful logback library. For more information, see Logback Project. Note that this information is intended for advanced users only.

Logging info
  • gv_rw_ini:sm_sds.xml (/var/opt/sesam/var/ini/sm_sds.xml)
  • /var/opt/sesam/var/log/sms contains the following log files:
    • sm_dedup_server_info-<drive>.log: Log level INFO and higher.
    • sm_dedup_server-<drive>.log: Log level DEBUG and higher. This file can become quite large.
    • sm_dedup_gc-<drive>.log: Garbage collection log.
    • sm_dedup_fsck-<drive>.log: File system check log.
  • Log files are rotated automatically when they reach 100 MB.
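
To scan the INFO-level log of one drive for errors, a small helper can build the file name from the drive and filter the lines. This is a minimal sketch: the function name and the SESAM_SMS_LOG_DIR variable are hypothetical (the variable only allows the default directory to be overridden), and it assumes that error lines contain the literal string ERROR, as in the log excerpt shown earlier in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper. SESAM_SMS_LOG_DIR is not a SEP sesam variable; it is
# only used here to override the default log directory.
# Usage: si3_log_errors <drive>
si3_log_errors() {
    local drive=$1
    local dir=${SESAM_SMS_LOG_DIR:-/var/opt/sesam/var/log/sms}
    # Print every line of the INFO-level log that mentions ERROR.
    grep -h 'ERROR' "$dir/sm_dedup_server_info-$drive.log"
}
```

Only the current log file is searched. Files that have already been rotated are not covered by this sketch.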

Files and directories

Objects

For every SEP sesam saveset, three objects (files) are stored in the Si3 NG store:

  • <ssid>.data
  • <ssid>.info
  • <ssid>.info2

The .data and .info files are identical to those of a normal data store. The .info2 file is required for the data to be appended to a Si3 object. All database information that is not available before a backup is completed is written to this file.

Directories


Related troubleshooting topics: Tape and Tape Devices Troubleshooting, VSS Troubleshooting

HPE StoreOnce Catalyst

HPE StoreOnce backups fail

If your backups fail, one of the reasons could be related to HPE StoreOnce sizing parameters. In this case, check the following:

  • If you have defined a Physical or Logical Storage Quota for your Catalyst store, check if the quota limits have been reached. If so, increase the quota to a sufficient size. For details, see HPE StoreOnce Configuration.
  • If you created an HPE Catalyst data store in SEP sesam and later changed the HPE Catalyst store size parameters, e.g., by changing or removing the storage quotas, check their values in SEP sesam GUI: Main selection -> Components -> Data Stores, double-click your StoreOnce store, and then click the HPE Catalyst Store State tab. If the data store status is set to "failed", you may need to adjust the StoreOnce data store (Size) Capacity and High watermark values to allow for correct calculation and make the data store functional again. For details, see Size and Disk space usage.

HPE StoreOnce backup failed with "553 STOR Failed. Error: Ctl_OpenObject failed, error: OSCLT_ERR_MAXIMUM_SESSIONS. (0)"

Problem

  • HPE StoreOnce backup failed with 553 STOR Failed. Error: Ctl_OpenObject failed, error: OSCLT_ERR_MAXIMUM_SESSIONS. This issue can occur when an HPE StoreOnce Catalyst data session fails to open because the maximum number of sessions on the HPE StoreOnce Catalyst server is reached, thus no other session can be started until resources are available. You can check the number of free data sessions in SEP sesam by opening your HPE StoreOnce data store properties and clicking the tab HPE Catalyst Store State, then checking the value of Free data sessions.

Solution

There are two possible solutions:

  • The HPE Catalyst server supports a limited number of concurrent data sessions. Therefore, you need to ensure that the maximum number of data sessions for your HPE StoreOnce is not exceeded, or increase the maximum number of data sessions (including restores) that can be run simultaneously.
  • Open SEP sesam and from the Main selection -> Components -> Data Stores double-click your HPE StoreOnce data store to open the properties. Then, under Drives click the relevant drive to view its properties. In the Max. channels drop-down list, decrease the number of available channels for concurrently running backup/migration streams. By default, the number of available channels is set according to your SEP sesam Server license (e.g., the standard license supports 10 concurrent streams, enabling 10 backup processes to run simultaneously).


Proxmox VE

Proxmox VE backup does not work with the client name as plain IP address

Problem

  • If the node added to the SEP sesam environment is not set up correctly and an IP address is used as the client name instead of the client's hostname matching the hostname returned by the Proxmox server, the backup fails.

Solution

Proxmox VE backup failed with a connection error

Problem

  • Proxmox VE backup failed with Error connecting to proxmox System: "('Connection aborted.', gaierror(-2, 'Name or service not known'))"]. This error occurs after a VM has been migrated to another cluster node. Consequently, the Proxmox VE backup fails.

Solution

  • Ensure that every Proxmox VE cluster node can correctly resolve the other cluster nodes using DNS or the hosts file.
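
Name resolution can be checked on each cluster node with getent, which consults the same sources (hosts file and DNS) that the system resolver uses. This is a minimal sketch: the function name and the node names in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every given node name that cannot be resolved.
# Returns 1 if at least one name failed to resolve, 0 otherwise.
# Usage: unresolved_nodes <node_name>...
unresolved_nodes() {
    local node rc=0
    for node in "$@"; do
        if ! getent hosts "$node" >/dev/null; then
            echo "$node"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, running unresolved_nodes pve1 pve2 pve3 on every node of a three-node cluster lists the names that the node cannot resolve. No output means that all names resolved on that node.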

File backup of Proxmox hypervisor fails

Problem

  • Backup of /etc/pve/ fails.

Solution

  • Exclude the /etc/pve/ directory from backup. /etc/pve/ is only a file system representation of the cluster database, which is located at /var/lib/pve-cluster/config.db. For more information, see the Proxmox wiki article and Proxmox Cluster File System (pmxcfs).


See also

Error Messages Guide