Script to enable fast-diff on an entire pool of images and rebuild the object-map

This script enables the requisite features (exclusive-lock, object-map and fast-diff) on all RBD images in a pool so that rbd du returns a result quickly instead of having to calculate the size every time.

rbd ls -p backup1 | while read -r image; do
  echo "$image"
  rbd feature enable backup1/"$image" exclusive-lock object-map fast-diff
  rbd object-map rebuild backup1/"$image"
  # NR>1 skips the header line of 'rbd snap ls'; rebuild each snapshot's object map too
  rbd snap ls backup1/"$image" | awk 'NR>1 {print $2}' | while read -r snapname; do
    echo "$image@$snapname"
    rbd object-map rebuild backup1/"$image@$snapname"
  done
done
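
Once the loop finishes you can spot-check an image to confirm the features are active and that rbd du now returns quickly (the image name below is just a placeholder):

rbd info backup1/someimage | grep features
rbd du -p backup1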

Create Bluestore OSD backed by SSD

Don't take my word for it on the WAL sizing – check http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/

This script will also create a spare 20G logical volume to use as the WAL/DB for a second spinner later if you need it.

export SSD=sdc
export SPINNER=sda

vgcreate ceph-ssd-0 /dev/$SSD
vgcreate ceph-hdd-0 /dev/$SPINNER

lvcreate --size 20G -n block-0 ceph-hdd-0
lvcreate -l 100%FREE -n block-1 ceph-hdd-0

lvcreate --size 20G -n ssd-0 ceph-ssd-0
lvcreate --size 20G -n ssd-1 ceph-ssd-0
lvcreate -l 100%FREE -n ssd-2 ceph-ssd-0

ceph-volume lvm create --bluestore --data ceph-hdd-0/block-1 --block.db ceph-ssd-0/ssd-0
ceph-volume lvm create --bluestore --data ceph-ssd-0/ssd-2
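
Later on, when that second spinner is added, the spare 20G ssd-1 LV can back its DB/WAL the same way. A rough sketch, assuming the new disk shows up as /dev/sdb (substitute your own device name):

export SPINNER2=sdb

vgcreate ceph-hdd-1 /dev/$SPINNER2
lvcreate -l 100%FREE -n block-0 ceph-hdd-1

ceph-volume lvm create --bluestore --data ceph-hdd-1/block-0 --block.db ceph-ssd-0/ssd-1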

Ceph Nautilus – “Required devices (block and data) not present for bluestore”

When using the new ceph-volume simple scan and activate commands on Ceph Nautilus after an upgrade from Luminous, I was getting the following message:

[root@ceph2 ~]# ceph-volume simple activate --all
--> activating OSD specified in /etc/ceph/osd/37-11af5440-dadf-40e3-8924-2bbad3ee5b58.json
Running command: /bin/ln -snf /dev/sdh2 /var/lib/ceph/osd/ceph-37/block
Running command: /bin/chown -R ceph:ceph /dev/sdh2
Running command: /bin/systemctl enable ceph-volume@simple-37-11af5440-dadf-40e3-8924-2bbad3ee5b58
Running command: /bin/ln -sf /dev/null /etc/systemd/system/ceph-disk@.service
--> All ceph-disk systemd units have been disabled to prevent OSDs getting triggered by UDEV events
Running command: /bin/systemctl enable --runtime ceph-osd@37
Running command: /bin/systemctl start ceph-osd@37
--> Successfully activated OSD 37 with FSID 11af5440-dadf-40e3-8924-2bbad3ee5b58
--> activating OSD specified in /etc/ceph/osd/11-8c5b0218-4d32-404f-b06b-f6e90906ab7d.json
--> Required devices (block and data) not present for bluestore
--> bluestore devices found: [u'data']
--> RuntimeError: Unable to activate bluestore OSD due to missing devices

You can see one volume activated while the other didn’t.
It turns out this is because one OSD was configured for bluestore and the other wasn't, and there is a bug in the ceph-volume scan command: when it writes out the /etc/ceph/osd/{OSDID}-{GUID}.json files it omits the "type": "filestore" line for any non-bluestore disks, while ceph-volume simple activate assumes a volume is bluestore unless the json file says otherwise.
The quick and easy fix was to add the line "type": "filestore", to the json file of each non-bluestore disk and run ceph-volume simple activate --all again.
Time permitting I'll hunt down the bug in the scan command and submit a pull request if it's not already been done.
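
For reference, the fix for the OSD that failed above looks something like this (the json path comes straight from the activation output):

# add the missing type line to the non-bluestore OSD's json file
vi /etc/ceph/osd/11-8c5b0218-4d32-404f-b06b-f6e90906ab7d.json
#   "type": "filestore",

# then re-run the activation
ceph-volume simple activate --all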

Ceph scrubbing performance

Original article here – http://sudomakeinstall.com/linux-systems/ceph-scrubbing

Ceph's default IO priority and class for its behind-the-scenes disk operations treat them as required work rather than best-effort background work. Those of us who actually utilize our storage for services that require performance will quickly find that deep scrub grinds even the most powerful systems to a halt.

Below are the settings to run the scrub at the lowest possible priority. This REQUIRES CFQ as the scheduler for the spinning disks; without CFQ you cannot prioritize IO. Since only one service utilizes these disks, CFQ performance will be comparable to deadline and noop.

Show the current scheduler

for file in /sys/block/sd*; do
  echo ${file}
  cat ${file}/queue/scheduler
  echo ""
done

Set all disks to CFQ

for file in /sys/block/sd*; do
  echo cfq > ${file}/queue/scheduler
  cat ${file}/queue/scheduler
  echo ""
done
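
Note that a scheduler set via /sys doesn't survive a reboot. One way to make it stick (just a sketch, the rules file name is arbitrary) is a udev rule that applies CFQ to every rotational sd device:

# /etc/udev/rules.d/60-ceph-cfq.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="cfq"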

Inject the new settings into the existing OSDs:
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 7'
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'

Edit your ceph.conf on your storage nodes so the priority is set automatically when the OSDs start.
#Reduce impact of scrub.
osd_disk_thread_ioprio_class = "idle"
osd_disk_thread_ioprio_priority = 7
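
You can confirm the values took effect by asking an OSD daemon directly on its node (osd.0 here is just an example):

ceph daemon osd.0 config get osd_disk_thread_ioprio_class
ceph daemon osd.0 config get osd_disk_thread_ioprio_priority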

You can go a step further and apply Red Hat's optimizations for the system characteristics.
tuned-adm profile latency-performance
This information is drawn from multiple sources.

Reference documentation.
http://dachary.org/?p=3268

Disable scrubbing in realtime to determine its impact on your running cluster.
http://dachary.org/?p=3157

A detailed analysis of the scrubbing io impact.
http://blog.simon.leinen.ch/2015/02/ceph-deep-scrubbing-impact.html

OSD Configuration Reference
http://ceph.com/docs/master/rados/configuration/osd-config-ref/

Red Hat system tuning.
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Performance_Tuning_Guide/sect-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Tool_Reference-tuned_adm.html

OpenStack Queens/Rocky – Glance: Create an image from an existing RBD volume

Seems like something that should be simple!

Step 1 – Create the empty image

glance image-create --container-format=bare --disk-format=raw  --min-ram 2048 --name="Windows Server test"
+------------------+--------------------------------------+
| Property         | Value                                |
+------------------+--------------------------------------+
| checksum         | None                                 |
| container_format | bare                                 |
| created_at       | 2020-07-14T03:37:29Z                 |
| disk_format      | raw                                  |
| id               | ca9bf831-df20-4c7c-884e-08e355e5a012 |
| locations        | []                                   |
| min_disk         | 0                                    |
| min_ram          | 2048                                 |
| name             | 3CX - July 2020                      |
| os_hash_algo     | None                                 |
| os_hash_value    | None                                 |
| os_hidden        | False                                |
| owner            | ccd547784c804de99a63dd17dfb7ff15     |
| protected        | False                                |
| size             | None                                 |
| status           | queued                               |
| tags             | []                                   |
| updated_at       | 2020-07-14T03:37:29Z                 |
| virtual_size     | None                                 |
| visibility       | shared                               |
+------------------+--------------------------------------+

Step 2 – Update the 'location' attribute (the trailing UUID argument is the image ID returned in step 1)

glance location-add --url "rbd://b42a82f3-f493-49f4-98e0-2d355bbe8ee3/saspool/image-Windows2016v1/snap" 763a2ca2-e8f8-4bf9-974f-98d7020e200b

OR if using the OpenStack CLI

openstack image set --property direct_url='rbd://cbb8d45d-79ee-4cf5-9ace-eeb145c89fd2/pool/image-pfsense235/snap' ee52d217-d2c8-4be4-b576-7766373de9e7
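
Either way, it's worth confirming the location registered against the image (the ID below is the one from the openstack example above; the locations / direct_url fields should now show the rbd:// URL):

# either client will do
glance image-show ee52d217-d2c8-4be4-b576-7766373de9e7
openstack image show ee52d217-d2c8-4be4-b576-7766373de9e7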

You'll obviously need to ensure you have a protected snapshot of your image, like so

:~# rbd snap create saspool/image-Windows2016v1@snap

:~# rbd snap protect saspool/image-Windows2016v1@snap

:~# rbd snap ls saspool/image-Windows2016v1

SNAPID NAME   SIZE TIMESTAMP

18 snap 150 GB Thu Apr 26 13:24:30 2018

You’ll also need to ensure that show_multiple_locations = true is set in glance-api.conf
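
That option goes in the [DEFAULT] section of glance-api.conf; restart the glance-api service after changing it:

# /etc/glance/glance-api.conf
[DEFAULT]
show_multiple_locations = true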

Ceph commands

Get status of monitors

ceph quorum_status --format json-pretty

Set the weight of a disk lower so it gets less IO, or use it to scale a disk in/out of a cluster

ceph osd reweight osd.30 .01 --cluster mycluster

Move an OSD to a node in the crush map, useful if your OSD comes up not attached to a node

ceph osd crush move osd.27 host=adleast-node04 --cluster mycluster

Ensure this OSD is never a primary – requires allowing primary-affinity changes on the mons (mon_osd_allow_primary_affinity = true on older releases)

ceph osd primary-affinity osd.27 0 --cluster mycluster

How many PGs have osd.0 as their primary? (change the 0 in the grep below to the OSD id you are interested in)

ceph pg dump | grep active+clean | egrep "\[0," | wc -l
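
To get the same count for every OSD in one pass you can dump the PG map once and tally the first entry of each up set instead (a sketch that assumes replicated pools):

ceph pg dump 2>/dev/null | grep active+clean | \
  awk '{ match($0, /\[[0-9]+,/); print "osd." substr($0, RSTART+1, RLENGTH-2) }' | \
  sort | uniq -c | sort -rn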

Activate an existing bluestore disk

ceph-volume lvm activate --bluestore 34 2c9b2862-1db6-4863-935d-32b7289fee7d

Find all bluestore OSD IDs and FSIDs

(Used to generate the above command)

lvs -o lv_tags | awk -vFS=, /ceph.osd_fsid/'{ OSD_ID=gensub(".*ceph.osd_id=([0-9]+),.*", "\\1", ""); OSD_FSID=gensub(".*ceph.osd_fsid=([a-z0-9-]+),.*", "\\1", ""); print OSD_ID,OSD_FSID }' | sort -n | uniq
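
To activate everything that query finds in one pass, the ID/FSID pairs can be piped straight into the activate command from above (ceph-volume lvm activate --all achieves the same thing with less typing):

lvs -o lv_tags | awk -vFS=, /ceph.osd_fsid/'{ OSD_ID=gensub(".*ceph.osd_id=([0-9]+),.*", "\\1", ""); OSD_FSID=gensub(".*ceph.osd_fsid=([a-z0-9-]+),.*", "\\1", ""); print OSD_ID,OSD_FSID }' | sort -n | uniq | while read -r id fsid; do
  ceph-volume lvm activate --bluestore "$id" "$fsid"
done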

Get the count of Ceph PGs per OSD

ceph pg dump | awk '
BEGIN { IGNORECASE = 1 }
 /^PG_STAT/ { col=1; while($col!="UP") {col++}; col++ }
 /^[0-9a-f]+\.[0-9a-f]+/ { match($0,/^[0-9a-f]+/); pool=substr($0, RSTART, RLENGTH); poollist[pool]=0;
 up=$col; i=0; RSTART=0; RLENGTH=0; delete osds; while(match(up,/[0-9]+/)>0) { osds[++i]=substr(up,RSTART,RLENGTH); up = substr(up, RSTART+RLENGTH) }
 for(i in osds) {array[osds[i],pool]++; osdlist[osds[i]];}
}
END {
 printf("\n");
 printf("pool :\t"); for (i in poollist) printf("%s\t",i); printf("| SUM \n");
 for (i in poollist) printf("--------"); printf("----------------\n");
 for (i in osdlist) { printf("osd.%i\t", i); sum=0;
   for (j in poollist) { printf("%i\t", array[i,j]); sum+=array[i,j]; sumpool[j]+=array[i,j] }; printf("| %i\n",sum) }
 for (i in poollist) printf("--------"); printf("----------------\n");
 printf("SUM :\t"); for (i in poollist) printf("%s\t",sumpool[i]); printf("|\n");
}'

 

Original article – http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd