
Time-series data in MongoDB and Python

MongoDB is a document database where you can store data directly in JSON format. Getting started and building an application with MongoDB is very simple, yet it is a powerful tool.

And you can use it to store time-series data.

Introduction

In a simplified way, a time series is a series of data points in time order. It can be constant, that is, the interval between entries is equal (seconds, minutes, hours), or it can have varying time intervals. The important fact is that each entry has a sequenced timestamp associated with it.

There are many situations in the real world that use time-series data:

  • A weather station acquiring humidity, temperature, and pressure data. You want to know when a given temperature was measured, so you can predict whether you have to wear a jacket to go out.
  • A stock market analyst, who uses all the stock prices over time to run analyses and identify opportunities.
  • A Formula One car that sends telemetry information every second, such as speed, fuel consumption, and temperatures, so the engineers can run calculations and tell the driver what to do next.

There are a few ways of storing this data in a Mongo database, and we are going to look at each of them.

Single document

The first method is to store each acquired sample as a single document in the database. That is, each timestamp (or interval) corresponds to one entry in the database.

Imagine we have a weather system that produces sensor data every second. Each sample contains data such as temperature, humidity, and pressure, along with the device id.

Every time a reading is acquired, the sensor sends it to the database, which stores it as a single document.




Image by author.

As you can see, there is some repetitive or useless information, such as _id and deviceId. These two values are stored again and again, once per second, increasing the size of the collection.

For our scenario, this creates 3600 documents in only one hour, or 86400 per day! This will impact performance and disk usage.
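The document counts follow directly from the one-sample-per-second rate:

```python
# One sample per second means one document per second
docs_per_hour = 60 * 60            # 3600 documents per hour
docs_per_day = docs_per_hour * 24  # 86400 documents per day
```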

Inserting it using Python is trivial:

from datetime import datetime

# `db` is an open pymongo database handle, e.g. MongoClient().weather
data = {
  "pressure": 945.65,
  "humidity": 42.12,
  "temperature": 28.41,
  "timestamp": datetime.now(),
  "deviceId": 1
}
db.single.insert_one(data)
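A sketch of how you might later query this collection for a time range. The window below is hypothetical, and a live MongoDB connection is assumed, so the actual find() call is left commented:

```python
from datetime import datetime, timedelta

# Hypothetical window: the hour leading up to a reference time
end = datetime(2021, 1, 1, 13, 0, 0)
start = end - timedelta(hours=1)

# Filter for one device's readings inside the window; pass it to find()
query = {
    "deviceId": 1,
    "timestamp": {"$gte": start, "$lt": end},
}
# results = db.single.find(query).sort("timestamp", 1)
```

An index on (deviceId, timestamp) would make this kind of range query efficient.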

Time-bucket document

Since our data is evenly spaced, we can group it by time. In this case, one document for each minute. We then have an array with 60 objects, each containing one sample from the sensor.


Image by author.

Compared with the previous option, we have more fields:

  • d: datetime information for the minute the data samples were acquired. It can be a different grouping interval if we want, such as hours or days; it will depend on the frequency of our data or on the application.
  • nsamples: the number of samples acquired in that minute.
  • samples: an array containing all the samples from the sensor. Since we already have the timestamp information and the data is evenly spaced, we can drop the per-sample timestamp and store only the actual sensor data.

These extra fields help us query the data later and aggregate it more effectively.

To insert it with Python, we have to modify our data:

from datetime import datetime

data = {
  "pressure": 945.65,
  "humidity": 42.12,
  "temperature": 28.41,
}
deviceId = 1
# Truncate the current time to the start of the minute
minute = datetime.utcnow().replace(second=0, microsecond=0)

db.time_bucket.update_one(
  {'deviceId': deviceId, 'd': minute},
  {
    '$push': {'samples': data},
    '$inc': {'nsamples': 1}
  },
  upsert=True
)

We use the update_one function instead of insert_one. It will find the document with deviceId equal to 1 and the same minute, and insert the data into the samples field.

If no document with these characteristics is found (that is, a new minute has started), a new document will be created.

Note: make sure the flag upsert=True is passed to the function, otherwise it will not create a new document.

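Reading the buckets back usually means unwinding the samples array. Below is a sketch of an aggregation pipeline that computes the average temperature per minute; the field names follow the document above, and the pipeline is not run against a live server here, so the aggregate() call is left commented:

```python
# $unwind turns each bucket's samples array into one document per sample,
# then $group averages them back per minute bucket
pipeline = [
    {"$match": {"deviceId": 1}},
    {"$unwind": "$samples"},
    {"$group": {
        "_id": "$d",
        "avgTemperature": {"$avg": "$samples.temperature"},
    }},
    {"$sort": {"_id": 1}},
]
# for doc in db.time_bucket.aggregate(pipeline):
#     print(doc)
```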

Size-bucket document

Some data is generated at irregular time intervals, and some sensors can provide more data than others.

In this scenario, a size-based bucket may be a better option than a time-based one. You then insert new samples into a document based on the size of its array instead of the time.


Image by author.

There are two more fields in our document, to help with the queries and aggregations later:

  • first: the timestamp of the first sample inserted into the samples field;
  • last: the timestamp of the last sample inserted into the samples field.

Note: we can keep other data too, such as maximum and minimum sensor values, if we want.

Inserting it is almost the same as with our previous data. We just have to modify the code so we can calculate the fields we want:

from datetime import datetime

data = {
  "pressure": 945.65,
  "humidity": 42.12,
  "temperature": 28.41,
  "timestamp": datetime.utcnow()
}
deviceId = 1

db.size_bucket.update_one(
  {'deviceId': deviceId, 'nsamples': {'$lt': 200}},
  {
    '$push': {'samples': data},
    '$min': {'first': data['timestamp']},
    '$max': {'last': data['timestamp']},
    '$inc': {'nsamples': 1}
  },
  upsert=True
)

Again we use the update_one function, but now the match is against the nsamples field: it has to be less than 200 (our bucket size; you can choose any value you like) and have the same deviceId.

We use the $min and $max operators to automatically track the minimum and maximum timestamps of the inserted data.
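The first and last fields make it cheap to find the buckets that overlap a given time window. A sketch, assuming the size_bucket collection above (the window is hypothetical and the find() call is left commented, since no live server is assumed):

```python
from datetime import datetime

# Hypothetical query window
start = datetime(2021, 1, 1, 12, 0, 0)
end = datetime(2021, 1, 1, 13, 0, 0)

# A bucket overlaps [start, end) when it begins before the window ends
# and ends at or after the window starts
query = {
    "deviceId": 1,
    "first": {"$lt": end},
    "last": {"$gte": start},
}
# buckets = db.size_bucket.find(query)
```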

Final Thoughts

Let’s compare the three collections we created so far.



Image by author.


We have some nice insights after about an hour of data acquisition:


  • The single collection is 65% larger than the time_bucket one. If we had collected more data, the difference might be even larger.
  • Since the size_bucket collection stores more information, its total size is larger than time_bucket. But even so, it is far smaller than the single one.
  • For evenly spaced data, as in our case, time_bucket is the better approach, as it has less information to save.
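One way to reproduce a comparison like this yourself is MongoDB's collStats command, available through pymongo's Database.command. A minimal sketch, assuming the three collections created above exist in the database:

```python
def size_report(db, names=("single", "time_bucket", "size_bucket")):
    # collStats reports, among other things, "size": the total
    # uncompressed size of the documents in the collection, in bytes
    return {name: db.command("collstats", name)["size"] for name in names}
```

Calling size_report(db) returns a dict of collection name to byte size, which you can compare directly.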

If you have time-series data, creating a time-based bucket or even a size-based bucket can improve disk usage and performance, as we saw in this article.

And with Python, it is easy to insert the data into the database, with almost no effort.

You can create an API with the nice FastAPI framework, as this article explains:




Hope you enjoyed this article. You can follow me here and on Twitter for more content like this.



Translated from: https://levelup.gitconnected.com/time-series-data-in-mongodb-and-python-fccfb6c1a923


