3 Document Storage

Note

You need at least the modules in the Data Layer and one Frontend Client to work with document storage:

Figure: amcat instance after following this chapter

This chapter covers how you can upload, change, query and delete documents and indexes on the amcat server. Only a few of these tasks are implemented in the web user interface, so you will need one of the clients in R or Python, or another means of calling the API (e.g., cURL, as shown below). You can also use the API calls to build your own client: let us know about it and we will promote your new API wrapper package here. More information on the API can be found on every amcat4 instance at /redoc (e.g., http://localhost/amcat/redoc).

3.1 Manage Documents With the Web Interface

Coming soon…

Will change soon

Currently, there is no way to upload, change, or delete documents and indexes through the web interface. Rather, you can add new datasets through calls to the amcat API.

3.2 Manage Documents With a Client

For this overview, we log into a local amcat4 instance (i.e., http://localhost/amcat). Replace this with the address of the amcat4 instance you are working with (e.g., https://opted.amcat.nl/api).

We first need to log in:

library(amcat4r)
amcat_login("http://localhost/amcat")

from amcat4py import AmcatClient
amcat = AmcatClient("http://localhost/amcat")

There is no dedicated way at the moment to get a token via cURL. You can still use cURL with instances that do not require authentication or by copying the token from Python or R. In these cases, you can make requests with an extra header, for example:

AMCAT_TOKEN="YOUR_TOKEN"
curl -s http://localhost/amcat/index/ \
  -H "Authorization: Bearer ${AMCAT_TOKEN}"
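The same header can be prepared from Python's standard library. The following is a minimal sketch that only builds the request and does not send it; the helper name build_index_request is made up for this illustration, and the host and token are the placeholders from the cURL example.

```python
from urllib.request import Request


def build_index_request(host: str, token: str) -> Request:
    """Prepare (but do not send) a GET request that lists all indexes."""
    url = host.rstrip("/") + "/index/"
    # The token travels in the Authorization header, as in the cURL call.
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Placeholder values, as in the cURL example above.
request = build_index_request("http://localhost/amcat", "YOUR_TOKEN")
print(request.full_url)
print(request.get_header("Authorization"))
```

Passing the prepared request to urllib.request.urlopen would perform the actual call against a running instance.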

We can first list all available indexes. An index is what a document collection is called in Elasticsearch, and thus in amcat:

list_indexes()

# A tibble: 1 × 1
  name              
  <chr>             
1 state_of_the_union

amcat.list_indices()

[{'name': 'state_of_the_union'}]

curl -s http://localhost/amcat/index/

[{"name":"state_of_the_union"}]

You can see that the test index we added in the Data Layer section is here and that it is called “state_of_the_union”. To see everyone who has been granted access to an index we can use:

list_index_users(index = "state_of_the_union")

# A tibble: 0 × 0

amcat.list_index_users(index="state_of_the_union")

[]

curl -s http://localhost/amcat/index/state_of_the_union/users

[]

We will learn more about users and their roles in the chapter on access management. To see what an index looks like, we can query it, leaving all fields blank to request all data at once:

sotu <- query_documents(index = "state_of_the_union", queries = NULL, fields = NULL)

Retrieved 232 results in 1 pages

str(sotu)

tibble [232 × 7] (S3: tbl_df/tbl/data.frame)
 $ .id      : chr [1:232] "tMmcK4YBzjTKB-2w5n6J" "tcmcK4YBzjTKB-2w5n6J" "tsmcK4YBzjTKB-2w5n6J" "t8mcK4YBzjTKB-2w5n6J" ...
 $ title    : chr [1:232] "1790: George Washington" "1790: George Washington" "1791: George Washington" "1792: George Washington" ...
 $ text     : chr [1:232] "Fellow-Citizens of the Senate and House of Representatives:  \nI embrace with great satisfaction the opportunit"| __truncated__ "Fellow-Citizens of the Senate and House of Representatives:  \nIn meeting you again I feel much satisfaction in"| __truncated__ "Fellow-Citizens of the Senate and House of Representatives:  \n\"In vain may we expect peace with the Indians o"| __truncated__ "Fellow-Citizens of the Senate and House of Representatives:  \nIt is some abatement of the satisfaction with wh"| __truncated__ ...
 $ date     : POSIXct[1:232], format: "1790-01-08" "1790-12-08" ...
 $ president: chr [1:232] "George Washington" "George Washington" "George Washington" "George Washington" ...
 $ year     : num [1:232] 1790 1790 1791 1792 1793 ...
 $ party    : chr [1:232] "N/A" "N/A" "N/A" "N/A" ...

sotu = list(amcat.query("state_of_the_union", fields=None))
print(len(sotu))

for k, v in sotu[1].items():
  print(k + "(" + str(type(v)) + "): " + str(v)[0:100] + "...")

_id(<class 'str'>): tcmcK4YBzjTKB-2w5n6J...
title(<class 'str'>): 1790: George Washington...
text(<class 'str'>): Fellow-Citizens of the Senate and House of Representatives:  
In meeting you again I feel much satis...
date(<class 'datetime.datetime'>): 1790-12-08 00:00:00...
president(<class 'str'>): George Washington...
year(<class 'float'>): 1790.0...
party(<class 'str'>): N/A...

To avoid clogging the output, we save it into a file and display only the beginning:

curl -s http://localhost/amcat/index/state_of_the_union/documents > sotu.json
# show the first few characters only
head -c 150 sotu.json

{"results":[{"_id":"tMmcK4YBzjTKB-2w5n6J","title":"1790: George Washington","text":"Fellow-Citizens of the Senate and House of Representatives:  \nI e
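The saved response can also be inspected with a few lines of Python instead of head. This is a sketch that assumes only what the output above shows: a top-level "results" list whose entries carry "_id", "title" and "text". The sample string is hand-written in that shape with the text shortened, and the helper name summarise is hypothetical.

```python
import json

# Hand-written sample in the shape shown above; the text is shortened.
raw = (
    '{"results": [{"_id": "tMmcK4YBzjTKB-2w5n6J", '
    '"title": "1790: George Washington", '
    '"text": "Fellow-Citizens of the Senate and House of Representatives:"}]}'
)


def summarise(response_text: str, width: int = 30) -> list[str]:
    """Return one short line per document instead of the full text."""
    documents = json.loads(response_text)["results"]
    return [f'{doc["_id"]}: {doc["title"][:width]}' for doc in documents]


lines = summarise(raw)
for line in lines:
    print(line)
```

To run it on the real data, read the contents of sotu.json and pass them to summarise in place of the sample string.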

Knowing now what a document should look like in this index, we can upload a new document to get familiar with the process:

new_doc <- data.frame(
  title = "test",
  text = "test",
  date = as.Date("2022-01-01"),
  president = "test",
  year = "2022",
  party = "test",
  url = "test"
)
upload_documents(index = "state_of_the_union", new_doc)

from datetime import datetime
new_doc = {
  "title": "test",
  "text": "test",
  "date": datetime.strptime("2022-01-01", '%Y-%m-%d'),
  "president": "test",
  "year": "2022",
  "party": "test",
  "url": "test"
}
amcat.upload_documents("state_of_the_union", [new_doc])

curl -s -X POST http://localhost/amcat/index/state_of_the_union/documents \
  -H "Content-Type: application/json" \
  -d '{
         "documents":[
            {
               "title":"test",
               "text":"test",
               "date":"2022-01-01",
               "president":"test",
               "year":"2022",
               "party":"test",
               "url":"test"
            }
         ]
      }'
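If you build your own client, the request body from the cURL call can be generated rather than typed by hand. The sketch below wraps a list of documents in the "documents" key seen above and writes dates as ISO strings; the helper names encode_dates and build_upload_body are made up for this illustration and are not part of amcat4py.

```python
import json
from datetime import date


def encode_dates(value):
    """Fallback for json.dumps: write dates (and datetimes) as ISO 8601 strings."""
    if isinstance(value, date):  # datetime is a subclass of date
        return value.isoformat()
    raise TypeError(f"Cannot serialise {type(value).__name__}")


def build_upload_body(documents):
    """Wrap documents in the {"documents": [...]} envelope used by the cURL call."""
    return json.dumps({"documents": documents}, default=encode_dates)


new_doc = {
    "title": "test",
    "text": "test",
    "date": date(2022, 1, 1),
    "president": "test",
    "year": "2022",
    "party": "test",
    "url": "test",
}
body = build_upload_body([new_doc])
print(body)
```

A plain date is used here so that the serialised value matches the "2022-01-01" string in the cURL example; a datetime would be written with a time component.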

Let’s see if the new document is in the index:

query_documents(index = "state_of_the_union", fields = NULL, filters = list(title = "test"))

Retrieved 1 results in 1 pages

# A tibble: 1 × 8
  .id                  title date                text  url   party presi…¹  year
  <chr>                <chr> <dttm>              <chr> <chr> <chr> <chr>   <dbl>
1 nMmfK4YBzjTKB-2wn39U test  2022-01-01 00:00:00 test  test  test  test     2022
# … with abbreviated variable name ¹president

import pprint
pp = pprint.PrettyPrinter(depth=4)
res=list(amcat.query("state_of_the_union", fields=None, filters={"title": "test"}))
pp.pprint(res)

[{'_id': 'nMmfK4YBzjTKB-2wn39U',
  'date': datetime.datetime(2022, 1, 1, 0, 0),
  'party': 'test',
  'president': 'test',
  'text': 'test',
  'title': 'test',
  'url': 'test',
  'year': 2022.0}]

curl -s -X POST http://localhost/amcat/index/state_of_the_union/query \
  -H 'Content-Type: application/json' \
  -d '{"filters":{"title":["test"]}}' | jq -r ".results"

[
  {
    "_id": "nMmfK4YBzjTKB-2wn39U",
    "title": "test",
    "date": "2022-01-01",
    "text": "test",
    "url": "test",
    "party": "test",
    "president": "test",
    "year": 2022
  }
]
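Note that the cURL call passes the filter value as a list (["test"]), while the R and Python clients accept a single value. A small sketch of how a custom client might normalise this is shown below; the helper name build_query_body is hypothetical, and only the "filters" key that appears in the cURL example is used.

```python
import json


def build_query_body(filters):
    """Build a query body, wrapping single filter values in a list."""
    normalised = {
        field: value if isinstance(value, list) else [value]
        for field, value in filters.items()
    }
    return json.dumps({"filters": normalised})


query_body = build_query_body({"title": "test"})
print(query_body)
```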

We will learn more about queries later on in the Writing a Query chapter.

Instead of adding whole documents, you can also change fields in an index. Fields are similar to columns in a table in Excel. However, you need to define the type of a field upon its creation and make sure that you later only add data which adheres to the specifications of the type (otherwise you will get an error). To learn more about the fields in the test index, you can use:

get_fields(index = "state_of_the_union")

# A tibble: 7 × 2
  name      type   
  <chr>     <chr>  
1 date      date   
2 party     keyword
3 president keyword
4 text      text   
5 title     text   
6 url       url    
7 year      double

amcat.get_fields("state_of_the_union")

{'date': {'name': 'date', 'type': 'date'}, 'party': {'name': 'party', 'type': 'keyword'}, 'president': {'name': 'president', 'type': 'keyword'}, 'text': {'name': 'text', 'type': 'text'}, 'title': {'name': 'title', 'type': 'text'}, 'url': {'name': 'url', 'type': 'url', 'meta': {'amcat4_type': 'url'}}, 'year': {'name': 'year', 'type': 'double'}}

curl -s http://localhost/amcat/index/state_of_the_union/fields

{"date":{"name":"date","type":"date"},"party":{"name":"party","type":"keyword"},"president":{"name":"president","type":"keyword"},"text":{"name":"text","type":"text"},"title":{"name":"title","type":"text"},"url":{"name":"url","type":"url","meta":{"amcat4_type":"url"}},"year":{"name":"year","type":"double"}}

You can see that there are five different types in this index: date, keyword, text, url and double. Keyword, text and url are all essentially the same type in R, namely character strings. The date needs to be a date or date-time object: as.Date creates a Date, as used in the upload example above, while query results return dates as POSIXct. Year should be a double, i.e., a numeric value or integer.
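To catch type problems before an upload, a document can be checked locally against the field list. The following sketch assumes a mapping from amcat field types to Python types that is our own guess, not something defined by amcat; the helper name type_problems is also made up. The check is deliberately stricter than the server, which accepted the string "2022" for year in the upload example above.

```python
from datetime import date

# Assumed mapping from amcat field types to acceptable Python types.
PYTHON_TYPES = {
    "date": (date,),        # datetime is a subclass of date
    "keyword": (str,),
    "text": (str,),
    "url": (str,),
    "double": (int, float),
}

# Field names and types as reported for state_of_the_union above.
FIELDS = {
    "date": "date", "party": "keyword", "president": "keyword",
    "text": "text", "title": "text", "url": "url", "year": "double",
}


def type_problems(document, fields):
    """List fields whose values do not match the declared field type."""
    problems = []
    for name, value in document.items():
        if name not in fields:
            problems.append(f"{name}: not defined in the index")
        elif not isinstance(value, PYTHON_TYPES[fields[name]]):
            problems.append(
                f"{name}: expected {fields[name]}, got {type(value).__name__}"
            )
    return problems


doc = {"title": "test", "date": date(2022, 1, 1), "year": "2022"}
problems = type_problems(doc, FIELDS)
print(problems)
```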

You can add new fields to the index, for example, if you want to add a keyword to the documents:

set_fields(index = "state_of_the_union", list(keyword = "keyword"))

amcat.set_fields("state_of_the_union", {"keyword":"keyword"})

curl -s -X POST http://localhost/amcat/index/state_of_the_union/fields \
  -H 'Content-Type: application/json' \
  -d '{"keyword":"keyword"}'

When you now query a document, however, you will not see this new field:

query_documents(index = "state_of_the_union", fields = NULL, filters = list(title = "test"))

Retrieved 1 results in 1 pages

# A tibble: 1 × 8
  .id                  title date                text  url   party presi…¹  year
  <chr>                <chr> <dttm>              <chr> <chr> <chr> <chr>   <dbl>
1 nMmfK4YBzjTKB-2wn39U test  2022-01-01 00:00:00 test  test  test  test     2022
# … with abbreviated variable name ¹president

res = list(amcat.query("state_of_the_union", fields=None, filters={"title": "test"}))
pp.pprint(res)

[{'_id': 'nMmfK4YBzjTKB-2wn39U',
  'date': datetime.datetime(2022, 1, 1, 0, 0),
  'party': 'test',
  'president': 'test',
  'text': 'test',
  'title': 'test',
  'url': 'test',
  'year': 2022.0}]

curl -s -X POST http://localhost/amcat/index/state_of_the_union/query \
  -H 'Content-Type: application/json' \
  -d '{"filters":{"title":["test"]}}' | jq -r ".results"

[
  {
    "_id": "nMmfK4YBzjTKB-2wn39U",
    "title": "test",
    "date": "2022-01-01",
    "text": "test",
    "url": "test",
    "party": "test",
    "president": "test",
    "year": 2022
  }
]

This is because the field is empty for this document, just as the url field does not appear in the original documents of this index, none of which has a url. We can add something to the new field and see if it shows up:

update_tags(index = "state_of_the_union", 
            action = "add", 
            field = "keyword", 
            tag = "test", 
            filters = list(title = "test"))

query_documents(index = "state_of_the_union", 
                fields = c("title", "keyword"),
                filters = list(title = "test"))

Retrieved 1 results in 1 pages

# A tibble: 1 × 3
  .id                  title keyword  
  <chr>                <chr> <list>   
1 nMmfK4YBzjTKB-2wn39U test  <chr [1]>

test_doc = list(amcat.query("state_of_the_union", fields=["id"], filters={"title": "test"}))[0]
amcat.update_document("state_of_the_union", doc_id=test_doc["_id"], body={"keyword": "test"})

res=list(amcat.query("state_of_the_union", fields=["title", "keyword"], filters={"title": "test"}))
pp.pprint(res)

[{'_id': 'nMmfK4YBzjTKB-2wn39U', 'keyword': ['test'], 'title': 'test'}]

test_doc=$(curl -s -X POST http://localhost/amcat/index/state_of_the_union/query \
  -H 'Content-Type: application/json' \
  -d '{"filters":{"title":["test"]}}' | jq -r ".results[]._id")
curl -s -X PUT http://localhost/amcat/index/state_of_the_union/documents/${test_doc} \
  -H 'Content-Type: application/json' \
  -d '{"keyword": "test"}'

curl -s -X POST http://localhost/amcat/index/state_of_the_union/query \
  -H 'Content-Type: application/json' \
  -d '{"filters":{"title":["test"]}}' | jq -r ".results"

[
  {
    "_id": "nMmfK4YBzjTKB-2wn39U",
    "date": "2022-01-01",
    "year": 2022,
    "text": "test",
    "title": "test",
    "keyword": [
      "test"
    ],
    "url": "test",
    "party": "test",
    "president": "test"
  }
]

Now that we have a better idea of what an index is and what it looks like, we can create a new one:

create_index(index = "new_index", guest_role = "admin")
list_indexes()

# A tibble: 2 × 1
  name              
  <chr>             
1 new_index         
2 state_of_the_union

get_fields(index = "new_index")

# A tibble: 7 × 2
  name      type 
  <chr>     <chr>
1 date      date 
2 party     text 
3 president text 
4 text      text 
5 title     text 
6 url       url  
7 year      text

amcat.create_index(index="new_index", guest_role="admin")

amcat.list_indices()

[{'name': 'new_index'}, {'name': 'state_of_the_union'}]

amcat.get_fields("new_index")

{'date': {'name': 'date', 'type': 'date'}, 'party': {'name': 'party', 'type': 'text'}, 'president': {'name': 'president', 'type': 'text'}, 'text': {'name': 'text', 'type': 'text'}, 'title': {'name': 'title', 'type': 'text'}, 'url': {'name': 'url', 'type': 'url', 'meta': {'amcat4_type': 'url'}}, 'year': {'name': 'year', 'type': 'text'}}

curl -s -X POST http://localhost/amcat/index/ \
  -H 'Content-Type: application/json' \
  -d '{
        "name": "new_index",
        "guest_role": "ADMIN"
      }'

curl -s http://localhost/amcat/index/
curl -s http://localhost/amcat/index/new_index/fields

[{"name":"new_index"},{"name":"state_of_the_union"}]{"date":{"name":"date","type":"date"},"party":{"name":"party","type":"text"},"president":{"name":"president","type":"text"},"text":{"name":"text","type":"text"},"title":{"name":"title","type":"text"},"url":{"name":"url","type":"url","meta":{"amcat4_type":"url"}},"year":{"name":"year","type":"text"}}

As you can see, the newly created index already contains fields. You could now manually define new fields to fit your data. Or you can simply start uploading data:

new_doc <- data.frame(
  title = "test",
  text = "test",
  date = as.Date("2022-01-01"),
  president = "test",
  year = "2022",
  party = "test",
  url = "test"
)
upload_documents(index = "new_index", new_doc)

get_fields(index = "new_index")

# A tibble: 7 × 2
  name      type 
  <chr>     <chr>
1 date      date 
2 party     text 
3 president text 
4 text      text 
5 title     text 
6 url       url  
7 year      text

new_doc = {
  "title": "test",
  "text": "test",
  "date": datetime.strptime("2022-01-01", '%Y-%m-%d'),
  "president": "test",
  "year": "2022",
  "party": "test",
  "url": "test"
}
amcat.upload_documents("new_index", [new_doc])

amcat.get_fields("new_index")

{'date': {'name': 'date', 'type': 'date'}, 'party': {'name': 'party', 'type': 'text'}, 'president': {'name': 'president', 'type': 'text'}, 'text': {'name': 'text', 'type': 'text'}, 'title': {'name': 'title', 'type': 'text'}, 'url': {'name': 'url', 'type': 'url', 'meta': {'amcat4_type': 'url'}}, 'year': {'name': 'year', 'type': 'text'}}

curl -s -X POST http://localhost/amcat/index/new_index/documents \
  -H "Content-Type: application/json" \
  -d '{
         "documents":[
            {
               "title":"test",
               "text":"test",
               "date":"2022-01-01",
               "president":"test",
               "year":"2022",
               "party":"test",
               "url":"test"
            }
         ]
      }'

curl -s http://localhost/amcat/index/new_index/fields

{"date":{"name":"date","type":"date"},"party":{"name":"party","type":"text"},"president":{"name":"president","type":"text"},"text":{"name":"text","type":"text"},"title":{"name":"title","type":"text"},"url":{"name":"url","type":"url","meta":{"amcat4_type":"url"}},"year":{"name":"year","type":"text"}}

amcat4 guesses the types of fields based on the data. You can see here that this might not be the best option if you care about data types: party and president have been created as text, when they should be keywords; year, which we uploaded as a character string, is now a text type instead of a double.
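The difference between the guessed and the intended types can be made explicit with a short comparison. The sketch below copies the types from the two get_fields outputs shown earlier (new_index and state_of_the_union); the helper name type_mismatches is made up for this illustration.

```python
# Field types reported for new_index above.
guessed = {
    "date": "date", "party": "text", "president": "text",
    "text": "text", "title": "text", "url": "url", "year": "text",
}

# Field types of state_of_the_union, which we treat as the intended ones.
intended = {
    "date": "date", "party": "keyword", "president": "keyword",
    "text": "text", "title": "text", "url": "url", "year": "double",
}


def type_mismatches(guessed, intended):
    """Map each differing field to a (guessed, intended) pair of types."""
    return {
        name: (guessed.get(name), wanted)
        for name, wanted in intended.items()
        if guessed.get(name) != wanted
    }


mismatches = type_mismatches(guessed, intended)
for name, (got, wanted) in mismatches.items():
    print(f"{name}: guessed {got}, intended {wanted}")
```

One way to avoid the guessing would be to define the intended fields with set_fields right after creating the index and before uploading any data; whether the type of an already existing field can be changed afterwards is not covered here.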

Finally, we can also delete an index:

delete_index(index = "new_index")

amcat.delete_index("new_index")

curl -s -X DELETE http://localhost/amcat/index/new_index

3.2.1 A Note on the ID Field and Duplicated Documents

amcat indexes can have all kinds of fields, yet one special field must be present in every document of every index: a unique ID. Usually you do not need to take care of this ID yourself. This changes with the special case of duplicated documents, that is, documents with exactly the same information in the same fields. amcat does not normally check whether your documents are duplicated when you upload them. However, when no ID is present in an uploaded document, as in the example we uploaded above, amcat will construct a unique ID from the available data of the document. Let us have another look at the document we titled “test”:

query_documents(index = "state_of_the_union", fields = NULL, filters = list(title = "test"))

Retrieved 1 results in 1 pages

# A tibble: 1 × 9
  .id          date                 year text  title keyword url   party presi…¹
  <chr>        <dttm>              <dbl> <chr> <chr> <list>  <chr> <chr> <chr>  
1 nMmfK4YBzjT… 2022-01-01 00:00:00  2022 test  test  <chr>   test  test  test   
# … with abbreviated variable name ¹president

res=list(amcat.query("state_of_the_union", fields=None, filters={"title": "test"}))
pp.pprint(res)

[{'_id': 'nMmfK4YBzjTKB-2wn39U',
  'date': datetime.datetime(2022, 1, 1, 0, 0),
  'keyword': ['test'],
  'party': 'test',
  'president': 'test',
  'text': 'test',
  'title': 'test',
  'url': 'test',
  'year': 2022.0}]

curl -s -X POST http://localhost/amcat/index/state_of_the_union/query \
  -H 'Content-Type: application/json' \
  -d '{"filters":{"title":["test"]}}' | jq -r ".results"

[
  {
    "_id": "nMmfK4YBzjTKB-2wn39U",
    "date": "2022-01-01",
    "year": 2022,
    "text": "test",
    "title": "test",
    "keyword": [
      "test"
    ],
    "url": "test",
    "party": "test",
    "president": "test"
  }
]

Note that amcat has automatically added an _id field (in R it is .id due to naming conventions) to the document. If we uploaded the same document again, the algorithm that constructs the _id field would come up with the same value, and the existing document would be replaced by the newly uploaded one. If we wanted to keep a duplicate for some reason, we could either change at least one value in a field or assign an ID manually:

new_doc <- data.frame(
  .id = "1",
  title = "test",
  text = "test",
  date = as.Date("2022-01-01"),
  president = "test",
  year = "2022",
  party = "test",
  url = "test"
)
upload_documents(index = "state_of_the_union", new_doc)

query_documents(index = "state_of_the_union", fields = NULL, filters = list(title = "test"))

Retrieved 2 results in 1 pages

# A tibble: 2 × 9
  .id          date                 year text  title keyword url   party presi…¹
  <chr>        <dttm>              <dbl> <chr> <chr> <list>  <chr> <chr> <chr>  
1 nMmfK4YBzjT… 2022-01-01 00:00:00  2022 test  test  <chr>   test  test  test   
2 1            2022-01-01 00:00:00  2022 test  test  <NULL>  test  test  test   
# … with abbreviated variable name ¹president

from datetime import datetime
new_doc = {
  "_id": "1",
  "title": "test",
  "text": "test",
  "date": datetime.strptime("2022-01-01", '%Y-%m-%d'),
  "president": "test",
  "year": "2022",
  "party": "test",
  "url": "test"
}
amcat.upload_documents("state_of_the_union", [new_doc])

res=list(amcat.query("state_of_the_union", fields=None, filters={"title": "test"}))
pp.pprint(res)

[{'_id': 'nMmfK4YBzjTKB-2wn39U',
  'date': datetime.datetime(2022, 1, 1, 0, 0),
  'keyword': ['test'],
  'party': 'test',
  'president': 'test',
  'text': 'test',
  'title': 'test',
  'url': 'test',
  'year': 2022.0},
 {'_id': '1',
  'date': datetime.datetime(2022, 1, 1, 0, 0),
  'party': 'test',
  'president': 'test',
  'text': 'test',
  'title': 'test',
  'url': 'test',
  'year': 2022.0}]

curl -s -X POST http://localhost/amcat/index/state_of_the_union/documents \
  -H "Content-Type: application/json" \
  -d '{
        "documents":[
          {
            "_id": "1",
            "title":"test",
            "text":"test",
            "date":"2022-01-01",
            "president":"test",
            "year":"2022",
            "party":"test",
            "url":"test"
          }
        ]
      }'

curl -s -X POST http://localhost/amcat/index/state_of_the_union/query \
  -H 'Content-Type: application/json' \
  -d '{"filters":{"title":["test"]}}' | jq -r ".results"

[
  {
    "_id": "nMmfK4YBzjTKB-2wn39U",
    "date": "2022-01-01",
    "year": 2022,
    "text": "test",
    "title": "test",
    "keyword": [
      "test"
    ],
    "url": "test",
    "party": "test",
    "president": "test"
  },
  {
    "_id": "1",
    "title": "test",
    "date": "2022-01-01",
    "text": "test",
    "url": "test",
    "party": "test",
    "president": "test",
    "year": 2022
  }
]

As you can see, we now have the example document in the index twice, although with different IDs. To simulate what would have happened without a manual ID, that is, if the ID had been constructed automatically, we can upload different data with the same ID and see what changes:

new_doc <- data.frame(
  .id = "1",
  title = "test",
  text = "A second test",
  date = as.Date("2022-01-02"),
  president = "test",
  year = "2022"
)
upload_documents(index = "state_of_the_union", new_doc)

query_documents(index = "state_of_the_union", fields = NULL, filters = list(title = "test"))

Retrieved 2 results in 1 pages

# A tibble: 2 × 9
  .id          date                 year text  title keyword url   party presi…¹
  <chr>        <dttm>              <dbl> <chr> <chr> <list>  <chr> <chr> <chr>  
1 nMmfK4YBzjT… 2022-01-01 00:00:00  2022 test  test  <chr>   test  test  test   
2 1            2022-01-02 00:00:00  2022 A se… test  <NULL>  <NA>  <NA>  test   
# … with abbreviated variable name ¹president

from datetime import datetime
new_doc = {
  "_id": "1",
  "title": "test",
  "text": "A second test",
  "date": datetime.strptime("2022-01-02", '%Y-%m-%d'),
  "president": "test",
  "year": "2022"
}
amcat.upload_documents("state_of_the_union", [new_doc])

res=list(amcat.query("state_of_the_union", fields=None, filters={"title": "test"}))
pp.pprint(res)

[{'_id': 'nMmfK4YBzjTKB-2wn39U',
  'date': datetime.datetime(2022, 1, 1, 0, 0),
  'keyword': ['test'],
  'party': 'test',
  'president': 'test',
  'text': 'test',
  'title': 'test',
  'url': 'test',
  'year': 2022.0},
 {'_id': '1',
  'date': datetime.datetime(2022, 1, 2, 0, 0),
  'president': 'test',
  'text': 'A second test',
  'title': 'test',
  'year': 2022.0}]

curl -s -X POST http://localhost/amcat/index/state_of_the_union/documents \
  -H "Content-Type: application/json" \
  -d '{
        "documents":[
          {
            "_id": "1",
            "title":"test",
            "text":"A second test",
            "date":"2022-01-02",
            "president":"test",
            "year":"2022"
          }
        ]
      }'

curl -s -X POST http://localhost/amcat/index/state_of_the_union/query \
  -H 'Content-Type: application/json' \
  -d '{"filters":{"title":["test"]}}' | jq -r ".results"

[
  {
    "_id": "nMmfK4YBzjTKB-2wn39U",
    "date": "2022-01-01",
    "year": 2022,
    "text": "test",
    "title": "test",
    "keyword": [
      "test"
    ],
    "url": "test",
    "party": "test",
    "president": "test"
  },
  {
    "_id": "1",
    "title": "test",
    "date": "2022-01-02",
    "text": "A second test",
    "president": "test",
    "year": 2022
  }
]

The document with the ID 1 has been replaced with the new data. This is the normal behaviour of amcat: when we upload data for a document that is already present, as identified by its ID, the existing document is replaced. If a field was present in the old document but not in the new data, this field will be empty afterwards.
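The behaviour described in this section can be mimicked with a toy model in plain Python. This is only an illustration of the semantics and not how amcat is implemented: the index is a dictionary keyed by ID, and the content-derived ID uses a SHA-256 hash as a stand-in, since the algorithm amcat actually uses is not described here. The function names content_id and upload are made up for this sketch.

```python
import hashlib
import json


def content_id(fields):
    """Derive a deterministic ID from a document's content (illustrative only)."""
    canonical = json.dumps(fields, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:20]


def upload(index, documents):
    """Store documents keyed by ID; an existing ID is replaced wholesale."""
    for document in documents:
        fields = {k: v for k, v in document.items() if k != "_id"}
        doc_id = document.get("_id") or content_id(fields)
        index[doc_id] = fields


index = {}
first = {"title": "test", "text": "test", "party": "test"}

upload(index, [first])
upload(index, [first])                   # same content, same ID: still one document
upload(index, [{"_id": "1", **first}])   # manual ID: kept as a second document
upload(index, [{"_id": "1", "title": "test", "text": "A second test"}])

print(len(index))    # 2
print(index["1"])    # {'title': 'test', 'text': 'A second test'}
```

After the last upload, the document with the ID 1 no longer has a party field, mirroring the missing values in the query output above.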